What Is Tokenization? (And Why It Matters for Everyday AI)
If you’re playing with AI and LLMs, you’ve probably seen this viral post somewhere:
Try asking this question to ChatGPT: “How many ‘r’s are in the word ‘strawberry’?” You’ll see ChatGPT respond with “2” although there are 3 ‘r’s in the word.
Ever wondered why? I did too, and what I discovered changed how I think about every AI interaction.
This isn’t about AI being bad at counting. It’s about something more fundamental that explains a lot of those weird behaviors you’ve probably noticed but couldn’t quite put your finger on.
Before diving into the strawberry mystery, let’s talk about the “context window”: the memory limit of large language models. Imagine the AI’s brain as a glass: it can only hold so much water at once. In the same way, an LLM can only “remember” a certain amount of information at a time, measured not in words or sentences but in tokens, the small pieces that make up our input and the AI’s responses.
If you try to pour in too much, the glass overflows and the earliest drops spill out. For AI, this means that if your conversation or document is too long, the model loses track of what came first; it simply can’t fit everything into its working memory. That’s why understanding what fits in this context window is so important for getting the best results from AI.
You can read more about the context window in this post.
What I Wish I’d Understood About Tokenization Earlier
Tokenization is how AI breaks down your text before it can understand anything you’re saying. Instead of reading full sentences like humans do, AI needs to slice everything into smaller pieces called tokens.
But here’s what took me way too long to internalize: this isn’t just a preprocessing step. This fundamentally shapes every interaction you have with AI.
Think of it like packing a lunchbox. You can’t just throw in whole meals; everything needs to be cut into bite-sized pieces first. But once you’ve chopped everything up, you can never see the original whole sandwich or apple again, only the individual chopped pieces.
What are tokens? They can be:
- Individual words (“hello”)
- Parts of words (“un-” or “-ing”)
- Single characters
- Punctuation marks
- Spaces
Why Can’t AI Just Read Normal Text?
Machines don’t understand words the way humans do. LLMs are fundamentally number-crunching systems built on neural networks: they operate on numerical data, not raw text.
Here’s what clicked for me: every conversation you have with AI is actually a conversation with a very sophisticated calculator.
The process works like this:
- Your input: “Chatbots are helpful.”
- Tokenization: Breaks it into [“Chatbots”, “are”, “helpful”, “.”]
- Numerical mapping: Each token gets a unique number ID
- AI processing: The model works with these numbers
- Response generation: AI outputs number sequences
- Decoding: Numbers get converted back to readable text
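Here’s a minimal sketch of that pipeline you can run yourself, using OpenAI’s open-source tiktoken tokenizer (an assumption on my part: any tokenizer library will do, and the exact token IDs depend on which encoding you load):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

# Load an encoding; "cl100k_base" is used by several OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Chatbots are helpful."

# Tokenization + numerical mapping: text -> token IDs
token_ids = enc.encode(text)
print(token_ids)

# What each ID stands for (notice that spaces get attached to tokens)
print([enc.decode([t]) for t in token_ids])

# Decoding: token IDs -> readable text again
print(enc.decode(token_ids))
```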
This number conversion isn’t just a technical detail — it’s why AI behaves the way it does.
How Modern Tokenization Actually Works
Most current LLMs use subword tokenization, which is smarter than just splitting by spaces. Instead of treating every word as one token, it breaks things down further:
- “unhappiness” becomes [“un”, “_happi”, “_ness”]
- “tokenization” becomes [“token”, “_ization”]
- “internationalization” becomes [“inter”, “_national”, “_ization”]
This helps AI handle rare words and different languages more efficiently by recognizing familiar word parts.
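You can inspect these subword splits directly. A quick sketch with tiktoken (the exact pieces vary from tokenizer to tokenizer, so don’t expect the splits above verbatim):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["unhappiness", "tokenization", "internationalization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in ids]
    # Long or rare words break into several familiar fragments
    print(f"{word!r} -> {pieces} ({len(ids)} tokens)")
```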
Now, you might be thinking, “Okay, that’s interesting, but why should I care about the technical details?” Here’s the thing: once you understand how tokenization actually works, all those weird AI behaviors you’ve probably noticed start making perfect sense.
Why This Matters for Your Daily AI Use
Understanding tokenization isn’t just academic; it directly impacts your experience with AI tools.
Prompt Limits and Cut-off Responses
Every LLM has a token limit that covers both your input and the AI’s response. Hit this limit, and:
- Responses get truncated mid-sentence
- Earlier conversation context gets lost
- The AI might produce errors or inconsistencies
You know that frustrating moment when ChatGPT suddenly stops mid-sentence? It’s not being dramatic; it literally can’t fit any more tokens in its context window.
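A practical habit that helps: count your tokens before you send. Here’s a rough sketch; the 4,096-token limit and the 1,000 tokens reserved for the reply are assumed example numbers, not any particular model’s real figures:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_LIMIT = 4096        # assumed example limit; check your model's documentation
RESERVED_FOR_REPLY = 1000   # leave room for the response

def fits(prompt: str) -> bool:
    """Return True if the prompt leaves enough room for a full reply."""
    used = len(enc.encode(prompt))
    print(f"Prompt uses {used} of {CONTEXT_LIMIT} tokens")
    return used + RESERVED_FOR_REPLY <= CONTEXT_LIMIT

fits("Summarize the following report in three bullet points: ...")
```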
Token Counting Isn’t Intuitive
Here’s something that caught me off guard: not all text is created equal in token terms.
You might think “artificial intelligence” and “AI stuff” would use roughly the same number of tokens, since both are just two words. But “artificial intelligence” can get split into [“artific”, “ial”, “intellig”, “ence”] (4 tokens), while “AI stuff” is just [“AI”, “stuff”] (2 tokens).
This matters because:
- On average, 1 English word ≈ 1.3 tokens, but technical terms can cost considerably more
- That fancy vocabulary you’re using? It’s costing you more tokens than simple words
- Every emoji 🤖 might be 2-3 tokens, so your friendly prompts are more expensive than you think
- Even spaces and punctuation marks count as separate tokens
So when you hit a token limit, it’s not necessarily because your prompt was too long—it might be because you used too many uncommon words.
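You don’t have to take these numbers on faith; comparing phrasings takes a few lines (a sketch with tiktoken; counts differ slightly between tokenizers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Compare how different wordings of similar ideas tokenize
for phrase in ["artificial intelligence", "AI stuff", "Thanks so much! 🤖"]:
    print(f"{phrase!r}: {len(enc.encode(phrase))} tokens")
```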
Why Your Non-English Prompts Feel More Expensive
If you’ve ever noticed AI seems to work better in English, tokenization is part of the reason why.
Let’s say you want to ask “How are you?” In English, that’s [“How”, “are”, “you”, “?”] - 4 tokens. But “তুমি কেমন আছ?” (tumi kemon acho?) in Bengali becomes a mess of individual characters and syllables - often 8-12 tokens for the same simple question.
This isn’t just about cost:
- Your Bengali prompts hit token limits faster
- The AI has less “room” to give detailed responses in Bengali
- Non-English conversations feel more constrained because they literally are
If you’re multilingual, you’ve probably noticed ChatGPT feels different in different languages. Now you know part of why.
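Here’s a small sketch to see the gap for yourself (exact counts depend on the tokenizer, but non-Latin scripts consistently come out higher):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "How are you?"
bengali = "তুমি কেমন আছ?"   # the same question in Bengali

print("English:", len(enc.encode(english)), "tokens")
print("Bengali:", len(enc.encode(bengali)), "tokens")
```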
The Strawberry Mystery Solved
Remember that strawberry counting problem? Here’s what’s actually happening—and it’s kind of brilliant once you see it.
When you type “strawberry,” the AI doesn’t see s-t-r-a-w-b-e-r-r-y. Instead, it sees three separate chunks: “str” + “aw” + “berry”. To the AI, these aren’t even letters anymore; they’re just number codes: [496, 675, 15717].
So when you ask “how many r’s are in strawberry,” the AI is essentially thinking:
“Let me count the r’s in number 496… okay, one r. Now in number 675… no r’s here. Now in number 15717… one r. Total: two r’s.”
It’s not that the AI is bad at counting. It’s that from its perspective, it’s never actually seeing the word “strawberry” as a complete sequence of letters. It’s like asking someone to count the number of times the letter ‘e’ appears in a sentence, but you’ve handed them a bunch of puzzle pieces instead of the full sentence.
This blew my mind when I realized it explains so many other weird AI behaviors.
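If you want to check the strawberry example yourself, a few lines show exactly which chunks the model receives (the split and the IDs depend on the tokenizer, but with tiktoken’s cl100k_base encoding you should see something very close to the chunks above):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
print(ids)                              # the number codes the model actually sees
print([enc.decode([t]) for t in ids])   # the chunks those codes stand for
# The model receives these IDs as opaque units; it never gets the
# letter-by-letter sequence s-t-r-a-w-b-e-r-r-y to count over.
```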
When This Knowledge Actually Saves You Time and Money
Understanding tokenization isn’t just interesting; it solves real problems you’re probably having right now.
That Moment When Your API Bill Spikes
If you’re using AI APIs, you pay per token. I learned this the hard way when my bill jumped 40% one month. Turns out, I’d started using more technical jargon in my prompts. “Utilize sophisticated algorithms” costs 6 tokens, while “use smart code” costs 3 tokens for basically the same meaning.
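If you’re watching an API bill, a token-based cost estimate is easy to sketch. The price below is a made-up placeholder, not any provider’s real rate; plug in your actual pricing:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

PRICE_PER_1K_INPUT_TOKENS = 0.01   # placeholder price, NOT a real rate

def estimate_cost(prompt: str) -> float:
    """Rough input-side cost estimate based on token count."""
    return len(enc.encode(prompt)) / 1000 * PRICE_PER_1K_INPUT_TOKENS

for phrase in ["Utilize sophisticated algorithms", "use smart code"]:
    tokens = len(enc.encode(phrase))
    print(f"{phrase!r}: {tokens} tokens, ~${estimate_cost(phrase):.6f}")
```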
Why Your Long Conversations Get Weird
Ever notice how ChatGPT starts giving inconsistent answers in really long conversations? You’re hitting the context window limit. The AI literally can’t remember what you talked about 20 messages ago because those tokens got pushed out to make room for new ones.
The Frustrating Mid-Sentence Cutoff
That moment when the AI stops responding mid-thought? It hit its token limit. Understanding this helped me structure my prompts to leave more room for complete responses.
Why Some Prompts Just Work Better
Better understanding of how AI processes text leads to more predictable outcomes. Instead of wondering why one prompt works and another doesn’t, you can start to see the patterns.
Privacy Considerations
While tokenization itself doesn’t create privacy risks, the content being tokenized might. Personal information in training data or prompts can potentially lead to privacy breaches; the risk comes from the sensitive content, not the tokenization process itself.
The Bottom Line
Every interaction with an LLM starts and ends with tokenization. Your text gets broken into tokens, converted to numbers, processed by the AI, then decoded back into readable text.
Think of it like packing a lunchbox for AI. You’ve got a big sandwich, grapes on stems, and whole carrots. You can’t just throw them in: you need to slice the sandwich, separate the grapes, and cut the carrots into sticks.
Tokenization works the same way. Your text is the whole meal, the tokenizer is your knife, and tokens are the bite-sized pieces that fit into the AI’s “lunchbox”. The AI needs these discrete, numbered pieces to understand and process your input, and it can never reconstruct the original whole from these pieces.
Tokenization-Aware Prompting: Practical Tips
Understanding tokenization can make you a more effective prompt writer. Here are key considerations:
Case Sensitivity Matters More Than You Think
Here’s something that bit me while writing prompts: “Swift”, “swift”, and “SWIFT” are completely different tokens to the AI.
I was writing prompts like “Use SWIFT to solve this” for emphasis, not realizing the AI had to work harder to connect “SWIFT” with all its training about “Swift” programming. Plus, “SWIFT” could just as easily refer to the banking protocol, adding another layer of confusion. My prompts were less effective because I was accidentally using unfamiliar tokens.
What I do now: Stick to standard capitalization unless I specifically need emphasis. “Use Swift to solve this” works better than “Use SWIFT to solve this” because the AI has seen that exact token pattern thousands of times in its training.
The AI isn’t reading for meaning the way you are; it’s pattern-matching tokens it has seen before.
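You can see the case effect directly; each casing maps to different token IDs (a sketch with tiktoken; the specific IDs vary by encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for variant in ["Swift", "swift", "SWIFT"]:
    ids = enc.encode(variant)
    # Different capitalization -> different token IDs (and sometimes more tokens)
    print(f"{variant!r}: ids={ids}, count={len(ids)}")
```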
Why Grammar Actually Affects AI Performance
I used to be pretty casual about typos in my prompts - I figured the AI was smart enough to work out what I meant. That’s mostly true, but I started noticing my quick, typo-filled prompts were sometimes getting oddly inconsistent responses.
When you write “Can you help me optimize this algoritm?”, the AI sees [“Can”, “you”, “help”, “me”, “optimize”, “this”, “algoritm”, “?”]. That misspelled “algoritm” token is unfamiliar — the AI has seen “algorithm” thousands of times, but “algoritm” barely exists in its training.
The difference is measurable: Well-spelled, grammatically correct prompts get more accurate responses because they use token patterns the AI recognizes. It’s not about being formal; it’s about speaking the AI’s language more fluently.
This is why that perfectly clear prompt you typed quickly on your phone sometimes gets weird responses. Those typos aren’t just cosmetic; they’re creating token confusion.
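A quick way to see what a typo does is to compare how the two spellings tokenize (a sketch; the exact splits depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["algorithm", "algoritm"]:
    ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in ids]
    # Compare how many pieces each spelling produces and what they look like
    print(f"{word!r} -> {pieces} ({len(ids)} tokens)")
```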
Format Choice Affects Token Efficiency
Different data formats consume vastly different numbers of tokens:
In one comparison, converting the same data to different formats gave: JSON 13,869 tokens, YAML 12,333 tokens, and Markdown only 11,612 tokens, roughly 16% fewer than JSON.
Type definitions use 60% fewer tokens than JSON schemas with no loss of information. For example, instead of complex JSON Schema syntax, simple type definitions like { "name": string, "age": int } are much more token-efficient.
Best practice: For structured output, prefer YAML or Markdown over JSON when possible. Use simple type definitions instead of verbose JSON schemas.
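Here’s a toy sketch of the idea: the same little record expressed three ways, then counted. The absolute numbers for a tiny record won’t match the figures above, and the savings compound with larger payloads:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record_as_json = '{"name": "Ada", "age": 36, "role": "engineer"}'
record_as_yaml = "name: Ada\nage: 36\nrole: engineer"
record_as_markdown = "- name: Ada\n- age: 36\n- role: engineer"

for label, text in [("JSON", record_as_json),
                    ("YAML", record_as_yaml),
                    ("Markdown", record_as_markdown)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```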
Special Characters and Symbols
Special characters, punctuation, and symbols can be tokenized unpredictably. Each punctuation mark might be its own token, and complex symbols might be split into multiple tokens. Even trailing spaces can create unexpected tokens — a space at the end of your prompt might affect how the AI interprets what comes next.
Some characters you use might not exist in the AI’s vocabulary at all. When this happens, they get converted to an “unknown” ([UNK]) token (think of it like the AI saying “I don’t recognize this symbol”), which means it loses the specific meaning you intended.
Best practice: Use standard punctuation and avoid unnecessary special characters. Watch out for extra spaces at the end of your prompts.
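You can check how punctuation and stray whitespace tokenize in a couple of lines (a sketch; behavior varies by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "hello ", "hello!!!", "hello…"]:
    ids = enc.encode(text)
    # A trailing space or a run of punctuation often adds extra tokens
    print(f"{text!r}: {len(ids)} tokens -> {[enc.decode([t]) for t in ids]}")
```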
Budget for Non-English Token Overhead
Non-English languages often require significantly more tokens for the same content. For example, “Cómo estás” in Spanish tokenizes into 5 tokens for just 10 characters. This leads to higher costs and potentially worse performance.
Best practice: When working with non-English content, budget for higher token usage and test thoroughly.
Understanding AI’s Special Vocabulary
AI models have special “control tokens” built into their vocabulary — like invisible punctuation marks that help organize conversations. These include tokens that signal “start of text,” “end of response,” or “unknown word.” You can’t see these in normal use, but they’re working behind the scenes to help the AI understand structure and context.
This is why AI sometimes behaves differently when you format text in certain ways: you might accidentally trigger patterns the AI associates with these special structural markers.
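You can even peek at how a tokenizer treats one of these markers. A sketch using tiktoken’s “<|endoftext|>” control token (the allowed_special/disallowed_special parameters are tiktoken-specific; other tokenizers expose special tokens differently):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

marker = "<|endoftext|>"

# Treated as a real control token: it collapses to a single reserved ID.
print(enc.encode(marker, allowed_special="all"))

# Treated as ordinary text: the same characters become several normal tokens.
print(enc.encode(marker, disallowed_special=()))
```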
Best practice: Stick to natural, conversational language. Avoid trying to mimic technical formatting unless you specifically need it.
Coherence Over Brevity
While shorter prompts use fewer tokens, overly compressed prompts can reduce response quality. Give the model room to think.
Best practice: Be concise but complete. Don’t sacrifice clarity to save a few tokens.
It all comes back to the same loop: your text becomes tokens, the tokens become numbers, the model works with those numbers, and the result is decoded back into readable text.
Understanding this process helps you:
- Write more effective prompts
- Anticipate when you’ll hit limits
- Understand why AI sometimes struggles with certain tasks
- Manage costs if you’re using AI APIs
- Get better results from AI tools
Tokenization is the invisible translator between human language and machine understanding. The better you understand it, the better you’ll work with AI.
Want to Dig Deeper?
If this sparked your curiosity about tokenization (like it did for me), here are the resources that helped me piece this together:
Start here if you’re new to this: Grammarly’s guide does a solid job explaining the basics without getting too technical.
For the technical details: Christopher’s deep dive is where I finally understood the subword tokenization mechanics that I glossed over here.
The case sensitivity thing that surprised me: Mark’s analysis has the data on how much capitalization actually affects AI performance.
Official documentation: OpenAI’s token guide is dry but accurate if you want the authoritative explanation.
Token optimization tricks: This BoundaryML post opened my eyes to how many tokens I was wasting with verbose prompting patterns.