
LLMs are now pretty mainstream. You’re probably using them every day in some form or another. They have also probably been your doorway to AI. If you’ve spent any time with LLMs like ChatGPT, at some point your experience probably went through the following emotional stages:

  1. Amazement: This is incredible. Life just got easier. It can literally do anything.
  2. Confusion: Wait, what’s happening? The responses are off, repetitive, or totally drifting.
  3. Frustration: This is trash. It forgets everything, goes off-topic, contradicts itself, or just hallucinates.
  4. Denial: AI is over-hyped. It’s not what they claim. It’s making things worse. Maybe Model X is better than Model Y.

Here’s the deal, though. Most of that frustration comes from not truly understanding how tokens and context windows work. This applies across the board, whether you’re using GPT-4, Claude, Gemini, or platforms like ChatGPT, Cursor, or Windsurf.

What’s a Token?

A token is the unit a model breaks your prompt into before processing it. According to OpenAI, one token is roughly four characters of English text, but it varies. Words, punctuation, spaces, and special characters all count. For example, “ChatGPT is great!” breaks down into six tokens: [“Chat”, “G”, “PT”, “_is”, “_great”, “!”], where “_” marks the leading space that attaches to a token. Sometimes even a period or a lone space can be its own token, depending on the tokenizer.
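You don’t have to take that on faith. OpenAI’s open-source tiktoken library exposes the same tokenizers its models use, so you can count tokens yourself. Here’s a minimal sketch; the exact splits depend on which model’s encoding you load:

```python
# pip install tiktoken
import tiktoken

# Load the tokenizer that a given OpenAI model uses.
enc = tiktoken.encoding_for_model("gpt-4")

text = "ChatGPT is great!"
token_ids = enc.encode(text)  # list of integer token IDs

print(len(token_ids), "tokens")
print([enc.decode([tid]) for tid in token_ids])  # the individual token strings
```

Other model families (Claude, Gemini, and so on) ship their own tokenizers, so the same text can cost a different number of tokens from model to model.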

So, Why Does This Matter?

Every model has a context window, its working memory. This isn’t its knowledge base; it’s the maximum number of tokens (input + output + system prompts) the model can “remember” at one time.

Think of it like a water glass. If the context window is a 100 ml glass and each token is a 1 ml drop of water, the glass can only hold 100 tokens. As you add tokens, you get closer to the limit. When you hit the ceiling, the model starts forgetting earlier parts of the conversation, loses coherence, or just starts hallucinating.

At the start, the glass is empty—that’s the “amazement” phase. As the conversation flows, tokens fill it up. Nearing the top, “confusion” creeps in. Once it’s practically full, you hit “frustration”: you’re repeating context, correcting drift, fighting hallucinations. When it overflows, you’re in “denial”, often without realizing the model simply hit its memory cap.
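To make the analogy concrete, here’s a toy sketch (not any vendor’s actual implementation) of a 100-token glass where the oldest turns get spilled once the conversation overflows. The per-turn token costs are made up for illustration:

```python
# Toy model of the water-glass analogy: a fixed token budget,
# with the oldest turns dropped once the total overflows it.

CONTEXT_WINDOW = 100  # the glass holds 100 tokens

history = []  # list of (speaker, token_cost) pairs

def add_turn(speaker: str, token_cost: int) -> None:
    history.append((speaker, token_cost))
    # Once the glass overflows, the earliest turns are "forgotten" first.
    while sum(cost for _, cost in history) > CONTEXT_WINDOW:
        forgotten = history.pop(0)
        print("forgot:", forgotten)

add_turn("user", 30)        # amazement: plenty of room
add_turn("assistant", 40)
add_turn("user", 25)        # confusion: the glass is nearly full
add_turn("assistant", 20)   # frustration: overflow, the first turn is gone
```

Real chat products do something smarter than blindly dropping the oldest turns, but the underlying constraint is the same: a fixed number of tokens, whatever is in them.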

A Few Technical Clarifications:

  • The context window is measured in tokens, not words or characters.
  • Your prompt (input) and the model’s response (output) both count towards the token limit.
  • System prompts (those instructions running in the background) also eat into the limit.
  • Most current LLM context windows range from 16k tokens (GPT-3.5 Turbo) to 128k (GPT-4 Turbo, GPT-4o) or 200k (Claude 3.7 Sonnet). Some, like Gemini 2.5, advertise 1 to 2 million tokens, but real-world limits can differ.
  • For specific applications or model variants (e.g., Cursor or GPT-4o), output is typically capped (e.g., at 4k tokens) even if the overall context window is much larger; see the sketch after this list.
  • When nearing the limit, models might try to compress or summarize earlier conversation, but this is always lossy. The more you pack that window, the higher the chance the model loses track, forgets context, or hallucinates.
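For example, with the OpenAI Python SDK (used here purely as an illustration; other providers expose an equivalent knob, and the snippet assumes OPENAI_API_KEY is set), the output cap is just a request parameter:

```python
# max_tokens caps only the response; the prompt and system message
# still consume their own share of the context window.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain context windows in one short paragraph."},
    ],
    max_tokens=256,  # hard ceiling on the reply length, in tokens
)
print(response.choices[0].message.content)
```

If the model wants to say more than the cap allows, the reply simply gets cut off, which can look a lot like the model losing the plot.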

Everything counts toward the token limit: your prompt, the model’s response, and those hidden system prompts. If your prompt’s too long, or the model gets too chatty, you’re gonna hit problems.
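Here’s a rough budgeting sketch using tiktoken again. The window size and output reserve are assumptions for illustration, and real chat APIs add a few tokens of per-message overhead on top:

```python
# Back-of-the-envelope token budgeting (the numbers are illustrative).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

CONTEXT_WINDOW = 128_000   # e.g. a GPT-4 Turbo class model
RESERVED_OUTPUT = 4_000    # leave room for the model's reply

system_prompt = "You are a helpful assistant."
user_prompt = "Summarize this report: ..."  # imagine something long pasted here

used = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
remaining = CONTEXT_WINDOW - RESERVED_OUTPUT - used

print(f"{used} tokens used, {remaining} left before the glass overflows")
if remaining < 0:
    print("Over budget: trim the prompt or summarize earlier context first.")
```

If you’re over budget, the fix is on your side of the glass: shorter prompts, summaries of earlier turns, or a fresh conversation.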

This is a core principle in prompt engineering. There’s way more to unpack, like system vs. user prompts, different prompt strategies, and temperature settings, which I’ll dig into in future posts.

With AI growing so fast and hyped-up ideas like the Model Context Protocol (MCP), RAG, and agent frameworks flying around, it’s easy to get swept up. But understanding the basics, like tokens and context, is absolutely essential.

LLMs are just one piece of the AI puzzle. They matter, for sure, but there’s a much bigger picture out there.
