Utilix knowledge base
Context Windows Explained — Tokens, Limits, and Long Prompts
Published May 3, 2026
Every large language model has a context window: the maximum number of tokens it can process in a single request. Understanding this limit prevents unexpected errors and helps you design applications that handle long conversations gracefully.
What Is a Context Window?
The context window is the model's working memory for a single request. It includes everything the model can "see" at once:
- System prompt
- Conversation history (all prior turns)
- Current user message
- The model's own generated response
The total of all these cannot exceed the context window limit.
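A rough way to check this before sending a request is to count tokens yourself. The sketch below assumes a recent release of the tiktoken library and OpenAI-style chat messages; the per-message overhead is an approximation, and exact accounting differs by model and provider.

```python
# Rough token accounting for a chat request (sketch, not exact).
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-family models

def count_request_tokens(messages: list[dict]) -> int:
    """Estimate input tokens for a list of {"role": ..., "content": ...} messages."""
    total = 0
    for message in messages:
        total += 4  # approximate per-message formatting overhead (varies by model)
        total += len(encoding.encode(message["content"]))
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached contract."},
]
print(count_request_tokens(messages))
```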
Context Window Sizes by Model
| Model | Context window (tokens) | Approximate equivalent |
|---|---|---|
| GPT-4o | 128,000 | ~100,000 words (~200 pages) |
| GPT-4 Turbo | 128,000 | ~100,000 words |
| Claude Sonnet 4 | 200,000 | ~150,000 words (~300 pages) |
| Claude Opus 4 | 200,000 | ~150,000 words |
| Gemini 1.5 Pro | 1,000,000 | ~750,000 words (~1,500 pages) |
| Gemini 1.5 Flash | 1,000,000 | ~750,000 words |
For most chatbot and tool-use applications, a 128K-token window is more than enough. Long-document analysis (legal contracts, codebases, research papers) benefits from larger windows.
What Happens When You Exceed the Limit
When the total tokens exceed the context window, one of two things happens, depending on the implementation:
- The API returns an error — Some models refuse requests that exceed the limit entirely.
- Older content is dropped — Some implementations automatically truncate the beginning of the context when it is too long.
In a chat application without truncation logic, a very long conversation will eventually fail or start losing early context.
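One common safeguard is to trim the oldest turns before each request. The sketch below reuses the hypothetical count_request_tokens helper from the earlier example and keeps the system prompt intact; production systems often summarize dropped turns rather than discarding them outright.

```python
# Minimal truncation sketch: drop the oldest non-system turns until the
# conversation fits within a token budget.
def truncate_history(messages: list[dict], budget_tokens: int = 120_000) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and count_request_tokens(system + turns) > budget_tokens:
        turns.pop(0)  # discard the oldest user/assistant message
    return system + turns
```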
Input Tokens vs Output Tokens in the Context
The context window typically applies to input tokens plus output tokens combined. The model cannot generate a response longer than context_window − input_tokens.
If you send 120,000 tokens to GPT-4o (128K window), the model can only generate up to 8,000 tokens in response.
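In code, the available output budget is simply the window size minus the measured input, as in this small sketch (again using the hypothetical count_request_tokens helper and the 128K figure from the table above):

```python
CONTEXT_WINDOW = 128_000  # GPT-4o, per the table above

input_tokens = count_request_tokens(messages)      # e.g. 120,000
max_output_tokens = CONTEXT_WINDOW - input_tokens  # leaves 8,000 tokens for the response
```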
Practical Limits in Chat Applications
Even models with large context windows become slow and expensive as context grows:
- Latency increases with context length (the model must attend to all tokens)
- Cost scales linearly with input tokens
For chat applications, a practical strategy is to keep context under 10,000–20,000 tokens unless you have a specific reason to go larger.
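As a rough illustration of the cost side, input spend grows linearly with context length. The per-token price below is a hypothetical placeholder, not a real rate; check your provider's pricing page.

```python
# Illustrative only: input cost scales linearly with context length.
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # hypothetical: $2.50 per million input tokens

def estimated_input_cost(input_tokens: int) -> float:
    return input_tokens * PRICE_PER_INPUT_TOKEN

print(f"${estimated_input_cost(20_000):.4f}")   # chat-sized context
print(f"${estimated_input_cost(120_000):.4f}")  # near the window limit
```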
Retrieval-Augmented Generation (RAG)
Rather than stuffing an entire knowledge base into the context window, RAG systems retrieve only the relevant chunks:
- Embed the knowledge base into a vector database
- At query time, retrieve the top-N relevant chunks
- Insert only those chunks into the prompt
This keeps context small and cost low even when the underlying knowledge base is enormous.
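The sketch below is a toy, in-memory version of that flow. The embed function is a hashed bag-of-words stand-in so the example runs end to end; a real system would call an embedding model and store vectors in a vector database.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding (hashed bag of words); replace with a real embedding model."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, chunks: list[str], vectors: list[np.ndarray], n: int = 3) -> list[str]:
    """Return the top-N chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(zip(chunks, vectors), key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Insert only the retrieved chunks into the prompt."""
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = [
    "Refunds are processed within 30 days of the return being received.",
    "Standard shipping takes 3-5 business days within the continental US.",
    "Gift cards never expire and can be combined with promotional codes.",
]
vectors = [embed(c) for c in chunks]
question = "How long do refunds take?"
print(build_prompt(question, retrieve(question, chunks, vectors, n=1)))
```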
Budgeting for Output Tokens
The model's own generated output consumes context window space. For very long outputs (detailed reports, long code files), reserve room for the response in your context budget.
Some providers let you set a max_tokens parameter to cap output length and prevent runaway generation costs.
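For example, with the OpenAI Python SDK the cap looks like the sketch below. Note that the exact parameter name varies: some newer models expect max_completion_tokens instead of max_tokens, and other providers use their own equivalents.

```python
# Minimal sketch assuming the OpenAI Python SDK (openai>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a one-paragraph summary of RAG."}],
    max_tokens=300,  # cap output length to keep cost and latency predictable
)
print(response.choices[0].message.content)
```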