Utilix knowledge base
Context Windows Explained — Tokens, Limits, and Long Prompts
Published May 3, 2026
Every large language model has a context window: the maximum number of tokens it can process in a single request. Understanding this limit prevents unexpected errors and helps you design applications that handle long conversations gracefully.
What Is a Context Window?
The context window is the model's working memory for a single request. It includes everything the model can "see" at once:
- System prompt
- Conversation history (all prior turns)
- Current user message
- The model's own generated response
The total of all these cannot exceed the context window limit.
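A rough way to check this before sending a request is to count tokens yourself. The sketch below assumes a recent release of the tiktoken library and OpenAI-style chat messages; the per-message overhead is an approximation, and exact accounting differs by model and provider.

```python
# Rough token accounting for a chat request (sketch, not exact).
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-family models

def count_request_tokens(messages: list[dict]) -> int:
    """Estimate input tokens for a list of {"role": ..., "content": ...} messages."""
    total = 0
    for message in messages:
        total += 4  # approximate per-message formatting overhead (varies by model)
        total += len(encoding.encode(message["content"]))
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached contract."},
]
print(count_request_tokens(messages))
```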
Context Window Sizes by Model
| Model | Context window (tokens) | Approximate equivalent |
|---|---|---|
| GPT-4o | 128,000 | ~100,000 words (~200 pages) |
| GPT-4 Turbo | 128,000 | ~100,000 words |
| Claude Sonnet 4 | 200,000 | ~150,000 words (~300 pages) |
| Claude Opus 4 | 200,000 | ~150,000 words |
| Gemini 1.5 Pro | 1,000,000 | ~750,000 words (~1,500 pages) |
| Gemini 1.5 Flash | 1,000,000 | ~750,000 words |
For most chatbot and tool-use applications, a 128K-token window is more than enough. Long-document analysis (legal contracts, codebases, research papers) benefits from larger windows.
What Happens When You Exceed the Limit
When the total tokens exceed the context window, one of two things happens, depending on the implementation:
- The API returns an error — Some models refuse requests that exceed the limit entirely.
- Older content is dropped — Some implementations automatically truncate the beginning of the context when it is too long.
In a chat application without truncation logic, a very long conversation will eventually fail or start losing early context.
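One common safeguard is to trim the oldest turns before each request. The sketch below reuses the hypothetical count_request_tokens helper from the earlier example and keeps the system prompt intact; production systems often summarize dropped turns rather than discarding them outright.

```python
# Minimal truncation sketch: drop the oldest non-system turns until the
# conversation fits within a token budget.
def truncate_history(messages: list[dict], budget_tokens: int = 120_000) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and count_request_tokens(system + turns) > budget_tokens:
        turns.pop(0)  # discard the oldest user/assistant message
    return system + turns
```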
Input Tokens vs Output Tokens in the Context
The context window typically applies to input tokens plus output tokens combined. The model cannot generate a response longer than context_window − input_tokens.
If you send 120,000 tokens to GPT-4o (128K window), the model can only generate up to 8,000 tokens in response.
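In code, the available output budget is simply the window size minus the measured input, as in this small sketch (again using the hypothetical count_request_tokens helper and the 128K figure from the table above):

```python
CONTEXT_WINDOW = 128_000  # GPT-4o, per the table above

input_tokens = count_request_tokens(messages)      # e.g. 120,000
max_output_tokens = CONTEXT_WINDOW - input_tokens  # leaves 8,000 tokens for the response
```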
Practical Limits in Chat Applications
Even models with large context windows become slow and expensive as context grows:
- Latency increases with context length (the model must attend to all tokens)
- Cost scales linearly with input tokens
For chat applications, a practical strategy is to keep context under 10,000–20,000 tokens unless you have a specific reason to go larger.
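As a rough illustration of the cost side, input spend grows linearly with context length. The per-token price below is a hypothetical placeholder, not a real rate; check your provider's pricing page.

```python
# Illustrative only: input cost scales linearly with context length.
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # hypothetical: $2.50 per million input tokens

def estimated_input_cost(input_tokens: int) -> float:
    return input_tokens * PRICE_PER_INPUT_TOKEN

print(f"${estimated_input_cost(20_000):.4f}")   # chat-sized context
print(f"${estimated_input_cost(120_000):.4f}")  # near the window limit
```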
Retrieval-Augmented Generation (RAG)
Rather than stuffing an entire knowledge base into the context window, RAG systems retrieve only the relevant chunks:
- Embed the knowledge base into a vector database
- At query time, retrieve the top-N relevant chunks
- Insert only those chunks into the prompt
This keeps context small and cost low even when the underlying knowledge base is enormous.
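The sketch below is a toy, in-memory version of that flow. The embed function is a hashed bag-of-words stand-in so the example runs end to end; a real system would call an embedding model and store vectors in a vector database.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding (hashed bag of words); replace with a real embedding model."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, chunks: list[str], vectors: list[np.ndarray], n: int = 3) -> list[str]:
    """Return the top-N chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(zip(chunks, vectors), key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Insert only the retrieved chunks into the prompt."""
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = [
    "Refunds are processed within 30 days of the return being received.",
    "Standard shipping takes 3-5 business days within the continental US.",
    "Gift cards never expire and can be combined with promotional codes.",
]
vectors = [embed(c) for c in chunks]
question = "How long do refunds take?"
print(build_prompt(question, retrieve(question, chunks, vectors, n=1)))
```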
Budgeting for Output Tokens
The model's own generated output consumes context window space. For very long outputs (detailed reports, long code files), reserve room for the response in your context budget.
Some providers let you set a max_tokens parameter to cap output length and prevent runaway generation costs.
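For example, with the OpenAI Python SDK the cap looks like the sketch below. Note that the exact parameter name varies: some newer models expect max_completion_tokens instead of max_tokens, and other providers use their own equivalents.

```python
# Minimal sketch assuming the OpenAI Python SDK (openai>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a one-paragraph summary of RAG."}],
    max_tokens=300,  # cap output length to keep cost and latency predictable
)
print(response.choices[0].message.content)
```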