Context, compression, and memory

What you will take away from this lesson

The previous lesson covered tokens as a unit of consumption. This lesson covers the place they all live: the context window. It is the model's entire working memory for a single request, it has a ceiling, and every long session is a small balancing act among keeping that memory, compressing it, and clearing it to start fresh.

The point of the lesson is not to turn you into a context engineer. Plexara and your client handle most of this automatically. The point is to recognize what the three plays are, so you know which one you are making.

Learning Objectives

01 Describe what lives inside a single request's context window and why the model treats the window as one undivided budget.
02 Read a provider's context-window number and translate it into a practical capacity.
03 Recognize that context is a form of memory, and that every long session is a balancing act.
04 Choose between the three plays when a context is filling up: keep using it, let it compress, or clear it and start fresh.
05 Understand how frontier clients already ship some memory and how Plexara's enterprise memory layer extends that with semantic search for re-injecting context into a new session.

What a context window actually is

A context window is the maximum number of tokens a model can read and write inside a single call. It is not separate buckets for prompt, history, tools, and response. It is one budget. Every layer of the conversation lives in the same budget and competes for the same space.

This matters because the window is shared. A long system prompt leaves less room for conversation history. A verbose tool result leaves less room for the model's reasoning. A detailed response leaves less room for whatever tool call follows it. The model treats all of it as one sequence of tokens.

Everything the model reads and writes on one request lives in one budget

System prompt (~500 tokens): The baseline instructions your client sets for the model. Small and stable.
platform_info, the Plexara operating manual (~6K tokens): Loaded once per session. Teaches the agent how to use the MCP correctly.
Prior conversation turns (grows with session): Every question and every answer in the session so far. Grows monotonically.
Retrieved context (varies per turn): Catalog results, query templates, memory recall, and semantic enrichment attached to tool output.
Tool results (varies per turn): Rows from trino_query, objects from S3, lineage from DataHub, and so on. The most variable line.
Current user message (20 to 200 tokens): Your current question.
Model reasoning and response (1K to 5K tokens): What the model writes, including any intermediate reasoning. Counts against the same window.

The model does not distinguish these layers the way you do. It sees one sequence of tokens and has one budget. Going over the budget means something has to leave.
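The single-budget idea can be made concrete with a little arithmetic. The sketch below is illustrative only: it takes the approximate layer sizes from the breakdown above, assumes a 200,000-token window, and uses invented figures for the layers that vary by session (prior turns, retrieved context, tool results). The function name `remaining_budget` is hypothetical.

```python
# Illustrative only: fixed layer sizes are the approximate figures from this
# lesson; the history, retrieval, and tool-result numbers are invented.
WINDOW_TOKENS = 200_000  # assumed window size

layers = {
    "system_prompt": 500,
    "platform_info": 6_000,
    "prior_turns": 40_000,            # assumption: a session already under way
    "retrieved_context": 3_000,       # assumption
    "tool_results": 25_000,           # assumption: one large query result
    "current_user_message": 200,
    "reasoning_and_response": 5_000,  # upper end of the 1K to 5K range
}


def remaining_budget(window: int, layers: dict[str, int]) -> int:
    """Return tokens left after every layer is charged to the same budget."""
    return window - sum(layers.values())


used = sum(layers.values())
left = remaining_budget(WINDOW_TOKENS, layers)
print(f"used {used:,} of {WINDOW_TOKENS:,} tokens; {left:,} left")
```

Because every layer draws on the same total, raising any one entry (a larger tool result, for instance) lowers what is left for all the others. A negative result means the request no longer fits.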

How big is a context window in 2026

Specific numbers change roughly quarterly as providers ship new tiers. As of April 2026 the largest confirmed Claude window is Claude Opus 4.7's, at one million tokens. The industry baseline on frontier tiers sits at 200,000 tokens, itself roughly a hundredfold increase over the 2,048 tokens of the original GPT-3 six years ago. The trend is real growth, but growth has slowed as provider focus has shifted to reasoning quality over raw capacity.

Practical rule: pick the right tier for the work, but do not rely on knowing the exact current number. Plan pages get updated more often than documentation. Specific sizes and tier availability should be confirmed against the provider's current model page.

Context capacity, April 2026

Claude Opus 4.7 (April 2026), 1,000,000 tokens: Fits a mid-sized codebase, a long contract and its exhibits, or a multi-hour meeting transcript in one request.
Baseline frontier tier, 200,000 tokens: The common industry baseline as of 2026 across Claude, Gemini, GPT, and comparable models. Enough for a long document or a full working session.
Original GPT-3 (2020), 2,048 tokens: For scale. Six years ago, frontier context was a few pages of text. The jump from 2,048 tokens to one million is roughly a 500x expansion, and it has redefined what is possible.

Specific window sizes vary by provider and by plan tier and are updated several times a year. Consult the provider's model page for your plan before making sizing decisions.
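To translate a window number into a practical capacity, a rough conversion is usually enough. The sketch below assumes about 0.75 English words per token and about 500 words per page. Both ratios are rules of thumb, not provider figures, and they shift with the tokenizer, the language, and the content type. The helper name `approx_capacity` is hypothetical.

```python
# Rough rules of thumb (assumptions, not provider figures).
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500
GPT3_WINDOW = 2_048  # original GPT-3 window, from this lesson


def approx_capacity(window_tokens: int) -> tuple[int, int]:
    """Return (approximate words, approximate pages) for a window size."""
    words = int(window_tokens * WORDS_PER_TOKEN)
    pages = words // WORDS_PER_PAGE
    return words, pages


for name, tokens in [
    ("Original GPT-3", 2_048),
    ("Baseline frontier tier", 200_000),
    ("One-million-token tier", 1_000_000),
]:
    words, pages = approx_capacity(tokens)
    growth = tokens / GPT3_WINDOW  # multiple of the original GPT-3 window
    print(f"{name}: ~{words:,} words, ~{pages:,} pages, {growth:.0f}x GPT-3")
```

Under these assumptions a 200,000-token window holds on the order of 150,000 words. Treat the result as an upper bound on what you can paste in, since the system prompt, tool results, and the model's own response draw on the same window.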

Context is memory, and memory fills up

The practical framing that matters most: the context window is the model's working memory. Everything you have discussed in this session so far, every tool result, every decision, every artifact that the model is still aware of, lives in the same budget. When the budget fills up, something has to happen. Continuing to chat past the ceiling is not an option.

This turns the mechanical question "how big is my window" into a practical one: "how am I going to manage my working memory across a long session." There are three plays, and each is right in different situations.

Play 1

Keep using

When: The context is valuable and still fits
What happens: Do nothing. Stay in the same session. The model keeps full recall of everything you and it have said and done so far.
Tradeoff: Every turn costs slightly more than the last, and a session that runs long enough will eventually approach the ceiling.

Play 2

Let it compress

When: The context is approaching the ceiling but the work is still live
What happens: Your client automatically summarizes older turns to make room. Recent turns stay verbatim; early turns get condensed.
Tradeoff: You keep the thread but lose fidelity on the oldest turns. Important early decisions should be surfaced again or committed to memory before they compress.

Play 3

Clear and start fresh

When: You are moving on to something else, or the session has accumulated stale state
What happens: Start a new session. Conversation history resets to zero, so context cost drops back to the session baseline of system prompt plus platform_info. The agent starts clean.
Tradeoff: Anything that lived only in the conversation is gone. This is where memory and semantic search pay off: Plexara can recall the right prior memories into the new session so you do not start from nothing.

Three moves. Which one is right depends entirely on whether you still need the current conversation. The fourth option, doing nothing past the ceiling, is the one to avoid.
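The choice among the three plays can be sketched as a small function. The 80% threshold for "approaching the ceiling" is an assumption chosen for illustration; this lesson does not specify one, and behavior varies between clients. The names `Play` and `choose_play` are hypothetical.

```python
from enum import Enum


class Play(Enum):
    KEEP_USING = "keep using"
    LET_IT_COMPRESS = "let it compress"
    CLEAR_AND_START_FRESH = "clear and start fresh"


# Assumed threshold: treat 80% full as "approaching the ceiling".
APPROACHING_CEILING = 0.80


def choose_play(used_tokens: int, window_tokens: int,
                still_need_conversation: bool) -> Play:
    """Pick a play from the fill level and whether the thread is still live."""
    if not still_need_conversation:
        # Moving on, or the session holds stale state: start clean.
        return Play.CLEAR_AND_START_FRESH
    fill = used_tokens / window_tokens
    if fill < APPROACHING_CEILING:
        return Play.KEEP_USING
    return Play.LET_IT_COMPRESS


print(choose_play(60_000, 200_000, still_need_conversation=True).value)
print(choose_play(180_000, 200_000, still_need_conversation=True).value)
print(choose_play(180_000, 200_000, still_need_conversation=False).value)
```

The first check is deliberately about need, not size: if you no longer need the conversation, clearing is the right play however much room is left.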

What automatic compression actually does

The middle play, letting the client compress, is the one most people encounter without naming it. Every major client begins trimming and summarizing older turns once a conversation approaches the ceiling. It happens silently, and the behavior varies between clients, but the pattern is the same: recent turns stay verbatim, early turns lose fidelity, and the model is left reasoning over a summary of its own history instead of the full record.

This is usually fine for ongoing work where the early turns are exploratory scaffolding. It is not fine when a decision or a definition from early in the session is load-bearing for what you are doing now. If something matters enough that it needs to survive an hour of conversation, write it to memory rather than leaving it in the chat to be compressed.
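The pattern of "recent turns stay verbatim, early turns get condensed" can be sketched in a few lines. This is a toy model, not any client's actual algorithm: the four-characters-per-token estimate is a rough assumption, and `summarize` is a placeholder that truncates each turn where a real client would ask a model to write the summary. All function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def summarize(turns: list[str]) -> str:
    """Placeholder summarizer: keeps the first 40 characters of each turn."""
    return "Summary of earlier turns: " + " | ".join(t[:40] for t in turns)


def compress(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Keep the newest turns verbatim; condense the rest if over budget."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return list(turns)  # still fits, or nothing old enough to condense
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


history = [
    "Definition: 'active customer' means at least one order in the last 90 days, excluding refunds.",
    "Here is an exploratory query over the orders table with several joins and filters applied.",
    "The result shows 12 regions; let us focus on the top three by revenue.",
    "Now count active customers in those three regions.",
]
compressed = compress(history, budget=50)
for turn in compressed:
    print(turn)
```

In this run the two most recent turns survive intact, while the early definition is cut off before the "90 days" detail that the final question depends on. That is the failure mode to guard against by writing load-bearing facts to memory.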

What automatic compression actually looks like as the window fills

1. Well under capacity
The full conversation sits in the context window as-is. The model sees everything you and it have said in this session.
Tradeoff: None. This is the best state to work in.
2. Approaching capacity
Most clients begin compressing older turns. Early messages get replaced with summaries; tool results get trimmed. Different clients do this differently and usually silently.
Tradeoff: Facts from early in the conversation may degrade or disappear. The model might forget a task definition you gave it an hour ago.
3. Over capacity
Something has to leave. Without intervention the client drops older turns outright or refuses new input. This is the point at which continuity of the session is no longer guaranteed.
Tradeoff: Any fact, decision, or artifact that was only in the conversation is now in jeopardy. The right move here is usually to clear and rely on memory for re-injection.
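The over-capacity stage, where older turns are dropped outright, can be sketched as follows. This is an illustration of the general idea under a rough four-characters-per-token assumption, not the behavior of any particular client; `drop_oldest_until_fit` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def drop_oldest_until_fit(turns: list[str],
                          budget: int) -> tuple[list[str], list[str]]:
    """Return (kept, dropped): discard oldest turns until the rest fit."""
    kept = list(turns)
    dropped: list[str] = []
    while kept and sum(estimate_tokens(t) for t in kept) > budget:
        dropped.append(kept.pop(0))
    return kept, dropped


turns = [
    "Task definition: report weekly revenue in EUR, not USD.",
    "Tool result: " + "row, " * 200,  # a verbose tool result
    "Follow-up: break the revenue down by region.",
]
kept, dropped = drop_oldest_until_fit(turns, budget=100)
print(f"kept {len(kept)} turn(s), dropped {len(dropped)}")
print("task definition still in context:", turns[0] in kept)
```

Here the verbose tool result is what overflows the budget, yet the short task definition is dropped first because it is the oldest turn. Age, not importance, decides what leaves.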

What lives past the context window: two memory layers

Modern frontier clients have started shipping some memory of their own. Claude, ChatGPT, and similar clients will remember small facts about you personally across sessions inside that client. This is useful for preferences and long-running style notes. It is not built for enterprise data and is not governed by your company's policies.

Plexara's memory subsystem sits above client-side memory and is scoped to a workspace. It is semantically searchable, governed by access controls, and connected to the catalog. A brand-new session can therefore retrieve the relevant prior memories (facts about your data, decisions made yesterday, curated artifacts) and re-inject them as context without you having to name them. This is covered in detail in the 207 lesson.

Layer

Client-side memory

Scope: Per user, per client (Claude, ChatGPT, and comparable clients ship this)
Good at: Personal preferences, long-running style notes, "remember I work in Python" kinds of facts. Surfaces automatically inside the same client.
Not for: Enterprise data. Not governed by your policies, not connected to your catalog, not shared across teammates, not searchable by topic.

Layer

Plexara enterprise memory

Scope: Per workspace or team, governed by your access controls
Good at: Facts about your data, decisions made in prior sessions, curated artifacts, glossary terms, and anything else that needs to survive across a work week. Semantically searchable, so a new session can retrieve the relevant memories without asking for them by name.
Not for: Raw secrets or credentials. Memory is a recall layer, not a vault.
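The idea of retrieving relevant memories without naming them can be sketched with a toy similarity search. This is not Plexara's API or implementation, which this lesson does not describe: word-count vectors stand in for real embeddings, the memories are invented, and `vectorize`, `cosine`, and `recall` are hypothetical names.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(memories: list[str], query: str, top_k: int = 2) -> list[str]:
    """Return up to top_k memories that overlap the query, best match first."""
    q = vectorize(query)
    scored = [(cosine(vectorize(m), q), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:top_k] if score > 0]


# Hypothetical workspace memories (invented for the example).
memories = [
    "Decision: weekly revenue is reported in EUR, not USD.",
    "Glossary: an active customer has at least one order in the last 90 days.",
    "Artifact: the churn dashboard lives in the analytics workspace.",
]

question = "How many active customers placed an order last week?"
recalled = recall(memories, question, top_k=1)
context_block = "Relevant prior memories:\n" + "\n".join(f"- {m}" for m in recalled)
print(context_block)
```

The question never mentions the glossary, but the glossary entry is the one recalled because it shares the most vocabulary with the question. The resulting block is what would be placed into the fresh session's context. Embedding-based search extends the same idea to matches on meaning rather than exact words.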

Putting it all together

Use context while it fits. Let it compress if the thread is still live but the window is getting tight. Clear and start fresh when you are moving to something else, and trust memory to bring back what matters. The balancing act is small once you have seen the three plays side by side.

Key terms

Six terms cover most of the vocabulary you will see in client documentation and in discussions about session strategy.

Context window: The per-request token budget a model can read and write inside a single call. Everything the model considers must fit here.
Working memory: What a model holds inside a single request. The context window is working memory: present in the moment, gone at the next request unless something else preserves it.
Context compression: Client-side behavior that summarizes, truncates, or drops older turns when a conversation approaches the context window ceiling. Quiet and implementation-specific.
Clear and restart: Starting a new session instead of letting the current one fill up. Resets conversation history to zero, so context cost drops back to the session baseline. Requires memory to re-inject anything you still need.
Client-side memory: Memory that the AI client itself (Claude, ChatGPT, and similar) stores about a user, surfaced automatically in later sessions inside that client.
Plexara enterprise memory: A semantically searchable, governed memory layer scoped to a workspace or team. Lets a fresh session retrieve relevant prior memories to re-inject as context. Covered in detail in 206.

103 - Context, compression, and memory