Context Windows, Memory, and Why Your AI Agent Forgets Everything
Your AI agent's biggest weakness isn't intelligence — it's memory. Here's how context windows work, why they cause amnesia, and what systems like Cortex do to fix it.

You've been working with your AI agent for three weeks. You've told it about your company's strategy, your key accounts, your preferences, your team's quirks. Then you start a new session and it asks, "What does your company do?"
This isn't a bug. It's architecture. And until you understand why it happens, you'll keep getting burned by AI tools that seem smart in the moment and amnesiac by the next morning.
What is a Context Window?
Every large language model has a context window: the maximum amount of text it can process in a single interaction. Think of it as the model's working memory: everything it can "see" and reason about at once.
Context window sizes have grown dramatically:
- GPT-3 (2020): 2,048 tokens (~1,500 words)
- GPT-4 (2023): 8,192 tokens, with a 32K variant
- Claude 3 (2024): 200,000 tokens (~150,000 words)
- Claude Opus 4.6 (2026): 200K standard, 1,000,000 tokens in beta
- GPT-5.4 (2026): 1,000,000 tokens (experimental)
- Gemini 3.1 Pro (2026): 1,000,000 tokens input, 65K output
At first glance, 200K tokens seems like plenty. That's a short novel. Why would memory be a problem?
Because context windows are expensive, ephemeral, and deceptively limited.
The Three Problems with Context Windows
Problem 1: They're Expensive
Every token in the context window costs money. With GPT-4-class models, input tokens cost $2.50-$10 per million. If you stuff 100K tokens of context into every API call, you're paying $0.25-$1.00 per interaction just for context, before the model generates a single word of output.
For an AI operator that processes hundreds of interactions daily, context costs can exceed the cost of the output the model actually generates. This creates a perverse incentive: the more context you provide (and the smarter the agent could be), the more each interaction costs.
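The arithmetic is worth making concrete. Here's a quick sketch using the illustrative $2.50-$10 per million token rates from above (not any specific provider's rate card):

```python
def context_cost(context_tokens: int, price_per_million: float) -> float:
    """Dollar cost of sending `context_tokens` as input at a given rate."""
    return context_tokens / 1_000_000 * price_per_million

# 100K tokens of context per call, at the low and high ends of the range
low = context_cost(100_000, 2.50)   # $0.25 per call
high = context_cost(100_000, 10.0)  # $1.00 per call

# An operator handling 300 interactions a day pays this for context alone
daily_low, daily_high = 300 * low, 300 * high
print(f"per call: ${low:.2f}-${high:.2f}, per day: ${daily_low:.0f}-${daily_high:.0f}")
```

At 300 interactions a day, that's $75-$300 daily just to re-send context, which is why context management, not raw model capability, dominates operating cost.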
Problem 2: They're Ephemeral
Here's the part that surprises most people: the context window resets with every API call. When your conversation with ChatGPT "remembers" what you said earlier, it's because the application is re-sending the entire conversation history with each new message. The model itself has no memory. It's re-reading the conversation from the beginning every single time.
This means:
- Start a new session? Context is gone.
- Exceed the window size? Oldest messages get dropped.
- Switch models or providers? Everything resets.
- Server restarts? Clean slate.
Your AI agent doesn't actually remember anything. It's performing the illusion of memory by re-reading its notes before every response.
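The illusion is easy to demonstrate. In this minimal sketch, `call_model` is a stand-in for any stateless LLM API, not a real client library; the point is that the client re-sends the entire history on every turn:

```python
# Why chat "memory" is an illusion: the model is stateless, so the client
# must re-send the whole conversation with every request.
def call_model(messages: list[dict]) -> str:
    # A real API would return the model's reply; here we just report
    # how much conversation the model had to re-read.
    return f"(model saw {len(messages)} messages)"

history: list[dict] = []

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # full history, every single call
    history.append({"role": "assistant", "content": reply})
    return reply

chat("My company is Acme Corp.")
chat("What does my company do?")  # the model re-reads all 3 prior messages
```

Delete `history` (a new session, a server restart) and the model knows nothing. Nothing was ever stored on the model side.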
Problem 3: They Degrade with Length
Even within a single context window, performance degrades as length increases. Research has consistently shown that LLMs struggle with information in the middle of long contexts, a phenomenon called "lost in the middle." Information at the beginning and end of the context gets more attention; information in the middle gets partially ignored.
This means that even if you could fit everything into the context window, the model wouldn't process it all equally. Your carefully curated company knowledge might be sitting in the dead zone where the model pays least attention.
Why "Just Make It Bigger" Doesn't Work
The obvious solution is bigger context windows. Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro all offer a million tokens. Problem solved, right?
Not even close.
Cost scales linearly (or worse). A 1M token context window is 5x the cost of a 200K window. For frequent interactions, this becomes prohibitively expensive.
Retrieval quality decreases. Research shows that as context windows grow, the model's ability to find and use specific information within that context degrades. It's the digital equivalent of an overstuffed filing cabinet: everything's in there, but good luck finding what you need.
Latency increases. Larger contexts mean longer processing times. A 1M token context window can add significant latency to every response. That's not viable for interactive use cases.
It doesn't solve cross-session persistence. Even with a million-token window, the context resets between sessions. You'd need to reload all that context at the start of every new conversation. At current input pricing, and with the long-context premium many providers charge, that's $5-20+ per session just to restore context.
Bigger context windows are useful, but they're a band-aid on a fundamental architectural problem. The real solution is a memory system that sits outside the context window.
How Memory Systems Actually Work
Production AI systems solve the memory problem with external memory architectures. These systems store, retrieve, and manage information independently of the LLM's context window. The model doesn't remember; the system remembers for it.
Retrieval-Augmented Generation (RAG)
The most common approach. Information is stored in a vector database. When the model needs context, relevant documents are retrieved based on semantic similarity and injected into the context window.
How it works:
- Information is converted to vector embeddings and stored
- When a query arrives, it's converted to a vector
- Similar vectors (relevant information) are retrieved
- Retrieved information is added to the context window
- The model generates a response using the augmented context
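The steps above can be sketched end to end. Real systems use learned embedding models; here a toy bag-of-words vector stands in so the retrieve-then-augment flow is runnable (the documents and query are illustrative):

```python
import math

DOCS = [
    "Acme Corp renewal is due in Q2 2026",
    "Office coffee machine maintenance schedule",
    "Acme Corp main contact is Jane Smith",
]

def embed(text: str) -> dict[str, float]:
    # Stand-in for a real embedding model: word-count vector
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augmented_prompt(query: str) -> str:
    # Retrieved documents are injected into the context window
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(augmented_prompt("who is the contact at Acme Corp"))
```

The retrieval step surfaces the Jane Smith document first because it shares the most terms with the query, which is exactly the strength and the weakness of similarity search: it finds what looks like the question, not necessarily what answers it.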
Limitations:
- Retrieval quality depends entirely on the embedding model and search algorithm
- Semantic similarity isn't the same as relevance. A document can be semantically similar but contextually irrelevant
- No understanding of temporal relationships. RAG doesn't inherently know that last week's pricing update supersedes last month's
- Chunk size is a constant tradeoff: too small and you lose context, too large and you waste tokens
RAG is a necessary component but not sufficient for real memory.
Conversation Summarization
Another common approach: periodically summarize the conversation and use the summary as context instead of the full history.
Limitations:
- Summarization loses detail. The summary might capture that you discussed pricing but not the specific numbers
- Cascading summarization (summarizing summaries) compounds information loss
- The model decides what's important to summarize, and it doesn't always get that right
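A rolling-summarization loop looks roughly like this. In a real system `summarize` would be an LLM call; the crude stand-in here (keep each message's first clause) makes the information loss easy to see:

```python
def summarize(messages: list[str]) -> str:
    # Stand-in for an LLM summarizer: keeps only each message's first clause
    return " ".join(m.split(",")[0].split(".")[0] + "." for m in messages)

def compact_history(messages: list[str], keep_recent: int = 2) -> list[str]:
    # Keep the most recent messages verbatim; summarize everything older
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return ["[summary] " + summarize(older)] + recent

history = [
    "We discussed pricing, and settled on $40 per seat per month.",
    "Jane asked for a pilot, starting with 25 seats in March.",
    "Legal wants a 30-day termination clause.",
    "Send the revised contract by Friday.",
]
for line in compact_history(history):
    print(line)
# The summary records *that* pricing was discussed; "$40 per seat" is gone.
```

The compacted history still knows pricing came up but has lost the number, which is the failure mode described above. Summarize the summary in a later pass and even more detail disappears.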
Structured Knowledge Graphs
More sophisticated systems maintain structured knowledge about entities and their relationships. Instead of unstructured text retrieval, the system knows that "Acme Corp is a client, their main contact is Jane Smith, they signed in Q2 2025, and their contract is up for renewal in Q2 2026."
Advantages:
- Precise factual recall
- Temporal awareness (knows when information was learned or updated)
- Relationship tracking (connects related entities)
- Efficient storage (facts, not paragraphs)
Limitations:
- Complex to build and maintain
- Requires entity extraction and relationship mapping
- Can miss nuanced or unstructured insights
What OpFleet's Cortex System Does Differently
Most memory systems treat information storage as a monolithic problem. Everything goes into one bucket and you hope retrieval finds the right stuff. Cortex takes a different approach based on how human memory actually works.
Multi-Tier Memory Architecture
Cortex maintains four distinct memory tiers, each optimized for a different type of information:
Tier 1: Working Context. The operator's immediate focus. What it's actively working on, current task state, recent interactions. This lives in the context window and is tightly managed to maximize the signal-to-noise ratio.
Tier 2: Active Knowledge. Information the operator accesses frequently: key contacts, ongoing projects, current priorities, recent decisions. This is pre-loaded at session start and updated as things change. Think of it as what a human employee keeps on their desk.
Tier 3: Reference Knowledge. Broader organizational knowledge that's accessed on demand: process documentation, historical data, archived projects. Retrieved via semantic search when relevant, not loaded by default.
Tier 4: Deep Archive. Everything the operator has ever processed, stored for completeness and audit purposes. Rarely accessed directly but available for deep research or compliance needs.
Intelligent Context Management
The real innovation in Cortex isn't storage. It's context management: deciding what goes into the limited context window at any given moment.
Before each reasoning step, Cortex performs what we call context assembly:
- Intent analysis: What is the operator trying to do right now?
- Relevance scoring: What stored information is relevant to this specific task?
- Recency weighting: Recent information gets priority, but important older information isn't lost
- Contradiction resolution: When stored facts conflict, Cortex identifies the most recent and most reliable version
- Token budgeting: Given the available context window, what's the highest-value information to include?
This means the operator's context window is always optimally packed with relevant information rather than stuffed with everything or stripped to nothing.
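The token-budgeting step can be sketched as a greedy packing problem. The 0.7/0.3 relevance-recency weighting and the greedy strategy here are arbitrary illustrations of the idea, not Cortex's actual formula:

```python
# Pack the highest-value items into a fixed context budget
def assemble_context(items: list[dict], budget_tokens: int) -> list[str]:
    def score(item: dict) -> float:
        # Illustrative weighting: relevance matters most, recency breaks ties
        return 0.7 * item["relevance"] + 0.3 * item["recency"]

    chosen, used = [], 0
    for item in sorted(items, key=score, reverse=True):
        if used + item["tokens"] <= budget_tokens:
            chosen.append(item["text"])
            used += item["tokens"]
    return chosen

items = [
    {"text": "current task state",  "tokens": 300, "relevance": 0.9, "recency": 1.0},
    {"text": "Q3 strategy memo",    "tokens": 900, "relevance": 0.8, "recency": 0.4},
    {"text": "2023 archived notes", "tokens": 700, "relevance": 0.3, "recency": 0.1},
]
print(assemble_context(items, budget_tokens=1200))
```

With a 1,200-token budget, the task state and strategy memo make the cut and the stale archive is left out; shrink the budget and only the highest-scoring item survives. That is the "optimally packed, not stuffed" behavior in miniature.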
Temporal Awareness
One of the most underappreciated aspects of memory is time. Human memory is inherently temporal. You know that the meeting on Monday supersedes the plan from last Friday. Most AI memory systems don't have this.
Cortex timestamps everything and uses temporal relationships for retrieval. When the operator retrieves information about a project, it gets the current state, not a random slice of history. When facts change, the system knows which version is current without requiring manual updates.
Memory Consolidation
Humans don't remember every detail of every day. They consolidate: important events become long-term memories, routine details fade. Cortex does the same thing.
After each work session, Cortex runs a consolidation process:
- Important decisions, outcomes, and new facts are promoted to Active Knowledge
- Routine interactions are summarized and archived
- Contradictions between new and stored information are flagged and resolved
- Knowledge connections are updated (new relationships between entities)
This keeps the memory system lean and relevant rather than growing unboundedly with every interaction.
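The promotion/archive decision at the heart of consolidation can be sketched in a few lines. The numeric importance threshold is a placeholder for whatever signal a real system would use (decision detected, outcome recorded, user correction, and so on):

```python
# Post-session consolidation: promote important facts, archive routine ones
def consolidate(session_events: list[dict],
                active_knowledge: list[str],
                archive: list[str],
                threshold: float = 0.7) -> tuple[list[str], list[str]]:
    for event in session_events:
        if event["importance"] >= threshold:
            active_knowledge.append(event["fact"])  # promoted to Active Knowledge
        else:
            archive.append(event["fact"])  # summarized and archived in a real system
    return active_knowledge, archive

active, archived = consolidate(
    [
        {"fact": "Client approved the Q3 budget", "importance": 0.9},
        {"fact": "Rescheduled standup to 9:30",   "importance": 0.2},
    ],
    active_knowledge=[],
    archive=[],
)
print(active)    # the budget approval is promoted
print(archived)  # the scheduling detail fades to the archive
```

The decision survives in fast-access memory; the routine detail drops to cold storage. Run this after every session and the active tier stays small while nothing is actually lost.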
Practical Impact: What Good Memory Looks Like
The difference between an AI system with and without proper memory is dramatic:
Without memory (typical chatbot/agent):
- "What does your company do?" (asked for the 50th time)
- Ignores preferences you've stated repeatedly
- Can't reference past decisions or their outcomes
- Treats every interaction as the first one
- Loses context mid-workflow when the conversation gets long
With memory (well-designed operator):
- Knows your company, team, key accounts, and current priorities
- Adapts to your communication preferences without being reminded
- References past decisions: "Last time we tried approach X, it resulted in Y"
- Maintains context across sessions, days, and weeks
- Manages its own context window to stay focused on what matters
The practical result is an AI system that gets better over time instead of starting from zero every day. After a month of working with a properly memory-enabled operator, the accumulated context and institutional knowledge make it dramatically more effective than on day one.
What You Can Do Today
If you're building or evaluating AI systems, here's what to look for:
Ask about memory architecture. "How does your system handle information across sessions?" If the answer is "we send conversation history with each request," you're looking at a chatbot with a long scroll bar, not a memory system.
Test temporal reasoning. Tell the system something on Monday, tell it something different on Wednesday, then ask about it on Friday. Does it know which information is current?
Check for context degradation. Have a long interaction (50+ messages) and see if the system still references information from the beginning of the conversation. If it's lost early context, the context window management is poor.
Look for knowledge structure. Can the system tell you what it knows? Can it explain why it made a certain decision based on past context? Systems with structured memory can answer these questions. Systems with simple RAG cannot.
Evaluate cost at scale. What does memory cost at 100 interactions per day? 1,000? If the answer is "it scales linearly with context window size," that's a warning sign.
The memory problem is the defining challenge of the current AI generation. Models are smart enough. Tools are capable enough. The bottleneck is making AI systems that remember, learn, and build on past experience. Whoever solves memory well wins the AI operator race. Everything else is a feature. Memory is the foundation.