Memory & Knowledge
An LLM has no memory between API calls — every call starts fresh. The illusion of memory in your agents is entirely your responsibility as the builder. This section covers the four memory types every agent builder must understand, how each is implemented, when context windows overflow and what to do about it, and how semantic memory via vector stores extends what an agent can "know" far beyond its context window limit.
A Taxonomy of Agent Memory
Lilian Weng's 2023 survey on LLM-powered agents identifies four distinct memory types that map to different implementation strategies. Understanding this taxonomy prevents the most common mistake: trying to solve all memory problems with a single approach.
The Context Window Fills Up — Then What?
Every message appended to the conversation history grows the context window. In a long agent run — many tool calls, large API responses, extended reasoning — the context will eventually approach the model's limit. Unlike a database, there is no automatic overflow handling: you hit the limit and the API returns an error, or performance silently degrades.
There are three strategies for managing this, and the right choice depends on what information the agent actually needs to retain.
| STRATEGY | HOW IT WORKS | TRADE-OFF |
|---|---|---|
| Sliding Window | Keep only the last N turns. Drop the oldest messages when the window is full. | Simple. Loses early context permanently — the agent forgets what was decided at the start. |
| Summarization | Periodically compress older turns into a summary using an LLM call, then replace them with the summary. | Retains the gist. Introduces an extra API call and summary-induced hallucination risk. |
| Selective Retention | Mark messages as "must-keep" (critical decisions, user goals) vs. "evictable" (tool outputs, scratch reasoning). Only evict the latter. | Best quality. Most complex to implement — requires structured metadata on each message. |
"[Note: earlier conversation has been summarized above]" so the model knows its history is compressed.
Giving Your Agent Access to a Knowledge Base
When an agent needs to recall facts from a corpus too large to fit in context — internal docs, a product manual, a research library — you use a vector store. Documents are broken into chunks, each chunk is converted to a vector embedding (a list of numbers that encodes semantic meaning), and those embeddings are stored in a database optimised for similarity search.
At query time, the agent's question is embedded with the same model, and the store returns the chunks whose embeddings are closest in vector space — the semantically most relevant passages. These chunks are injected into the agent's context as retrieved knowledge. This is the foundation of Retrieval-Augmented Generation (RAG), covered in depth in Section 10.
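Under the hood, similarity search reduces to comparing vectors. Here is a dependency-free sketch with toy 3-dimensional embeddings — the values are made up for illustration (real embeddings have hundreds to thousands of dimensions, and production stores use approximate nearest-neighbour indexes rather than this brute-force scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" for three chunks (illustrative values, not a real model's output)
chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "return window": [0.8, 0.2, 0.1],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Brute-force nearest-neighbour search over all stored chunks."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, chunks[c]),
                    reverse=True)
    return ranked[:k]

print(top_k([0.85, 0.15, 0.05]))  # → ['refund policy', 'return window']
```

The two refund-related chunks score highest because their vectors point in nearly the same direction as the query vector, regardless of magnitude — exactly the property that makes cosine similarity the standard choice for embedding retrieval.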
Remembering What Happened Across Sessions
In-context and semantic memory operate within a session or across a static knowledge base. Episodic memory is different: it stores a structured log of past agent runs — what the user asked, what the agent did, what the outcome was — and makes that log retrievable in future sessions.
The Generative Agents paper (Park et al., 2023) introduced a concrete implementation: agents store experiences as natural language strings, each tagged with recency, importance, and relevance scores. When the agent needs to recall past experience, it scores all memories and injects the top-scoring ones into context. A stored episode might look like this:
```json
{
  "id": "ep_2026-04-04_001",
  "timestamp": "2026-04-04T14:23:00Z",
  "user_goal": "Summarise Q1 sales report",
  "tools_used": ["read_file", "web_search"],
  "outcome": "success",
  "summary": "Read sales.csv, searched for industry benchmarks, returned 3-para summary.",
  "tokens_used": 4821
}
```
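The retrieval score can be sketched as a sum of the three signals. The 0.995-per-hour exponential decay for recency follows the paper; everything else here is a simplification — in the paper, relevance comes from embedding similarity to the current query, whereas this sketch assumes it is precomputed:

```python
def memory_score(recency_hours: float, importance: float, relevance: float,
                 decay: float = 0.995) -> float:
    """Sum of recency + importance + relevance, after Park et al. (2023).

    Recency decays exponentially with hours since the memory was last
    accessed; importance and relevance are assumed normalised to [0, 1].
    """
    recency = decay ** recency_hours
    return recency + importance + relevance

memories = [
    {"summary": "User prefers concise answers", "hours": 2.0,
     "importance": 0.9, "relevance": 0.7},
    {"summary": "Agent read sales.csv", "hours": 200.0,
     "importance": 0.3, "relevance": 0.9},
    {"summary": "Small talk about weather", "hours": 1.0,
     "importance": 0.1, "relevance": 0.1},
]

top = max(memories, key=lambda m: memory_score(m["hours"], m["importance"], m["relevance"]))
print(top["summary"])  # → User prefers concise answers
```

The decay term means an old but important memory can still outrank a fresh trivial one — the sales.csv episode beats the small talk here despite being 200 hours old.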
Rules for Building Memory-Aware Agents
- Track usage.input_tokens after every API call. Set a hard threshold (e.g., 80% of the context limit) that triggers compression before you hit the wall mid-task.
- Label compressed history explicitly, e.g. with a prefix like [Earlier turns summarised]. This prevents the model from confusing compressed history with current instructions.
Verified References
Every claim in this section is grounded in one of these sources.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Lilian Weng — LLM Powered Autonomous Agents | Blog / Survey | Four memory types taxonomy, in-context vs external memory | June 2023 |
| Park et al. — Generative Agents | Academic paper | Episodic memory, recency/importance/relevance scoring, memory streams | 2023 |
| Anthropic — Prompt Caching | Official docs | Caching static prefixes, cost reduction for long system prompts | Maintained 2024–2026 |
| ChromaDB — Official Docs | Official docs | Local vector store, collections, add/query API | Maintained 2023–2026 |
| Anthropic — Embeddings Guide | Official docs | Embedding models, similarity search, practical recommendations | Maintained 2024–2026 |
| LangChain — Vector Store Concepts | Official docs | Chunking strategies, ANN search, vector store integrations | Maintained 2023–2026 |
Build an Agent with Sliding-Window Memory and an Episodic Log
You will build a conversational agent that manages its own context window using a sliding window strategy, logs every session to a JSON episodic memory file, and loads that log at startup so it can reference past conversations. By the end you will be able to watch the context window fill and compress in real time.
Use the same agent-lab virtualenv from previous sections. Create a new file:
```shell
cd agent-lab && source .venv/bin/activate
touch memory_agent.py
```
The episodic store reads and writes a JSON file. On startup the agent loads past sessions; on shutdown it appends the current session's summary. Simple but effective for single-user agents.
```python
import os
import json
from datetime import datetime, timezone

import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic()

MEMORY_FILE = "episodes.json"
MAX_TURNS = 6      # sliding window: keep last N user+assistant pairs
TOKEN_WARN = 8000  # log a warning when input tokens exceed this


def load_episodes() -> list[dict]:
    """Load past episodes from the JSON file, or return empty list."""
    if not os.path.exists(MEMORY_FILE):
        return []
    with open(MEMORY_FILE) as f:
        return json.load(f)


def save_episode(goal: str, turns: int, outcome: str) -> None:
    """Append the current session to the episodic memory file."""
    episodes = load_episodes()
    episodes.append({
        "id": f"ep_{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_goal": goal,
        "turns": turns,
        "outcome": outcome,
    })
    with open(MEMORY_FILE, "w") as f:
        json.dump(episodes, f, indent=2)
    print(f"[memory] episode saved → {MEMORY_FILE}")
```
This function trims the message list to the last MAX_TURNS user/assistant pairs whenever it grows too long. Notice the attribution marker inserted when trimming — this tells the model its history was cut.
```python
def trim_messages(messages: list[dict]) -> list[dict]:
    """Keep only the last MAX_TURNS user+assistant pairs.

    Inserts a notice when history is trimmed so the model is aware.
    """
    # A "pair" is one user + one assistant message = 2 items
    max_messages = MAX_TURNS * 2
    if len(messages) <= max_messages:
        return messages
    trimmed = messages[-max_messages:]
    notice = {
        "role": "user",
        "content": (
            "[System note: earlier conversation turns were trimmed to fit "
            "the context window. Only the last few turns are shown.]"
        ),
    }
    print(f"[memory] context trimmed: keeping last {MAX_TURNS} turns")
    return [notice] + trimmed
```
At startup, the agent loads past episodes and includes a summary of them in the system prompt. This is a simple form of episodic memory injection — the model knows what happened before without re-playing the full conversation.
```python
def build_system_prompt(episodes: list[dict]) -> str:
    """Build the system prompt, injecting past episode summaries."""
    base = (
        "You are a helpful assistant with memory of past conversations. "
        "You answer questions clearly and concisely."
    )
    if not episodes:
        return base
    # Inject the last 3 episodes as context, reversed so the most recent
    # episode actually appears first, matching the header below
    recent = episodes[-3:][::-1]
    history_lines = [
        f"- [{ep['timestamp'][:10]}] Goal: {ep['user_goal']} "
        f"({ep['turns']} turns, {ep['outcome']})"
        for ep in recent
    ]
    history_block = "\n".join(history_lines)
    return (
        f"{base}\n\n"
        "Past sessions (most recent first):\n"
        f"{history_block}\n\n"
        "Use this context to give continuity to the user across sessions."
    )
```
The main loop handles user input, calls the API, trims context when needed, tracks token usage, and saves the episode on exit. Run it and have a multi-turn conversation — then quit, run it again, and watch the agent reference the previous session.
```python
def run_chat() -> None:
    episodes = load_episodes()
    system = build_system_prompt(episodes)
    messages: list[dict] = []
    turn_count = 0
    first_message = ""

    print("\n=== Memory Agent ===")
    print(f"Loaded {len(episodes)} past episode(s).")
    print('Type your message. Type "quit" to exit and save.\n')

    while True:
        user_input = input("You: ").strip()
        if not user_input:
            continue
        if user_input.lower() == "quit":
            break
        if not first_message:
            first_message = user_input

        messages.append({"role": "user", "content": user_input})
        messages = trim_messages(messages)

        response = client.messages.create(
            model="claude-opus-4-6",  # check docs.anthropic.com/en/docs/about-claude/models
            max_tokens=512,
            system=system,
            messages=messages,
        )

        input_tokens = response.usage.input_tokens
        if input_tokens > TOKEN_WARN:
            print(f"[memory] ⚠ high token usage: {input_tokens} input tokens")

        reply = response.content[0].text
        messages.append({"role": "assistant", "content": reply})
        turn_count += 1
        print(f"Agent [{input_tokens} tok]: {reply}\n")

    # Save episode on exit
    if turn_count > 0:
        save_episode(goal=first_message, turns=turn_count, outcome="completed")


if __name__ == "__main__":
    run_chat()
```
```shell
python memory_agent.py
```
```
=== Memory Agent ===
Loaded 0 past episode(s).
Type your message. Type "quit" to exit and save.

You: My name is Alex and I'm building a RAG pipeline in Python.
Agent [312 tok]: Great to meet you, Alex! Happy to help with your RAG pipeline...
You: What are the main chunking strategies I should consider?
Agent [498 tok]: The three main strategies are fixed-size, sentence-boundary, and semantic...
You: quit
[memory] episode saved → episodes.json
```
```
=== Memory Agent ===
Loaded 1 past episode(s).
Type your message. Type "quit" to exit and save.

You: What was I working on last time?
Agent [441 tok]: Based on our previous session, you're Alex and you were building a RAG pipeline in Python. You asked about chunking strategies — we covered fixed-size, sentence-boundary, and semantic chunking. Would you like to continue?
```
Keep the conversation going past MAX_TURNS pairs and you will see the [memory] context trimmed message appear. Then run a second session and ask "what was I working on last time?" — the agent retrieves the episode and answers without you telling it anything.
Replace trim_messages with a version that summarises the dropped turns using a second LLM call before discarding them. The summary is inserted as a system note. Compare the quality of memory retention between the two approaches over a long conversation.
```python
def trim_messages_with_summary(messages: list[dict]) -> list[dict]:
    """Summarise old turns before trimming them."""
    max_messages = MAX_TURNS * 2
    if len(messages) <= max_messages:
        return messages

    old_turns = messages[:-max_messages]
    keep_turns = messages[-max_messages:]

    # Summarise the turns being dropped
    summary_prompt = (
        "Summarise these conversation turns in 2-3 sentences, "
        "preserving key facts, decisions, and user preferences:\n\n"
        + "\n".join(
            f"{m['role'].upper()}: "
            f"{m['content'] if isinstance(m['content'], str) else str(m['content'])}"
            for m in old_turns
        )
    )
    summary_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    summary = summary_resp.content[0].text

    notice = {
        "role": "user",
        "content": f"[Earlier turns summarised]: {summary}",
    }
    print(f"[memory] summarised {len(old_turns)} old turns")
    return [notice] + keep_turns
```