Course  /  06 · Memory & Knowledge
SECTION 06 CORE LAB

Memory &
Knowledge

An LLM has no memory between API calls — every call starts fresh. The illusion of memory in your agents is entirely your responsibility as the builder. This section covers the four memory types every agent builder must understand, how each is implemented, when context windows overflow and what to do about it, and how semantic memory via vector stores extends what an agent can "know" far beyond its context window limit.

01 · THE FOUR MEMORY TYPES

A Taxonomy of Agent Memory

Lilian Weng's 2023 survey on LLM-powered agents identifies four distinct memory types that map to different implementation strategies. Understanding this taxonomy prevents the most common mistake: trying to solve all memory problems with a single approach.

📋
In-Context (Working Memory)
Information inside the current context window: the system prompt, conversation history, tool results, and injected documents. Fast and zero-latency — the model uses it directly. Bounded by the context window limit.
WIDELY USED (2022–2026)
🗄️
External Semantic Memory
A vector store (Chroma, Pinecone, Weaviate) that holds embedded documents. The agent queries it with a similarity search and retrieves the most relevant chunks to inject into context. Scales to millions of documents.
WIDELY USED (2023–2026)
📓
Episodic Memory
A log of past agent runs — what was asked, what tools were called, what the outcome was. Allows an agent to recall "what happened last time" and avoid repeating mistakes across sessions.
EMERGING (2023–2026)
⚙️
Procedural Memory
The agent's baked-in skills and behaviors — encoded in the model weights (via training/fine-tuning) or in the system prompt (instructions, tool schemas). Slowest to update but most durable.
WIDELY USED (2022–2026)
Analogy — human memory parallel: In-context = your working memory right now. Semantic = your long-term factual knowledge (can recall but slow). Episodic = autobiographical memory (what happened to you last Tuesday). Procedural = muscle memory / skills (how to ride a bike). Agents need all four for complex long-horizon tasks.
02 · IN-CONTEXT MEMORY MANAGEMENT

The Context Window Fills Up — Then What?

Every message appended to the conversation history grows the context window. In a long agent run — many tool calls, large API responses, extended reasoning — the context will eventually approach the model's limit. Unlike a database, there is no automatic overflow handling: you hit the limit and the API returns an error, or performance silently degrades.

There are three strategies for managing this, and the right choice depends on what information the agent actually needs to retain.

STRATEGY HOW IT WORKS TRADE-OFF
Sliding Window Keep only the last N turns. Drop the oldest messages when the window is full. Simple. Loses early context permanently — the agent forgets what was decided at the start.
Summarization Periodically compress older turns into a summary using an LLM call, then replace them with the summary. Retains the gist. Introduces an extra API call and summary-induced hallucination risk.
Selective Retention Mark messages as "must-keep" (critical decisions, user goals) vs. "evictable" (tool outputs, scratch reasoning). Only evict the latter. Best quality. Most complex to implement — requires structured metadata on each message.
Never silently truncate. If you drop messages from history without telling the model, it may contradict earlier decisions it can no longer see. At minimum, insert a system note: "[Note: earlier conversation has been summarized above]" so the model knows its history is compressed.
Prompt caching reduces the cost of long static prefixes. If your agent has a large, unchanging system prompt (tool schemas, background docs), use your provider's prompt caching feature to avoid re-processing it every call. This is separate from conversation history management — caching handles the static prefix; sliding window / summarization handle the growing history.
03 · SEMANTIC MEMORY — VECTOR STORES

Giving Your Agent Access to a Knowledge Base

When an agent needs to recall facts from a corpus too large to fit in context — internal docs, a product manual, a research library — you use a vector store. Documents are broken into chunks, each chunk is converted to a vector embedding (a list of numbers that encodes semantic meaning), and those embeddings are stored in a database optimised for similarity search.

At query time, the agent's question is embedded with the same model, and the store returns the chunks whose embeddings are closest in vector space — the semantically most relevant passages. These chunks are injected into the agent's context as retrieved knowledge. This is the foundation of Retrieval-Augmented Generation (RAG), covered in depth in Section 10.

// SEMANTIC MEMORY: INDEX + QUERY PIPELINE
INDEX TIME (one-time):
Documents → ChunkEmbedVector Store
QUERY TIME (per agent call):
Question → EmbedSimilarity Search → Top-K Chunks → Inject into Context
🧮
Embeddings
Dense vector representations of text — typically 768–3072 dimensions. Semantically similar texts have nearby vectors. Produced by a dedicated embedding model (separate from the chat LLM).
WIDELY USED (2022–2026)
🗃️
Vector Stores
Databases optimised for approximate nearest-neighbour search (ANN). Options range from local (ChromaDB, FAISS) to managed cloud services. Choice depends on scale, latency, and ops burden.
WIDELY USED (2022–2026)
✂️
Chunking Strategy
How you split documents before embedding matters as much as the embedding model. Fixed-size, sentence-boundary, and semantic chunking each have different recall/precision trade-offs.
DESIGN DECISION
Embedding model and vector store must match. Always embed queries with the same model used to embed the documents. Swapping embedding models invalidates the entire index — you must re-embed all documents. This is a production gotcha that has caused real outages.
04 · EPISODIC MEMORY

Remembering What Happened Across Sessions

In-context and semantic memory operate within a session or across a static knowledge base. Episodic memory is different: it stores a structured log of past agent runs — what the user asked, what the agent did, what the outcome was — and makes that log retrievable in future sessions.

The Generative Agents paper (Park et al., 2023) introduced a concrete implementation: agents store experiences as natural language strings, each tagged with recency, importance, and relevance scores. When the agent needs to recall past experience, it scores all memories and injects the top-scoring ones into context.

// MINIMAL EPISODIC MEMORY ENTRY (JSON)
{
  "id": "ep_2026-04-04_001",
  "timestamp": "2026-04-04T14:23:00Z",
  "user_goal": "Summarise Q1 sales report",
  "tools_used": ["read_file", "web_search"],
  "outcome": "success",
  "summary": "Read sales.csv, searched for industry benchmarks, returned 3-para summary.",
  "tokens_used": 4821
}
When episodic memory matters: A customer support agent that sees the same user repeatedly. A coding agent that should remember which libraries a project uses. A personal assistant that learns user preferences over time. Without episodic memory, every session starts cold and the agent makes the same avoidable mistakes.
05 · MEMORY DESIGN PRINCIPLES

Rules for Building Memory-Aware Agents

PRINCIPLE 01
Match memory type to access pattern
In-context for the current task's working state. Semantic store for stable knowledge. Episodic log for cross-session continuity. Don't use one for all three.
PRINCIPLE 02
Always track token growth
Log usage.input_tokens after every API call. Set a hard threshold (e.g., 80% of context limit) that triggers compression before you hit the wall mid-task.
PRINCIPLE 03
Compress with attribution
When summarising old context, always insert a marker like [Earlier turns summarised]. This prevents the model from confusing compressed history with current instructions.
PRINCIPLE 04
Store what you cannot reconstruct
Don't store raw tool outputs that can be re-fetched cheaply. Store decisions, outcomes, and user preferences — the things that would be expensive or impossible to reproduce.
PRINCIPLE 05
Retrieval quality determines output quality
Garbage in, garbage out applies to memory retrieval. If the wrong chunks are injected into context, the agent will reason from irrelevant or stale information. Validate retrieval separately from generation.
PRINCIPLE 06
Memory has a trust boundary
Content retrieved from external memory was not in the original system prompt. Treat it like user input — sanitise it, don't blindly trust it. A poisoned memory store is a prompt injection vector.
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources.

SourceTypeCoversRecency
Lilian Weng — LLM Powered Autonomous Agents Blog / Survey Four memory types taxonomy, in-context vs external memory June 2023
Park et al. — Generative Agents Academic paper Episodic memory, recency/importance/relevance scoring, memory streams 2023
Anthropic — Prompt Caching Official docs Caching static prefixes, cost reduction for long system prompts Maintained 2024–2026
ChromaDB — Official Docs Official docs Local vector store, collections, add/query API Maintained 2023–2026
Anthropic — Embeddings Guide Official docs Embedding models, similarity search, practical recommendations Maintained 2024–2026
LangChain — Vector Store Concepts Official docs Chunking strategies, ANN search, vector store integrations Maintained 2023–2026
HANDS-ON LAB

Build an Agent with Sliding-Window Memory and an Episodic Log

You will build a conversational agent that manages its own context window using a sliding window strategy, logs every session to a JSON episodic memory file, and loads that log at startup so it can reference past conversations. By the end you will be able to watch the context window fill and compress in real time.

🔬
Memory-Aware Conversational Agent
PYTHON · ~100 LINES · ANTHROPIC API KEY REQUIRED
1
Set up and create the file

Use the same agent-lab virtualenv from previous sections. Create a new file:

BASH
cd agent-lab && source .venv/bin/activate
touch memory_agent.py
2
Implement the episodic memory store

The episodic store reads and writes a JSON file. On startup the agent loads past sessions; on shutdown it appends the current session's summary. Simple but effective for single-user agents.

PYTHON — memory_agent.py
import os
import json
import anthropic
from datetime import datetime, timezone
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic()

MEMORY_FILE = "episodes.json"
MAX_TURNS   = 6   # sliding window: keep last N user+assistant pairs
TOKEN_WARN  = 8000 # log a warning when input tokens exceed this


def load_episodes() -> list[dict]:
    """Load past episodes from the JSON file, or return empty list."""
    if not os.path.exists(MEMORY_FILE):
        return []
    with open(MEMORY_FILE) as f:
        return json.load(f)


def save_episode(goal: str, turns: int, outcome: str) -> None:
    """Append the current session to the episodic memory file."""
    episodes = load_episodes()
    episodes.append({
        "id": f"ep_{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_goal": goal,
        "turns": turns,
        "outcome": outcome,
    })
    with open(MEMORY_FILE, "w") as f:
        json.dump(episodes, f, indent=2)
    print(f"[memory] episode saved → {MEMORY_FILE}")
3
Build the sliding-window message manager

This function trims the message list to the last MAX_TURNS user/assistant pairs whenever it grows too long. Notice the attribution marker inserted when trimming — this tells the model its history was cut.

PYTHON — memory_agent.py (continued)
def trim_messages(messages: list[dict]) -> list[dict]:
    """Keep only the last MAX_TURNS user+assistant pairs.
    Inserts a notice when history is trimmed so the model is aware."""
    # A "pair" is one user + one assistant message = 2 items
    max_messages = MAX_TURNS * 2
    if len(messages) <= max_messages:
        return messages

    trimmed = messages[-max_messages:]
    notice = {
        "role": "user",
        "content": (
            "[System note: earlier conversation turns were trimmed to fit "
            "the context window. Only the last few turns are shown.]"
        )
    }
    print(f"[memory] context trimmed: keeping last {MAX_TURNS} turns")
    return [notice] + trimmed
4
Build the system prompt with episodic context

At startup, the agent loads past episodes and includes a summary of them in the system prompt. This is a simple form of episodic memory injection — the model knows what happened before without re-playing the full conversation.

PYTHON — memory_agent.py (continued)
def build_system_prompt(episodes: list[dict]) -> str:
    """Build the system prompt, injecting past episode summaries."""
    base = (
        "You are a helpful assistant with memory of past conversations. "
        "You answer questions clearly and concisely."
    )
    if not episodes:
        return base

    # Inject the last 3 episodes as context
    recent = episodes[-3:]
    history_lines = [
        f"- [{ep['timestamp'][:10]}] Goal: {ep['user_goal']} "
        f"({ep['turns']} turns, {ep['outcome']})"
        for ep in recent
    ]
    history_block = "\n".join(history_lines)
    return (
        f"{base}\n\n"
        "Past sessions (most recent first):\n"
        f"{history_block}\n\n"
        "Use this context to give continuity to the user across sessions."
    )
5
Implement the main chat loop

The main loop handles user input, calls the API, trims context when needed, tracks token usage, and saves the episode on exit. Run it and have a multi-turn conversation — then quit, run it again, and watch the agent reference the previous session.

PYTHON — memory_agent.py (continued)
def run_chat() -> None:
    episodes = load_episodes()
    system   = build_system_prompt(episodes)
    messages: list[dict] = []
    turn_count = 0
    first_message = ""

    print("\n=== Memory Agent ===")
    print(f"Loaded {len(episodes)} past episode(s).")
    print('Type your message. Type "quit" to exit and save.\n')

    while True:
        user_input = input("You: ").strip()
        if not user_input:
            continue
        if user_input.lower() == "quit":
            break

        if not first_message:
            first_message = user_input

        messages.append({"role": "user", "content": user_input})
        messages = trim_messages(messages)

        response = client.messages.create(
            model="claude-opus-4-6",  # check docs.anthropic.com/en/docs/about-claude/models
            max_tokens=512,
            system=system,
            messages=messages
        )

        input_tokens = response.usage.input_tokens
        if input_tokens > TOKEN_WARN:
            print(f"[memory] ⚠ high token usage: {input_tokens} input tokens")

        reply = response.content[0].text
        messages.append({"role": "assistant", "content": reply})
        turn_count += 1

        print(f"Agent [{input_tokens} tok]: {reply}\n")

    # Save episode on exit
    if turn_count > 0:
        save_episode(
            goal=first_message,
            turns=turn_count,
            outcome="completed"
        )


if __name__ == "__main__":
    run_chat()
6
Run two sessions and observe cross-session memory
BASH
python memory_agent.py
SESSION 1 — EXAMPLE
=== Memory Agent ===
Loaded 0 past episode(s).
Type your message. Type "quit" to exit and save.

You: My name is Alex and I'm building a RAG pipeline in Python.
Agent [312 tok]: Great to meet you, Alex! Happy to help with your RAG pipeline...

You: What are the main chunking strategies I should consider?
Agent [498 tok]: The three main strategies are fixed-size, sentence-boundary, and semantic...

You: quit
[memory] episode saved → episodes.json
SESSION 2 — EXAMPLE (run again)
=== Memory Agent ===
Loaded 1 past episode(s).
Type your message. Type "quit" to exit and save.

You: What was I working on last time?
Agent [441 tok]: Based on our previous session, you're Alex and you were building
a RAG pipeline in Python. You asked about chunking strategies — we covered
fixed-size, sentence-boundary, and semantic chunking. Would you like to continue?
What to observe: The token count printed next to each response grows as the conversation history accumulates. After 6 turns, watch the [memory] context trimmed message appear. Then run a second session and ask "what was I working on last time?" — the agent retrieves the episode and answers without you telling it anything.
7
Extension: replace sliding window with summarisation

Replace trim_messages with a version that summarises the dropped turns using a second LLM call before discarding them. The summary is inserted as a system note. Compare the quality of memory retention between the two approaches over a long conversation.

PYTHON — summarisation trim (replace trim_messages)
def trim_messages_with_summary(messages: list[dict]) -> list[dict]:
    """Summarise old turns before trimming them."""
    max_messages = MAX_TURNS * 2
    if len(messages) <= max_messages:
        return messages

    old_turns  = messages[:-max_messages]
    keep_turns = messages[-max_messages:]

    # Summarise the turns being dropped
    summary_prompt = (
        "Summarise these conversation turns in 2-3 sentences, "
        "preserving key facts, decisions, and user preferences:\n\n"
        + "\n".join(
            f"{m['role'].upper()}: {m['content'] if isinstance(m['content'], str) else str(m['content'])}"
            for m in old_turns
        )
    )
    summary_resp = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    summary = summary_resp.content[0].text

    notice = {
        "role": "user",
        "content": f"[Earlier turns summarised]: {summary}"
    }
    print(f"[memory] summarised {len(old_turns)} old turns")
    return [notice] + keep_turns

Finished the theory and completed the lab? Mark this section complete to track your progress.

Last updated: