Giving Your Agent Memory: A Minimal Implementation

"Memory" is one word for four different problems your agent has. The conversation buffer, summarization, episodic recall, and semantic retrieval, plus a small key-value store for preferences, each built from scratch in raw Python with no framework, and a decision guide for which one you actually need.

Muhammad Arbab · 14 years shipping AI

· 12 min read · Beginner

“Memory” is one word for four different problems an agent has. The conversation buffer holds the current turn. Summarization compresses a task that ran too long. Episodic memory recalls what happened last week. Semantic memory retrieves a fact from a corpus too large to keep in context. Treating them as one feature is the most common mistake in early agent code. Treating them as four small, named stores with four clear purposes is the production answer.

This post builds each of the four in raw Python with no framework, wires them together in a minimal example, and gives you the decision guide for which memory to add when. If you have not built the underlying loop yet, start with Build an AI Agent From Scratch in Python and The Agent Loop, Explained. This piece picks up where those leave off and answers the next question every reader hits: how do I make the agent remember anything?

For the broader architectural framing, see What Counts as Agentic AI, and What Does Not. The memory chapter of Understanding Agentic AI Systems walks through the same four stores with a single running example end to end.

The four memories your agent actually needs

Most “memory” tutorials show one box labeled memory and put a vector database inside it. That is a category error. There is no single store; there are four, and they have different jobs.

Memory | Lifetime | What it holds | Storage
Conversation buffer | This turn | The full message list of the current task | A Python list
Summary | This task | A one-paragraph compression of older turns | A string in state
Episodic | This user | What happened in past sessions | SQLite or JSON file
Semantic | The corpus | Searchable facts that do not fit in context | Vector store
Key-value (preferences) | This user | Durable facts: name, timezone, preferences | Key-value store

Notice that four of the five rows above do not need a vector database. The conversation buffer is a list. The summary is a string. The episodic log is a SQLite table or a JSON file. Preferences are a dict. A vector store earns its cost only when you have many items, none of which fit in context, and you need fuzzy retrieval over them. That is a real problem when you have it; it is not the first problem you have.

The default progression: start with the buffer alone. Add summarization when conversations get long. Add a preferences store when you find yourself reminding the agent of the same fact every session. Add episodic logs when “what did we discuss last week” becomes a real user request. Add a vector store when you have a corpus large enough to need search. Each step is driven by a failure of the previous step, not by a checklist.

Short-term: the conversation buffer

This is the memory that already exists in every tutorial you have read. A list of message dicts that grows with each turn:

    messages = [
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": "What is the weather in Karachi?"},
        # ... model and tool messages appended each turn
    ]

That is the entire short-term memory for 80% of agents on day one. The model sees the full list on every call, so anything written into it persists for the rest of the task. When the task ends, the list is discarded.

Two things to understand about the buffer:

It is the highest-bandwidth memory you have. The model sees every byte on every turn. That makes it the right place for the current goal, the most recent tool result, and the instructions the agent needs to follow. It is the wrong place for things that do not change (those go in the system prompt, written once, not appended each turn) and for things that grow without bound (those need summarization, below).

It has a hard token limit. A 32k context model has roughly 32,000 tokens of room for the system prompt, the buffer, and the response combined. A tool that returns 5,000 tokens of JSON eats one-sixth of your room on every call. Capping tool outputs and being deliberate about what re-enters the buffer is the cheapest performance fix in any agent.

For the deeper version of how tool results re-enter the buffer, see Tool Calling From First Principles.
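Capping tool outputs, as described above, can be as small as one function. The sketch below is illustrative only: cap_tool_output and its max_chars default are hypothetical names and values, and it counts characters rather than tokens to stay dependency-free.

```python
def cap_tool_output(text: str, max_chars: int = 2_000) -> str:
    """Truncate a tool result so it cannot flood the conversation buffer.

    max_chars is an illustrative default, not a recommendation; characters
    stand in for tokens here to keep the sketch dependency-free.
    """
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    # Tell the model that data was cut so it can ask for a narrower query.
    return f"{text[:max_chars]}\n[truncated: {omitted} more characters omitted]"


short = cap_tool_output("ok")
long = cap_tool_output("x" * 5_000, max_chars=100)
```

Applying the cap at the point where a tool result is appended to the buffer keeps the limit in one place instead of in every tool.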

Summarization: when context starts to fill

Once a task crosses a token threshold (say, 60% of the model’s context window), you have to shrink the buffer or risk hitting the wall. The pattern is simple: replace the oldest N turns with a single summary message.

    def maybe_summarize(messages, client, threshold_tokens=20_000, keep_recent=6):
        # estimate_tokens is a helper you supply (a tokenizer count or a rough heuristic).
        if estimate_tokens(messages) < threshold_tokens:
            return messages
        # Keep the system prompt and the N most recent messages verbatim.
        # If the cut would separate a tool result from the assistant message
        # that requested it, adjust keep_recent so the pair stays together.
        system = messages[0]
        recent = messages[-keep_recent:]
        older = messages[1:-keep_recent]
        if not older:
            return messages
        summary = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Summarize this conversation in 3 sentences. Preserve goals, decisions, and any facts the agent must remember."},
                *older,
            ],
        ).choices[0].message.content
        return [
            system,
            {"role": "system", "content": f"Summary of earlier conversation: {summary}"},
            *recent,
        ]

A few things to notice:

  • Run this between turns, not in the middle of one. Summarizing mid-step can drop a tool result the model was about to use.
  • Keep the system prompt and the recent turns verbatim. Summarizing the system prompt is how agents forget their bounds. Summarizing the most recent turns is how they lose continuity.
  • Use a smaller, cheaper model for the summary. The summarization step is high-volume and not the hard part of the task. A mini-class model is fine here even if your main agent uses a larger one.

Summarization buys you 5x to 10x more turns before the wall. It does not solve the underlying problem (the agent is producing or consuming too much context), so if you find yourself summarizing every few turns, the answer is usually to look at which tool is returning too much, not to summarize harder.
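The maybe_summarize function above calls an estimate_tokens helper that the post does not define. One rough way to fill it in, assuming the common rule of thumb of about four characters per token for English text, is sketched below. The ratio and the per-message overhead are assumptions, not exact figures; swap in a real tokenizer when accuracy matters.

```python
def estimate_tokens(messages) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    total_chars = 0
    for m in messages:
        # Messages may be plain dicts or SDK message objects.
        content = m.get("content") if isinstance(m, dict) else getattr(m, "content", None)
        # Tool-calling assistant messages can have no text content at all.
        total_chars += len(content) if isinstance(content, str) else 0
    # Small per-message allowance for role and formatting overhead (an assumption).
    return total_chars // 4 + 4 * len(messages)


example = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "a" * 400},
]
approx = estimate_tokens(example)
```

An overestimate is the safer error here: it triggers summarization a little early instead of hitting the context limit.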

Episodic memory: “what did we discuss last week”

When users come back across sessions, the conversation buffer is gone. To remember anything across sessions, you need a store that outlives the run.

For the first few weeks, a SQLite table is the right answer:

    import sqlite3
    from datetime import datetime, timezone

    db = sqlite3.connect("agent.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS episodes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id TEXT NOT NULL,
            created_at TEXT NOT NULL,
            summary TEXT NOT NULL
        )
    """)

    def save_episode(user_id: str, summary: str) -> None:
        # Timezone-aware UTC timestamp; ISO 8601 strings sort chronologically.
        db.execute(
            "INSERT INTO episodes (user_id, created_at, summary) VALUES (?, ?, ?)",
            (user_id, datetime.now(timezone.utc).isoformat(), summary),
        )
        db.commit()

    def recall_episodes(user_id: str, n: int = 5) -> list[str]:
        # Most recent sessions first, always scoped to one user.
        rows = db.execute(
            "SELECT summary FROM episodes WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
            (user_id, n),
        ).fetchall()
        return [row[0] for row in rows]

Expose these as tools the agent can call, not as state the system writes silently:

    TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "recall_recent_sessions",
                "description": "Look up summaries of the user's last few sessions. Call this when the user references something from a past conversation.",
                "parameters": {
                    "type": "object",
                    "properties": {"n": {"type": "integer", "description": "How many recent sessions, default 5."}},
                },
            },
        },
    ]

Two reasons explicit tools beat silent injection.

Auditability. Every memory read shows up in the trace as a tool call with arguments. When a user reports “the agent remembered something it should not have,” you can answer the question. With silent injection, you cannot.

Cost discipline. If recall is automatic, it runs every turn and bloats context. If recall is a tool, the model calls it only when it needs to.

End-of-session, write a summary into episodes and call it done. That gets you the first useful version of “the agent remembers me” without any vectors.
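The tool schema above still needs a dispatcher that turns the model's call into a database read. The wiring example later in the post calls a run_tool helper without showing it, so here is one possible shape, under stated assumptions: the call object has the function.name and function.arguments fields that the OpenAI Python SDK returns for tool calls, the cap of 10 sessions is an arbitrary illustrative limit, and an in-memory database with made-up rows stands in for agent.db.

```python
import json
import sqlite3
from types import SimpleNamespace

db = sqlite3.connect(":memory:")  # in-memory database for the sketch
db.execute(
    "CREATE TABLE episodes (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "user_id TEXT NOT NULL, created_at TEXT NOT NULL, summary TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO episodes (user_id, created_at, summary) VALUES (?, ?, ?)",
    [
        ("u1", "2025-01-01T10:00:00", "Discussed the new-hire flow."),
        ("u1", "2025-01-08T10:00:00", "Agreed to shorten weekly meetings."),
        ("u2", "2025-01-09T10:00:00", "Asked about refunds."),
    ],
)


def recall_episodes(user_id: str, n: int = 5) -> list[str]:
    rows = db.execute(
        "SELECT summary FROM episodes WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, n),
    ).fetchall()
    return [row[0] for row in rows]


def run_tool(call, user_id: str) -> str:
    """Dispatch one tool call. user_id comes from the session, never the model."""
    args = json.loads(call.function.arguments or "{}")
    if call.function.name == "recall_recent_sessions":
        # Clamp n to 1..10; the upper bound is an illustrative choice.
        n = max(1, min(int(args.get("n", 5)), 10))
        return json.dumps(recall_episodes(user_id, n))
    return json.dumps({"error": f"unknown tool: {call.function.name}"})


# A stand-in for the SDK's tool-call object, for demonstration only.
fake_call = SimpleNamespace(
    function=SimpleNamespace(name="recall_recent_sessions", arguments='{"n": 2}')
)
result = json.loads(run_tool(fake_call, "u1"))
```

Note that user_id is passed in by the host code rather than read from the model's arguments, so the model cannot ask for another user's sessions.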

Semantic memory: vector retrieval

You need a vector store when keyword search stops finding the right item. Symptoms: the user asks “the thing we discussed about onboarding,” there are 4,000 prior summaries, none contains the literal word “onboarding,” and the right one says “the new-hire flow.” Vector retrieval is what turns “onboarding” into “new-hire flow.”

The smallest useful version is a single table with an embedding column:

    from openai import OpenAI

    client = OpenAI()

    def embed(text: str) -> list[float]:
        return client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        ).data[0].embedding

    # facts_table and top_k_by_cosine below are placeholders for your own
    # storage layer; they are not defined in this snippet.
    def remember_fact(user_id: str, fact: str) -> None:
        vec = embed(fact)
        # In a real system, store vec in pgvector / Qdrant / etc.
        # For the minimum viable version, a numpy array on disk works.
        facts_table.insert({"user_id": user_id, "fact": fact, "embedding": vec})

    def recall_relevant(user_id: str, query: str, k: int = 3) -> list[str]:
        qvec = embed(query)
        # Cosine similarity over the user's facts; return top k.
        return top_k_by_cosine(facts_table, user_id, qvec, k)

A few load-bearing decisions:

  • Embed the same thing you will search. If you embed full conversation summaries and the user queries them with short questions, the similarity will be noisy. A common fix is to embed both the summary and a generated “questions this summary answers” string, and search across both.
  • Scope by user. Almost every multi-user agent has had at least one near-miss where one user’s data was retrieved for another. Scope every query by user_id at the storage layer, not in the agent prompt.
  • The fact-extraction step is what makes vector memory useful. Storing whole conversations as one vector each is a recipe for fuzzy recall. Extracting the durable facts as separate items (“the user lives in Karachi,” “the user works in fintech,” “the user dislikes long meetings”) and storing each as its own vector is what produces clean retrieval.

You will outgrow a numpy array on disk around 10,000 items or when concurrent writes start to matter. Postgres with pgvector is the boring, correct upgrade. Reach for a managed vector DB only when you have a measurement that says you need it.
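The top_k_by_cosine helper used above is left undefined in the post. A minimal pure-Python sketch follows, assuming facts_table is simply a list of dicts with user_id, fact, and embedding keys (that shape is an assumption made here, not something the post specifies). The two-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(facts_table, user_id: str, qvec: list[float], k: int) -> list[str]:
    # Filter by user_id before scoring so another user's rows are never candidates.
    own = [row for row in facts_table if row["user_id"] == user_id]
    ranked = sorted(own, key=lambda row: cosine(row["embedding"], qvec), reverse=True)
    return [row["fact"] for row in ranked[:k]]


# Toy data: hand-written 2-d vectors in place of real embeddings.
facts_table = [
    {"user_id": "u1", "fact": "lives in Karachi", "embedding": [1.0, 0.0]},
    {"user_id": "u1", "fact": "works in fintech", "embedding": [0.0, 1.0]},
    {"user_id": "u2", "fact": "lives in London", "embedding": [1.0, 0.0]},
]
top = top_k_by_cosine(facts_table, "u1", [0.9, 0.1], k=1)
```

A linear scan like this is fine for small per-user fact lists; it is the part you replace when you move to pgvector.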

Key-value: durable preferences

A small store for durable user facts: name, timezone, role, preferences, “do not surface refund options under $5.” These are facts you want the agent to know on every turn without paying the embedding cost.

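    import json

    # Assumes a users table (id TEXT PRIMARY KEY, prefs TEXT) in the same
    # SQLite database as the episodes table.
    def get_prefs(user_id: str) -> dict:
        row = db.execute("SELECT prefs FROM users WHERE id = ?", (user_id,)).fetchone()
        return json.loads(row[0]) if row else {}

    def set_pref(user_id: str, key: str, value) -> None:
        # Read-modify-write of the whole prefs blob, stored as JSON text.
        prefs = get_prefs(user_id)
        prefs[key] = value
        db.execute(
            "INSERT OR REPLACE INTO users (id, prefs) VALUES (?, ?)",
            (user_id, json.dumps(prefs)),
        )
        db.commit()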

Two patterns work for surfacing these to the agent:

Inject on session start. The system prompt for each session is built dynamically and includes the prefs blob. Good when the prefs are small (under ~500 tokens) and apply to every turn.

Tool-on-demand. A lookup_user_pref tool the agent calls when it needs one. Good when prefs are large or sensitive.

Writes are almost always a separate tool the agent calls deliberately: set_user_pref(name, value). Silent preference writes are how you wake up to “the agent silently flipped this user’s setting based on a sentence in a support ticket.” Make every write a logged, intentional decision.

Wiring it together

Here is the minimal end-to-end shape, in about 30 lines, that uses every store described above:

    MAX_STEPS = 8  # illustrative cap on model/tool round trips per turn

    # run_tool, summarize_session, and extract_durable_facts are helpers
    # you supply; they are not shown here.
    def run_turn(user_id: str, user_text: str, session_state: dict) -> str:
        # 1. Build the buffer for this turn.
        prefs = get_prefs(user_id)
        system_prompt = f"You are a helpful agent. User prefs: {prefs}"
        messages = [
            {"role": "system", "content": system_prompt},
            *session_state.get("buffer", []),
            {"role": "user", "content": user_text},
        ]
        # 2. Maybe summarize before calling the model.
        messages = maybe_summarize(messages, client)
        # 3. The agent loop.
        for _ in range(MAX_STEPS):
            resp = client.chat.completions.create(
                model="gpt-4o-mini", messages=messages, tools=TOOLS,
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                break
            for call in msg.tool_calls:
                # recall_episodes, recall_relevant, set_user_pref, etc. live here.
                result = run_tool(call, user_id)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        # 4. Persist what survives the session (everything except the system prompt).
        session_state["buffer"] = messages[1:]
        # content is None if the loop hit MAX_STEPS while the model was still calling tools.
        return msg.content or ""

    def end_session(user_id: str, session_state: dict) -> None:
        summary = summarize_session(session_state["buffer"], client)
        save_episode(user_id, summary)
        for fact in extract_durable_facts(session_state["buffer"], client):
            remember_fact(user_id, fact)

That is the whole pattern. Beyond the buffer itself, each store is visible as one small function: get_prefs, maybe_summarize, save_episode, remember_fact. During a turn, the agent reaches three of the stores through tool calls (recall_episodes, recall_relevant, set_user_pref). The system writes episodic summaries and semantic facts at session boundaries.

Two things this shape gets right that most tutorials get wrong:

  • Reads are tools the agent decides to call. The model picks when to recall, with what query, and how much. The system does not silently dump prior sessions into context on every turn.
  • Writes are bounded events. End of session writes summaries and durable facts. Mid-session, the agent has to explicitly call set_user_pref to change state. There is no path for state to change without showing up in the trace.

Which memory do you actually need?

A decision tree, in the order you will hit each fork.

  1. Just starting? Buffer only. Stop here until you can name a specific failure.
  2. Conversations getting long? Add summarization. Set a token threshold; replace older turns with a one-paragraph summary.
  3. Users coming back across sessions and expecting continuity? Add an episodic store. SQLite is enough until you measure that it is not.
  4. The agent keeps re-asking for the same user facts? Add a preferences store. Small, durable, indexed by user_id.
  5. The agent fails to find a past item that exists but does not match keywords? Add semantic retrieval. Start with embeddings + an array; upgrade to pgvector when concurrency or volume forces it.

Most production agents in 2026 land at steps 1 to 4. Step 5 is real but is often premature; many teams reach for it before exhausting steps 2 to 4 and pay the operational cost without the benefit.

Production gotchas

A few traps to bound before they bite.

Silent retrieval blowing up context. If you auto-inject the top 10 vector-retrieved facts on every turn, you have re-built unbounded tool outputs through a different door. Cap the retrieval count; cap the per-item length; surface “you have more matches if you want” as a separate, explicit tool the agent can call again.
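The two caps described above (count and per-item length) fit in a few lines. A minimal sketch, with hypothetical names and illustrative defaults: bound_results, max_items, max_chars, and the more_available field are all inventions for this example.

```python
def bound_results(items: list[str], max_items: int = 3, max_chars: int = 300) -> dict:
    """Cap both the number of retrieved items and the length of each one."""
    shown = [
        s if len(s) <= max_chars else s[:max_chars] + "..."
        for s in items[:max_items]
    ]
    # Report how many matches were held back so the agent can ask again explicitly.
    return {"results": shown, "more_available": max(0, len(items) - max_items)}


bounded = bound_results(["a" * 500, "b", "c", "d", "e"], max_items=3, max_chars=10)
```

The more_available count is what lets the model decide whether a follow-up retrieval call is worth the context.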

Cross-user contamination. Every memory read and write must be scoped by user_id at the storage layer. Do not rely on the agent prompt to enforce this. The single most expensive memory bug class is “User A’s facts retrieved for User B’s session.”

Prompt-injection via stored memory. If a malicious user can get text into your memory store (a support ticket they wrote, a document they uploaded, a chat message that survives summarization), they can plant instructions that affect every future session. Treat stored memory the same as fresh tool data: untrusted, never modifies the system prompt, scoped narrowly.

Stale facts. Memory has no built-in TTL. The user moves from Karachi to London; the old fact “user lives in Karachi” still scores high on semantic similarity. Either add a recency weighting to retrieval or expose an update_fact tool the agent can call when it notices a contradiction.
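Recency weighting can be as simple as multiplying the similarity score by an exponential decay on the fact's age. The sketch below assumes a half-life form of decay; the 90-day half-life and the recency_weighted name are illustrative choices, not values from the post.

```python
from datetime import datetime, timedelta, timezone


def recency_weighted(similarity: float, stored_at: datetime, now: datetime,
                     half_life_days: float = 90.0) -> float:
    """Scale a similarity score so it halves every half_life_days of age."""
    age_days = max(0.0, (now - stored_at).total_seconds() / 86_400)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
# An older fact with a slightly higher raw similarity...
old_fact = recency_weighted(0.90, now - timedelta(days=365), now)
# ...versus a fact stored a week ago with a slightly lower one.
new_fact = recency_weighted(0.80, now - timedelta(days=7), now)
```

With these numbers the week-old fact outranks the year-old one despite the lower raw similarity, which is the behavior you want when the user has moved from Karachi to London.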

Memory writes during failed runs. If a tool errors halfway through a session, what gets written? The conservative default: write episodic memory only on clean session end. The middle ground: write the buffer to a “draft” episode that is finalized on clean end and discarded on hard failure. The aggressive default (write every turn) is rarely worth the operational cost of cleaning up bad rows.
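The "draft episode" middle ground can be sketched with a status column and a context manager. This is one possible shape under assumptions: the status column, the draft_episode name, and the in-memory database are all illustrative additions to the episodes table shown earlier.

```python
import sqlite3
from contextlib import contextmanager

db = sqlite3.connect(":memory:")  # in-memory database for the sketch
db.execute(
    "CREATE TABLE episodes (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "user_id TEXT NOT NULL, summary TEXT NOT NULL, status TEXT NOT NULL)"
)


@contextmanager
def draft_episode(user_id: str):
    """Insert a draft row; finalize it on clean exit, delete it on an exception."""
    cur = db.execute(
        "INSERT INTO episodes (user_id, summary, status) VALUES (?, '', 'draft')",
        (user_id,),
    )
    episode_id = cur.lastrowid
    db.commit()
    draft = {"summary": ""}
    try:
        yield draft
    except Exception:
        # Failure inside the session: discard the draft and re-raise.
        db.execute("DELETE FROM episodes WHERE id = ?", (episode_id,))
        db.commit()
        raise
    else:
        # Clean end: persist the summary and mark the row final.
        db.execute(
            "UPDATE episodes SET summary = ?, status = 'final' WHERE id = ?",
            (draft["summary"], episode_id),
        )
        db.commit()


with draft_episode("u1") as ep:
    ep["summary"] = "Booked the Karachi trip."

try:
    with draft_episode("u1") as ep:
        raise RuntimeError("tool failed mid-session")
except RuntimeError:
    pass

rows = db.execute("SELECT summary, status FROM episodes").fetchall()
```

If the process dies outright, the row is left with status 'draft', so recall queries should filter on status = 'final' and a periodic job can delete stale drafts.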

The takeaway

Memory is not one feature you add to an agent. It is four small stores, each solving a different problem, with different lifetimes and different shapes. Build the conversation buffer first; it is what most agents need for the first month. Add the other three only when a specific failure forces the upgrade. Expose memory reads and writes as tools the agent decides to call, not as state the system mutates silently, because the trace is the only honest debugger you will have.

The senior position: every memory write is a deliberate, logged event. Every memory read is a tool call with arguments. There is no “the system just remembers.” If there is, you have built something you cannot debug. And in agentic AI, debuggability is the difference between a demo and a system you can sleep next to.

Frequently asked

Quick answers

What is the simplest agent memory I should start with?

The conversation buffer: a Python list of message dicts that you pass to the model on every turn. That is the entire short-term memory most agents need for the first month. Vector stores, episodic logs, and key-value preference stores are answers to specific problems you will discover once the buffer alone stops working. Build the buffer first; let the failure modes tell you which memory to add next.

Do I need a vector database from day one?

No. Vector retrieval solves a specific problem: "find a relevant past fact across thousands of items that do not fit in the context window." If your agent runs short tasks against a small corpus, you do not have that problem yet. SQLite or a JSON file holds the first thousand items just fine. Add vector retrieval the day a keyword search stops finding the right item, not on principle.

Should memory be a tool the agent calls or something the system writes silently?

Writes should almost always be explicit tools the agent calls. Silent writes ("the system automatically saves user preferences") cause invisible state changes that are impossible to debug and easy to abuse for prompt-injection. Reads can be either: a tool the agent calls when it needs context (preferred for semantic and episodic memory) or auto-included on every turn (acceptable for very small, durable preference state). The rule: if you cannot point at a log line showing the write happened and why, the write is silent and you should fix that.

How do I prevent the context window from filling up on a long task?

Three layers, applied in order. First, cap every tool result at a few hundred tokens; unbounded tool outputs are the most common cause of context bloat. Second, summarize the conversation when it crosses a token threshold (replace old turns with a one-paragraph summary). Third, push durable facts out to long-term memory (vector or key-value) and retrieve them as tool calls, so they only re-enter the context when the agent asks for them.

Is "memory" one thing or many?

Many. The single word "memory" hides four distinct stores with four distinct lifetimes: the conversation buffer (this turn), the summary (this task), episodic logs (this user historically), and semantic retrieval (anything searchable). Treating them as one thing is the bug. Treating them as four tools with four schemas, four storage layers, and four interview-grade trade-offs is the production answer.
