I built a memory store to understand how AI memory actually works
I kept reading about “vector databases” and “semantic search” and “RAG” without really getting it. I understood the words. I didn’t understand the thing.
So I built a minimal version from scratch. A small API that stores memories, embeds them, and retrieves the most relevant ones for a query. The whole thing is Fastify, Postgres, and a Hugging Face model. Nothing exotic.
Here’s what I learned.
The key idea that unlocked everything
When you type “what animals does Sergio like” and get back “Sergio prefers dogs over cats,” no keyword matched. The word “animals” never appears in the stored text.
What happened is that both phrases were converted into vectors — lists of ~384 numbers — where the distance between them in that high-dimensional space corresponds to semantic similarity. The model learned, from training on a huge corpus of text, that “animals” and “dogs over cats” belong in the same neighborhood.
That’s it. That’s the whole idea. Everything else — the database, the index, the search API — is infrastructure around that one insight.
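To make the "same neighborhood" idea concrete, here is a minimal sketch using invented 3-dimensional vectors in place of real embeddings. The numbers, the wording of the TypeScript memory, and the names `toyMemories` and `nearestMemory` are all made up for illustration; a real model produces the vectors, and they have hundreds of dimensions.

```typescript
// Toy 3-dimensional "embeddings". These numbers are invented for illustration;
// a real embedding model would produce them, with hundreds of dimensions.
type ToyMemory = { text: string; vector: number[] };

const toyMemories: ToyMemory[] = [
  { text: "Sergio prefers dogs over cats", vector: [0.9, 0.1, 0.2] },
  { text: "Sergio writes TypeScript at work", vector: [0.1, 0.9, 0.3] },
];

// Invented vector standing in for the query "what animals does Sergio like".
const toyQueryVector = [0.8, 0.2, 0.1];

// Straight-line (Euclidean) distance between two vectors of equal length.
function euclideanDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Return the stored memory whose vector is closest to the query vector.
function nearestMemory(queryVector: number[], memories: ToyMemory[]): ToyMemory {
  return memories.reduce((best, candidate) =>
    euclideanDistance(queryVector, candidate.vector) <
    euclideanDistance(queryVector, best.vector)
      ? candidate
      : best
  );
}

console.log(nearestMemory(toyQueryVector, toyMemories).text);
```

No string comparison happens anywhere in this lookup. The match is decided entirely by which stored vector sits closest to the query vector.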
What actually breaks in production (small-scale version)
I hit a bug immediately. Searches returned zero results even when relevant memories existed.
The culprit was the vector index. I was using IVFFlat, which clusters vectors into buckets (pgvector calls them lists) when the index is built and, at query time, searches only a subset of those buckets. I had configured 100 buckets for a table with 4 rows, so nearly every bucket was empty. With the default of one probe, pgvector searched a single bucket, found nothing in it, and returned an empty result.
The fix was straightforward — probe all buckets — but the bug itself was clarifying. It made me understand something I’d glossed over in the docs: approximate nearest-neighbor search isn’t magic. It’s a deliberate tradeoff between recall and speed, and you have to tune it for your data size.
At scale, with millions of vectors, you want approximate search because exact search is too slow. At small scale, the approximation can bite you.
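The failure mode can be reproduced without a database. The sketch below is a toy simulation of bucket-based search, not pgvector's implementation: the centroids, rows, and function names (`buildBuckets`, `ivfSearch`) are invented, and the centroids are fixed by hand rather than learned by clustering. In pgvector itself the number of buckets searched is controlled by the `ivfflat.probes` setting.

```typescript
type Vec = number[];

function squaredDistance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Indices of the `count` centroids closest to `v`, nearest first.
function closestCentroids(v: Vec, centroids: Vec[], count: number): number[] {
  return centroids
    .map((centroid, index) => ({ index, dist: squaredDistance(v, centroid) }))
    .sort((x, y) => x.dist - y.dist)
    .slice(0, count)
    .map((entry) => entry.index);
}

// Build step: assign every stored vector to the bucket of its nearest centroid.
function buildBuckets(rows: Vec[], centroids: Vec[]): Vec[][] {
  const buckets: Vec[][] = centroids.map(() => []);
  for (const row of rows) {
    const [nearest] = closestCentroids(row, centroids, 1);
    buckets[nearest].push(row);
  }
  return buckets;
}

// Query step: look only inside the `probes` buckets nearest to the query,
// then rank whatever was found there by distance to the query.
function ivfSearch(queryVec: Vec, centroids: Vec[], buckets: Vec[][], probes: number): Vec[] {
  const candidates: Vec[] = [];
  for (const bucketIndex of closestCentroids(queryVec, centroids, probes)) {
    candidates.push(...buckets[bucketIndex]);
  }
  return candidates.sort(
    (a, b) => squaredDistance(queryVec, a) - squaredDistance(queryVec, b)
  );
}

// Ten invented centroids on a line and four invented rows: most buckets stay empty.
const centroids: Vec[] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].map((x) => [x, 0]);
const rows: Vec[] = [[2.4, 0], [2.3, 0.1], [7.2, 0], [5.1, 0.2]];
const buckets = buildBuckets(rows, centroids);

// The query is nearest to centroid 3, but its closest rows were filed under centroid 2.
const queryVec: Vec = [2.6, 0];

console.log(ivfSearch(queryVec, centroids, buckets, 1).length); // one probe: 0 results
console.log(ivfSearch(queryVec, centroids, buckets, centroids.length).length); // all buckets: 4 results
```

Probing every bucket makes the search exact again, which costs nothing with four rows. With millions of rows that same setting gives up the speed the index was supposed to buy.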
The shape of a production system
What I built is a toy. But the toy has the same skeleton as production memory systems — including the one you’re using right now if you have ChatGPT memory or Claude Projects turned on.
The pipeline is:
- Store: take some text, embed it into a vector, persist both
- Retrieve: take a query, embed it the same way, find the closest stored vectors by cosine similarity
- Inject: pass the retrieved text to the model as context
Retrieve and inject together are what “RAG” (Retrieval-Augmented Generation) refers to. Instead of fine-tuning a model on your data, you retrieve the relevant bits at query time and hand them to the model in the prompt. It’s cheaper, easier to update, and easier to debug.
The insight is that the retrieval is doing a lot of the work. A bad retrieval step produces irrelevant context, and no amount of model quality recovers from that.
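The three steps of the pipeline can be sketched in a few functions. This is an in-memory illustration under stated assumptions, not the Fastify and Postgres code from the project: `storeMemory`, `retrieveMemories`, and `buildPrompt` are hypothetical names, the array stands in for the database, and `lookupEmbed` is a lookup table of invented vectors standing in for the model call (which in a real system would be asynchronous).

```typescript
type Embedder = (text: string) => number[];
type StoredMemory = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Store: embed the text and persist the text and its vector together.
function storeMemory(db: StoredMemory[], embed: Embedder, text: string): void {
  db.push({ text, embedding: embed(text) });
}

// Retrieve: embed the query the same way, then rank stored memories by cosine similarity.
function retrieveMemories(
  db: StoredMemory[],
  embed: Embedder,
  queryText: string,
  limit: number
): string[] {
  const queryEmbedding = embed(queryText);
  return db
    .map((memory) => ({ text: memory.text, score: cosine(queryEmbedding, memory.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((scored) => scored.text);
}

// Inject: place the retrieved text in the prompt as context for the model.
function buildPrompt(question: string, context: string[]): string {
  return [
    "Relevant memories:",
    ...context.map((line) => `- ${line}`),
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Stand-in embedder: invented vectors, not real model output.
const inventedVectors: Record<string, number[]> = {
  "Sergio prefers dogs over cats": [0.9, 0.1, 0.2],
  "Sergio writes TypeScript at work": [0.1, 0.9, 0.3],
  "what animals does Sergio like": [0.8, 0.2, 0.1],
};
const lookupEmbed: Embedder = (text) => {
  const vector = inventedVectors[text];
  if (!vector) throw new Error(`no invented vector for: ${text}`);
  return vector;
};

const memoryDb: StoredMemory[] = [];
storeMemory(memoryDb, lookupEmbed, "Sergio prefers dogs over cats");
storeMemory(memoryDb, lookupEmbed, "Sergio writes TypeScript at work");

const question = "what animals does Sergio like";
const finalPrompt = buildPrompt(question, retrieveMemories(memoryDb, lookupEmbed, question, 1));
console.log(finalPrompt);
```

Everything the model sees about Sergio is whatever `retrieveMemories` returned. If that step ranks the wrong memory first, the prompt carries the wrong context and the model has nothing else to go on.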
What the score actually means
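Every search result comes back with a cosine similarity score. Mathematically the score ranges from -1 to 1, though text embeddings rarely score below 0 against each other, so in practice you read it as a number between 0 and 1. “Sergio prefers dogs over cats” scored 0.72 on the query “what animals does Sergio like.” The TypeScript memory scored 0.33 on the same query.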
Cosine similarity measures the angle between two vectors, not their distance. Two vectors pointing in the same direction score 1.0, which the model treats as identical meaning. Perpendicular vectors score 0.0, meaning unrelated. Vectors pointing in opposite directions score -1.0. It’s a clean measure of “does this text mean the same kind of thing as the query.”
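The formula behind the score is short: the dot product of the two vectors divided by the product of their lengths. A minimal sketch with hand-picked 2-dimensional vectors (the function name `cosineSimilarity` and the inputs are illustrative, not taken from the project):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Depends only on the angle between the vectors, not on their lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

console.log(cosineSimilarity([1, 2], [2, 4]));  // same direction, different length: 1
console.log(cosineSimilarity([1, 0], [0, 1]));  // perpendicular: 0
console.log(cosineSimilarity([1, 0], [-1, 0])); // opposite direction: -1
```

The first call shows why this is an angle measure: `[2, 4]` is twice as long as `[1, 2]`, yet the score is still 1 because the two point the same way.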
The default threshold in my implementation is 0.3. That’s a judgment call. Lower it and you get more results with more noise. Raise it and you get fewer results with higher precision. The right value depends on how you’re using the system — a personal memory store tolerates more noise than a customer-facing product.
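Here is a small sketch of how that threshold plays out on the two scores quoted above. The `applyThreshold` helper is hypothetical, and whether the cutoff is inclusive (`>=`) is an assumption rather than something taken from the implementation.

```typescript
type ScoredResult = { text: string; score: number };

// Keep only results at or above the minimum score (inclusive cutoff is an assumption).
function applyThreshold(results: ScoredResult[], minScore: number): ScoredResult[] {
  return results.filter((result) => result.score >= minScore);
}

// The two scores reported for the query "what animals does Sergio like".
const scoredResults: ScoredResult[] = [
  { text: "Sergio prefers dogs over cats", score: 0.72 },
  { text: "(the TypeScript memory)", score: 0.33 },
];

console.log(applyThreshold(scoredResults, 0.3).length); // 2: both memories pass
console.log(applyThreshold(scoredResults, 0.5).length); // 1: only the dogs memory passes
```

At 0.3 the TypeScript memory squeaks through as noise. At 0.5 it is gone, but so is any genuinely relevant memory that happens to score in the 0.4 range.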
Why building this was worth it
I could have kept reading blog posts about vector databases. I’d have understood the vocabulary but not the texture — not the IVFFlat bug, not what a 0.7 score feels like vs a 0.3, not why the index configuration matters at different data scales.
The toy forced me to make decisions that the theory papers over. How many dimensions should the vector have? How do I handle metadata filtering alongside semantic search? What does a bad embedding actually look like at query time?
Those are the questions you can only answer by building something that breaks.
The primitives here — embed, store, retrieve — show up everywhere in the AI tooling space right now. Understanding them at this level makes the larger systems legible in a way that reading about them doesn’t.