Most AI Agent "Memory" Is a Text File. We Built a Brain Instead.

We shipped a new architecture for Kairos: hierarchical memory, parallel threads that fork and communicate, and a consolidation pipeline that turns conversations into long-term knowledge.

Samarth Patel · 8 min read

We just shipped a new architecture for Kairos. We're calling it the Brain.

Not because it sounds cool (though it does), but because it's the most accurate description of what it actually is: hierarchical memory, parallel threads that fork and communicate, and a consolidation pipeline that automatically compresses conversations into long-term knowledge.

This post is the engineering story: what we built, why we built it this way, and why we think the current generation of agent memory solutions is solving the wrong problem.

The 4KB Post-It Note

Before this, Kairos had a 4KB flat text blob for memory.

Every conversation was isolated. The agent couldn't connect what it learned yesterday to what you're asking today. No awareness of other active conversations. No ability to spin off background research while continuing to talk to you.

Every single interaction started from scratch with a tiny post-it note of context.

If you've used any AI agent that claims to "remember" things, this is probably what's happening under the hood. A bunch of text that gets appended to. Eventually it overflows the context window and either gets truncated or summarized into oblivion.

What Everyone Else Is Doing (And Why It's Not Enough)

We looked at what the market was offering for agent memory. The space has matured a lot, and there are some genuinely smart approaches. But they all make tradeoffs we weren't willing to accept.

Extracted fact stores with consolidation. The most common pattern. For every message, an LLM extracts facts and compares them against existing memories via vector similarity. The store can add, update, merge, or delete, so it stays curated rather than growing forever. Some implementations add a graph layer for entity-relationship modeling. These are real systems, not toys. But they operate at one level of abstraction: extracted facts. No temporal hierarchy, no compression across depths, no way to trace a stored fact back to the original conversation that produced it.

Agent-managed virtual memory. An elegant idea: let the LLM manage its own memory like an OS manages RAM and disk. The agent gets tools to read, write, search, and restructure memory across multiple tiers. The tradeoff is that memory quality becomes non-deterministic. It depends entirely on the LLM using those tools well. There's no guaranteed consolidation. And when memory goes wrong, it's hard to debug because the agent made real-time decisions about what to keep and what to discard.

Temporal knowledge graphs. The most architecturally ambitious approach. Build a dynamic knowledge graph with bi-temporal modeling (tracking both when something happened and when the system learned about it), entity resolution, and community detection. Genuinely impressive for temporal reasoning. The tradeoff is operational complexity: persistent graph databases, server infrastructure, and a heavy ingestion pipeline.

The common gap across all three: these systems solve memory retrieval, the problem of getting the right facts into the context window at the right time. That's important, and they do it well. But none of them address what we think is the other half of cognition: coordination. An agent that perfectly remembers everything but can only think about one thing at a time, in one conversation, is still fundamentally limited.

Our Approach: One Table, Four Tiers, Recursive Compression

Everything in the Kairos Brain is a memory_block. Messages, episode summaries, extracted knowledge, core identity. All blocks in one Postgres table, organized as a recursive tree.

Four tiers:

  • Messages: raw conversation content, summarized per-turn
  • Episodes: compressed conversation summaries (binary tree of merged message pairs)
  • Knowledge: cross-conversation facts and learnings extracted from episodes
  • Core Memory: permanent identity-level understanding ("user prefers TypeScript," "user lives in Toronto," "user books flights through Chase portal")

Each tier is the recursive compression of the level below. And here's the key insight: one consolidation algorithm works at every depth.

consolidate(children) → parent

Messages compress into episodes. Episodes compress into knowledge. Knowledge distills into core identity. Same function, different scale.

This means the system handles 10-message conversations and 10,000-message histories with the same algorithm, just at different tree depths.
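The recursion is simple enough to sketch in a few lines. Here `merge` is a stand-in for the LLM summarization call (the real pipeline uses Claude Haiku); the pairwise merging and odd-block carry are the shape of the idea, not the shipped implementation:

```python
def merge(a: str, b: str) -> str:
    # stand-in for the LLM summarization call
    return f"summary({a} + {b})"

def consolidate(children: list[str]) -> str:
    """Pairwise-merge each level of the tree until a single root remains."""
    while len(children) > 1:
        merged = [merge(children[i], children[i + 1])
                  for i in range(0, len(children) - 1, 2)]
        if len(children) % 2:           # odd block carries up unmerged
            merged.append(children[-1])
        children = merged
    return children[0]

root = consolidate(["m1", "m2", "m3", "m4"])
print(root)  # summary(summary(m1 + m2) + summary(m3 + m4))
```

The same loop runs whether `children` is four message summaries or a thousand: the tree just gets deeper.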

Every block gets a 1024-dimensional vector embedding for semantic search. And every core memory is fully traceable. You can walk the parent chain from "user prefers window seats" all the way back to the original conversation where they mentioned it.

Why one table instead of four? Single search index across all memory tiers. Single consolidation pipeline. Parent-chain traceability from core identity back to original messages. The LLM consuming this memory doesn't need different query patterns for different memory types. It's all blocks with types and levels.

The Consolidation Pipeline: How Conversations Become Knowledge

When a conversation ends, a background task automatically runs the consolidation pipeline:

100 messages
  → 50 summary pairs
    → 25 → 12 → 6 → 3 → 1 root episode
      → extract knowledge blocks (facts, preferences, learnings)
        → promote high-confidence knowledge to core memory

Every block gets embedded for semantic search. The whole pipeline uses Claude Haiku for summarization. Fast and cheap.

The numbers: ~200 Haiku calls for a 100-message conversation. Short conversations (~10 messages) cost ~10-20 calls. We're talking about cents per conversation for permanent, structured, searchable memory.

The pipeline runs every 60 seconds as a background task. No user-facing latency. Conversations feel normal. The learning happens asynchronously after you're done.

Over time, this creates a compounding effect. The agent doesn't just remember what you said. It synthesizes patterns across dozens of conversations into increasingly refined understanding. Early conversations might generate broad knowledge blocks. After 50 conversations, the core memory is sharp, specific, and deeply personalized.
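The last step of the pipeline, promoting knowledge to core memory, could look something like this sketch. The confidence scoring and the 0.8 threshold are illustrative assumptions, not the shipped values:

```python
def promote_to_core(knowledge: list[dict], threshold: float = 0.8) -> list[str]:
    """Promote only high-confidence knowledge blocks to permanent core memory."""
    return [k["fact"] for k in knowledge if k["confidence"] >= threshold]

knowledge = [
    {"fact": "user prefers TypeScript", "confidence": 0.95},
    {"fact": "user might be moving soon", "confidence": 0.4},
]
print(promote_to_core(knowledge))  # only the high-confidence fact survives
```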

Parallel Thinking: Threads That Fork and Communicate

Memory is half the architecture. The other half is the thread system.

Every conversation in Kairos is now a "brain thread." A unit of work with its own lifecycle, priority, and relationships to other threads.

The fork. The agent can spin off sub-threads. Two modes:

  • Blocking: "Let me research that for you." Parent conversation pauses. Fork runs in its own sandbox. Result comes back as a tool response. Clean and synchronous from the user's perspective.
  • Non-blocking: "I'll look into that in the background." Parent continues the conversation immediately. Fork runs independently. When it completes, the result gets injected into the parent as a system notification.

Inter-thread communication. This is where it gets interesting. All active threads for a user share a Redis pub/sub channel. Threads can broadcast messages to each other in real-time.

Practical example: you ask the agent to plan a Tokyo trip. It forks three threads for flights, hotels, and restaurants. The flights thread finds a good deal on ANA departing Thursday. It broadcasts this to all active threads. The hotels thread picks it up and adjusts its search to match the Thursday arrival. The restaurants thread now knows which neighborhood you'll be in first.

No thread is an island. They're aware of each other, they share context through both broadcasts and a shared working memory (Redis key-value store, 24h TTL), and they coordinate without you having to manually relay information between them.
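The coordination pattern can be simulated in-process. The real system uses a per-user Redis pub/sub channel and a 24h-TTL key-value store; this sketch replaces both with plain Python objects to show the shape of the exchange:

```python
class WorkingMemory:
    """Stand-in for the shared Redis key-value store (24h TTL in production)."""
    def __init__(self):
        self.kv = {}

class Channel:
    """Stand-in for the per-user Redis pub/sub channel."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, handler):
        self.subscribers.append(handler)
    def broadcast(self, message: dict):
        for handler in self.subscribers:
            handler(message)

channel, memory = Channel(), WorkingMemory()

def hotels_thread(msg: dict):
    # hotels thread reacts to the flights thread's broadcast
    if msg.get("topic") == "flight_booked":
        memory.kv["hotel_checkin"] = msg["arrival_day"]

channel.subscribe(hotels_thread)

# flights thread finds a deal and broadcasts it to all active threads
channel.broadcast({"topic": "flight_booked",
                   "airline": "ANA", "arrival_day": "Thursday"})
print(memory.kv["hotel_checkin"])  # the hotels thread adjusted to Thursday
```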

Concurrency control: Up to 20 active threads per user. Excess gets queued with priority-based scheduling. Interactive conversations get priority 10. Forks get 8. Background tasks get 5. If you're at capacity and a high-priority request comes in, the lowest-priority thread gets suspended, its working memory serialized and saved for later resumption.

The "Always-Running" Brain (That Isn't Actually Running)

Here's the design decision I'm most proud of.

The Kairos Brain is not a process. It's a state machine distributed across three data stores (Redis for ephemeral state, Convex for thread lifecycle, Postgres for memory blocks), evaluated on every event.

When a message arrives:

  1. Read brain state: active threads, priorities, pending forks
  2. Read memory: core blocks + semantic search for relevant history
  3. Classify and route: existing thread or new one?
  4. Dispatch: spin up a sandbox with full brain context in the system prompt
  5. On completion: consolidate, update memory, notify other threads

There's no long-running brain service to crash, restart, or scale horizontally. No WebSocket connections to manage. No state synchronization problems. The "always-running" brain is an illusion created by evaluating distributed state on demand.

This matters because stateless architectures are dramatically simpler to operate. Every event is independently processed. If the orchestrator restarts, nothing is lost. All state lives in the data stores. If load spikes, you scale the stateless orchestrator horizontally.

The brain feels persistent and continuous to the user. Under the hood, it's just well-organized state being read and written on every interaction.

What This Actually Feels Like

Strip away the architecture and here's what changes for the user:

"Remember this" actually works. Not as a hack. Preferences, facts, and context persist across conversations and get more refined over time. The agent you talk to after 50 conversations is meaningfully different from the one you talked to on day one.

Complex tasks decompose naturally. "Research flights, hotels, and restaurants for my Tokyo trip" doesn't need to be a single sequential conversation. Three parallel threads, coordinating in real-time, sharing context as they go.

Context carries forward. Start a conversation about your trip, and the agent already knows your flight preferences from last week, your hotel budget from the month before, and that you're vegetarian. Without you repeating any of it.

The agent compounds. Every conversation feeds the consolidation pipeline. Every pipeline run refines the agent's understanding. Over weeks and months, the core memory becomes an increasingly accurate model of how you work, what you prefer, and what you need.

What We're Betting On

Our thesis is that the next wave of useful AI agents won't just be "better at answering questions." They'll be better at three things:

  1. Carrying context over time. Hierarchical memory that compresses and synthesizes, not just stores.
  2. Coordinating complex work. Parallel threads that fork, communicate, and share state.
  3. Learning how you operate. Continuous consolidation that turns conversations into lasting understanding.

That's the direction we're building Kairos toward. The Brain is the foundation.

If you're building agents and want to talk about memory architectures, or if you're running a business and want to try an agent that actually learns, reach out. DMs are open.

Your personal AI intern.

Talk to it like a person. It works like one too. Free to start.