The Closing Window
Meet Sift: A Knowledge Base for Everything That Isn't a Note
Photo by Rabie Madaci on Unsplash



TL;DR

Sift is a personal knowledge base I built over three months that ingests anything — URLs, PDFs, bookmarks, web pages, video and audio files — and makes it all searchable by meaning, not just keywords, with awareness of when things were saved. It runs on your own hardware, answers questions using your own saved material as sources, and sits alongside Obsidian rather than replacing it. I've open sourced it at github.com/pablooliva/sift. This post is the story of why it exists, what I learned building it, and what to expect if you clone it.


My Bookmarks Saved Me Money

A few weeks ago, I was designing a multimedia publishing pipeline for this site. I needed voice synthesis and video generation tools, and I had a rough budget in mind, something like $56/month across a few SaaS subscriptions.

Before committing, I searched my Sift knowledge base. It surfaced two ComfyUI integrations I'd bookmarked a week earlier: Qwen3-TTS for local voice cloning and LTX-2.3 for open-source video generation with portrait support. I'd saved them as simple URL bookmarks with short descriptions — the lowest-effort form of ingestion possible.

Those two results shifted the entire pipeline from a paid SaaS stack to a $0 local approach.

I didn't remember saving them. I wouldn't have found them with a keyword search. I wasn't searching for "ComfyUI" or "Qwen3-TTS." I was searching for what they do. Sift understood the meaning of my query and matched it to content I'd barely thought about since bookmarking it.

That's what a working external knowledge base actually does. It surfaces the right information at the right moment, even from content you forgot you saved — especially when that content comes from the sources you've learned to trust.


What Sift Actually Is

If you've followed this series — first post, second post — Sift is the next layer: everything that isn't a local text file.

At its simplest: Sift is a personal knowledge base that ingests documents of any type and makes them queryable. You can search it by meaning, or you can ask it a question and get a cited answer drawn from your own material.
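To make "search by meaning" concrete: under the hood, documents and queries are turned into embedding vectors, and results are ranked by vector similarity rather than keyword overlap. Here's a toy, stdlib-only sketch of that ranking step — the vectors are hand-made for illustration, whereas a real system like Sift gets them from a sentence-embedding model:

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, near 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made toy embeddings standing in for model output.
docs = {
    "Qwen3-TTS bookmark: local voice cloning node": [0.90, 0.10, 0.05],
    "Recipe: sourdough starter maintenance":        [0.05, 0.95, 0.10],
    "LTX video generation with portrait support":   [0.80, 0.05, 0.40],
}

# The query shares no keywords with the bookmark, only meaning.
query = [0.85, 0.12, 0.20]  # "synthesize speech on my own hardware"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the voice-cloning bookmark ranks first
```

The point is that ranking happens in vector space: a query about speech synthesis lands closest to the voice-cloning bookmark even though the two share no words.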

It runs on your own hardware, mostly. I run mine on a home server, but it works on any machine with Docker. I do use an LLM inference provider for some AI workflows, but your data stays yours.

It's not an Obsidian replacement. It's a companion. Obsidian holds my working notes — things I write, think through, and link together. Sift holds everything else — things I've encountered, saved, and want to be able to find later.


How It Got Built

Sift started with two questions.

The first was practical: what if I had a single bucket for everything that crosses my path, regardless of medium? A place where a URL, a PDF, and a text note all end up searchable in the same system?

The second was methodological: could I build a real project using spec-driven development (SDD) with an AI coding agent from start to finish? Not a quick prototype, but a properly engineered system with formal specifications, critical reviews, and documented decisions?

Three months and 549 commits later, the answer to both is yes, with caveats.

The Technology Choice

I chose txtai as the foundation because it was batteries-included: embeddings, vector search, and a web API in a single Python package with minimal dependencies. No separate vector database to configure, no extra services to run. For a proof of concept, that mattered.

What I discovered quickly was what "batteries-included" actually meant in practice. Some of the default models were outdated. Parts of the stack underperformed compared to standalone alternatives. The convenience that got me started became a constraint as the system matured.

This led to heavy customization. I swapped the vector backend to Qdrant for better performance and persistence. I integrated Graphiti and Neo4j for knowledge graph capabilities — understanding not just what I've saved, but how concepts relate to each other. I added Together.ai as the LLM backend for RAG generation, because local models weren't good enough for synthesizing answers from retrieved context.
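The RAG step that Together.ai handles is conceptually simple: retrieve the top-k chunks from the vector store, pack them into a prompt with source markers, and let the LLM synthesize an answer that cites them. A sketch of that prompt-assembly step — the function name and format here are illustrative, not Sift's actual prompt:

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a cited-answer prompt from retrieved chunks.

    Each chunk is expected to look like {"source": ..., "text": ...};
    the numbered [n] markers let the LLM cite sources in its answer.
    """
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How can I do voice cloning locally?",
    [{"source": "bookmark",
      "text": "Qwen3-TTS is a ComfyUI node for local voice cloning."}],
)
print(prompt)
```

Constraining the model to the retrieved sources is what keeps answers grounded in your own saved material rather than the model's general knowledge.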

In hindsight, I probably should have dropped txtai entirely at some point and built a custom solution from scratch. But there's a lesson in that decision too: sometimes the cost of starting over exceeds the cost of working around limitations, especially when the system is already delivering value. The ComfyUI story happened with these limitations in place.

The Engineering Process

The less obvious story is the development process itself. Sift has 45+ formal specification documents — requirements, research, critical reviews, implementation summaries. Every significant decision was written down before code was written. Every spec was critically reviewed, often by Claude Code acting as a second perspective on the design. A Claude Code plugin I'd developed kept this workflow running smoothly.

This sounds like overkill for a personal project, and maybe it is. But the SDD artifacts turned out to be valuable in ways I didn't expect. When I hit a bug six weeks after a design decision, the spec told me why I'd made that choice. When I wanted to swap a component, the research document told me what alternatives I'd already evaluated and rejected. The process documentation became its own form of knowledge management.


Enjoy Cautiously

I'm open sourcing this because I think it's useful, not because it's polished. Here's what you should know:

txtai has real limitations. Some models in the default stack are outdated, and if I were starting today, I'd probably build on a different foundation. Running the original txtai application consumed 20 GB of VRAM constantly, which was unacceptable and one of the biggest reasons I embarked on the refactoring, beyond adding and enhancing features. The customizations I've made work and integrate relatively cleanly, but they add complexity that wouldn't exist with a leaner architecture.

The knowledge graph is promising but expensive. Graphiti uses LLM calls to extract entities and relationships from documents: roughly 12-15 calls per chunk. That adds up fast. Importing all of my Obsidian notes into Sift's knowledge graph is something I want to do but haven't, because the API costs are non-trivial. The graph currently has 796 entities and only 19 relationships, which means 97.7% of nodes are isolated. It works, but it's not yet dense enough to surface the kind of connections a mature knowledge graph should.
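The cost concern is easy to quantify with back-of-the-envelope arithmetic. Taking the 12-15 extraction calls per chunk at face value, and assuming a hypothetical per-call price (the real number depends on your model and token counts):

```python
def graph_ingest_cost(chunks: int, calls_per_chunk: int, usd_per_call: float) -> float:
    # Total LLM spend for knowledge-graph extraction over one import.
    return chunks * calls_per_chunk * usd_per_call

# Importing 1,000 chunks at an assumed $0.002/call (not a quoted price):
low = graph_ingest_cost(1000, 12, 0.002)
high = graph_ingest_cost(1000, 15, 0.002)
print(f"${low:.2f} - ${high:.2f}")  # $24.00 - $30.00
```

Even at a fraction of a cent per call, a large note vault runs into tens of thousands of calls, which is why the bulk Obsidian import is still on the to-do list.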

It runs on a home server. The Docker setup needs a machine with reasonable specs. This isn't a lightweight app. I'll document the hardware requirements in the repo, but expect to need more than a Raspberry Pi.

It's a personal project with personal trade-offs. Some decisions optimize for my workflow specifically. Some code could be cleaner. The SDD specs document the reasoning, so you can judge for yourself whether my trade-offs match your needs.


Why Open Source

A few reasons, and they're more personal than strategic.

The KM series makes claims about building a production-grade system. Without a public repo, you have to take my word for it. With one, you can verify, clone, and run it yourself.

The engineering process — the 45+ specs, the critical reviews, the research documents — is as interesting as the code. I haven't seen another personal project on GitHub that ships with this level of process documentation. If the SDD approach resonates with you, the SDD/ directory is where to look.

And honestly, nothing quite like this exists in the ecosystem. There are RAG demos, there are txtai examples, there are knowledge graph tutorials. But this system combines txtai + Graphiti + Neo4j + Qdrant + MCP server + consumption by an AI agent. Don't all these technologies used together sound fascinating? If your answer is yes... then nerd recognize nerd.

The repo is live at github.com/pablooliva/sift.


What's Next

This post is the announcement. The technical deep-dive comes in the next post in the series, where I'll walk through Sift's architecture in detail. If you want to understand how it works before you clone it, that's the one to wait for.

After that, I'll cover how MCP and Claude Code tie Sift, CK Search, and Obsidian into a unified system that an AI agent can query across all of them simultaneously. That's where the individual tools become more than the sum of their parts.

For now, if you're curious, the repo has a README and setup instructions. Clone it, break it, drool over it. That's what it's there for.


This post is part of the AI-Powered Knowledge Management Series. Previous: Finding Meaning in Your Notes with CK Search. Next: Beyond Obsidian — Building an External Knowledge Base with Sift.

Powered by Buttondown.