

RAG Architecture

  • jonas
  • Tits&Tats
  • April 20, 2025

Long time no hear? Well, I’ve been chasing down many bad leads for RAG systems. Some performed terribly, others required a lot of hassle for minimal return, and in many cases, the differences were only noticeable in edge cases.

But in the end, I managed to get a system up and running that I’m comfortable with. Let’s start at the beginning. The focus of Project Skald is an AI chatbot to help me navigate my pen-and-paper notes.

Goal

I want to ask the chatbot a simple question. My go-to example is a short note about an adventure set in a Victorian fantasy world. If you don’t know Fallen London, make sure to check it out!

The question I want answered is: “What is the relationship between the ambitious muse and the transfigured poet?”

In the notes, both characters have short biographies, and in one adventure, they interact. The poet becomes the muse’s new protégé. Unfortunately, he soon vanishes, and the muse asks the players to find him. Oh—and she provides him with a drug to enhance his poetry.

I want a clear answer that states their relationship. A little context is fine, but I don’t want to read an entire paragraph (otherwise, I could just search my notes myself). And hallucinations are a hard no. The AI must not invent information. Sometimes the AI says things like:

“While the text does not clearly state this, the poet might be romantically interested in the muse.”

This is tricky. On one hand, the AI clearly marks it as speculation. And often, these guesses are based on interpretations of the text. Sometimes, the AI even spots valid inferences that I deliberately left vague.

Architecture

If you ask a public AI—like DeepSeek or ChatGPT—about the relationship between the muse and the poet, it’ll give you absolute nonsense. That’s fine—they don’t have my notes. ChatGPT suggests many plausible ideas, but none of them are actually correct.

But if you give the AI context, it can answer very well. When I say “AI,” I really mean a Large Language Model (LLM). These models are incredibly good at generating text—especially at summarizing and extracting relevant information. So if I pass the full note on the muse and poet to the LLM, and then ask my question, the answer is solid. That’s actually most of the “magic.” The LLM effectively summarizes a relevant text search for me.
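
To make the "context first, question second" idea concrete, here is a minimal sketch using the Ollama Python client. The model name and the note path are placeholders, not the setup actually used here.

```python
# Minimal sketch: answer a question from a hand-picked note via Ollama.
# Assumptions: the `ollama` Python package is installed, an Ollama server is
# running locally, and "mistral:7b" (a placeholder model) has been pulled.
import ollama

note = open("notes/ambitious_muse.md", encoding="utf-8").read()  # hypothetical path

prompt = (
    "Answer the question using only the notes below. "
    "If the notes do not contain the answer, say so.\n\n"
    f"Notes:\n{note}\n\n"
    "Question: What is the relationship between the ambitious muse "
    "and the transfigured poet?"
)

response = ollama.chat(
    model="mistral:7b",
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])
```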

This just shifts the problem: the quality of the LLM’s response depends heavily on the quality of the provided context. And now we need a way to automatically find good context. This is where RAG comes into play.

Retrieval-Augmented Generation (RAG) builds a context with all the relevant information for the LLM. It searches a database and computes distances between my question and the stored text chunks. A small distance means high similarity. We then take the top 10 most relevant chunks and pass them to the LLM.
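
The retrieval step itself is a nearest-neighbour query. A minimal sketch with ChromaDB follows; the collection name and database path are placeholders.

```python
# Minimal retrieval sketch with ChromaDB: embed the question, fetch the
# ten closest chunks, and stitch them into one context string for the LLM.
# "skald_notes" and "./chroma_db" are assumed names, not the real ones.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("skald_notes")

question = "What is the relationship between the ambitious muse and the transfigured poet?"

results = collection.query(query_texts=[question], n_results=10)

# results["documents"] holds one list of chunks per query, ordered by
# increasing distance, i.e. most similar first.
context = "\n\n".join(results["documents"][0])
```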

Building the database is relatively easy. An embedding model—similar to an LLM—computes an embedding for each text chunk. Based on these embeddings, we can compute similarity distances. Initially, I just split the text by character count and saved the chunks.
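
Building that database might look roughly like this; the chunk size, file layout, and collection name are assumptions for illustration.

```python
# Minimal indexing sketch: naive character-count splitting, then storing the
# chunks in ChromaDB, which embeds them with its default embedding model.
import pathlib
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("skald_notes")

CHUNK_SIZE = 800  # characters; a starting guess rather than a tuned value

for note_path in pathlib.Path("notes").glob("*.md"):
    text = note_path.read_text(encoding="utf-8")
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    collection.add(
        documents=chunks,
        ids=[f"{note_path.stem}-{i}" for i in range(len(chunks))],
    )
```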

Optimisation

The first RAG output wasn’t great, so I started tweaking parameters. There are a lot—and very few guides suggest sensible defaults.

For the LLM part:

  • Model selection: I can run most 7B and some 12B models. Larger models are slower but more capable. Both handle grammar and summarization well. The 12B models often pick up implied meaning not explicitly stated.
  • Context window: How many chunks do you pass to the LLM? Too many = irrelevant info; too few = missing key info. This must be tuned for each model (see the prompt-assembly sketch after this list).
  • History: In a chat, you usually include previous interactions. But for RAG, it’s less clear. The similarity search only sees the current query. If I ask, “What is the relationship between the muse and the poet?”, and then follow up with “Could he love her?”, the RAG system doesn’t know who “he” is. The LLM might infer it from the retrieved context, but the retrieval step doesn’t.
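
To make the context-window and history points concrete, here is one way the final prompt could be assembled. The chunk count, model name, and the choice to retrieve only on the latest question are assumptions about a typical setup, not a fixed recipe.

```python
# Sketch of prompt assembly: only the latest question drives retrieval,
# while earlier turns are passed to the LLM as chat history.
# TOP_K is the "context window" knob discussed above.
import chromadb
import ollama

TOP_K = 10  # number of retrieved chunks handed to the LLM

collection = chromadb.PersistentClient(path="./chroma_db").get_collection("skald_notes")

def answer(question: str, history: list[dict]) -> str:
    hits = collection.query(query_texts=[question], n_results=TOP_K)
    context = "\n\n".join(hits["documents"][0])
    messages = history + [{
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}",
    }]
    reply = ollama.chat(model="mistral:7b", messages=messages)  # placeholder model
    return reply["message"]["content"]

# Because retrieval never sees the history, a follow-up like "Could he love her?"
# retrieves chunks for the bare pronoun, not for the poet.
```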

Even if the LLM is configured perfectly, RAG itself offers more tuning:

  • Chunk sizes: Larger chunks = more context, but may dilute focus. Smaller chunks = risk missing full meaning. Since chunks are split by length, not content, this is tricky.
  • Chunk overlap: Overlapping ensures that important info at the edge of one chunk is captured fully in another (see the splitting sketch after this list).
  • Embedding models: Different models suggest different chunks. I had trouble getting many to run, but even among those that did, the quality of results varied.
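
For the chunking knobs, a naive splitter with size and overlap parameters is enough to experiment with; the numbers below are placeholders, not recommended values.

```python
# Naive character-based splitter exposing the two knobs from the list above.
def split_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Larger chunk_size = more context per chunk; larger overlap = less risk of
# cutting an important sentence in half at a boundary, at the cost of duplication.
```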

So yes—many parameters to tweak. And since the goal is to overengineer everything, the question isn’t “should I tweak this?” but “can I?”

Final Setup

I went with a Python-based approach. The LLM runs via Ollama on my desktop machine. With up to 12B models, I can test a wide variety. I don’t plan to connect to cloud-based AIs.

The retrieval and database are handled by ChromaDB. Why? It was the easiest to set up. (Maybe I’ll share my Python struggles later—it’s not as portable as I hoped, especially for performant local databases.)

All parameters can be adjusted via a Gradio web UI.
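
A stripped-down version of such a UI could look like the sketch below; the parameter names and the rag_answer function are hypothetical stand-ins for whatever the real app exposes.

```python
# Minimal Gradio sketch: a question box plus sliders for the main RAG knobs.
import gradio as gr

def rag_answer(question: str, top_k: int, chunk_size: int) -> str:
    # ... retrieve top_k chunks of size chunk_size and query the LLM ...
    return f"(answer for {question!r} with top_k={top_k}, chunk_size={chunk_size})"

demo = gr.Interface(
    fn=rag_answer,
    inputs=[
        gr.Textbox(label="Question"),
        gr.Slider(1, 20, value=10, step=1, label="Chunks in context"),
        gr.Slider(200, 2000, value=800, step=100, label="Chunk size (characters)"),
    ],
    outputs=gr.Textbox(label="Answer"),
)

if __name__ == "__main__":
    demo.launch()
```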

I’m working on getting everything running in a Docker container. Once that’s done, I’ll share the project on GitHub. Until then, I still need to learn a bit more Python packaging and probably write up some documentation in case someone wants to build on the project.

Next up: How I test parameters and deal with even more LLM-related headaches.

ai skald

