How to Build an Agentic RAG with RubyLLM and Rails | Giovanni Panasiti

I run a RAG application for Italian pension and tax consultants. Users ask questions about INPS, professional pension funds, laws and regulations, and the app answers using a knowledge base of uploaded documents.

For a long time the app used the classic single-shot RAG pipeline: take the question, search the database, stuff the results into a system prompt, ask the model. It works, but it has a hard limit: the retrieval happens once, before the model has any chance to reason about the question. If the first search misses, the answer is bad and there is nothing the model can do about it.

So I rebuilt the pipeline as an agent. Now the model drives the retrieval itself: it decides what to search, reads the results, searches again with different terms, follows cross references between documents, and only then writes the answer. All in plain Ruby, with RubyLLM and Rails. No LangChain, no Python sidecar.

In this article I will show you exactly how it works, with the real code from my application. One note before we start: since the app serves Italian consultants, all the prompts, tool descriptions and user-facing strings are in Italian in the real codebase. I translated them to English here so you can follow along, but the structure is identical.

The stack

Rails 8.1, Ruby 3.3
RubyLLM ~> 1.15 for everything LLM related (chat, tools, embeddings, streaming)
PostgreSQL with pgvector, through the neighbor gem, for vector search
pg_search for full-text search (Italian dictionary)
Turbo Streams to stream the answer and the agent steps to the browser
A background job (Active Job) that runs the whole thing off the request cycle

Single-shot RAG vs agentic RAG

Quick recap of the two approaches, because the agentic version reuses almost everything from the single-shot one.

Single-shot RAG:

User asks a question
You embed the question and search your chunks
You build a context block from the top results
You put the context in the system prompt and call the LLM once

Agentic RAG:

User asks a question
You give the LLM a set of tools: search_knowledge_base, fetch_document_section, list_documents
The LLM calls the tools as many times as it needs, in a loop
When it has enough material, it writes the final answer with citations

If you never worked with tool calling (also called function calling), here is the mental model. A “tool” is just a function you describe to the model: a name, a description, and the parameters it accepts. The model cannot run anything itself. When it decides it needs a tool, it stops generating text and replies with a structured request like “call search_knowledge_base with query: 'aliquote 2024'”. Your code runs the function, sends the result back as a new message, and the model continues from there. It can ask for another tool, or write the final answer.

The key insight: you do not need to write that loop yourself. RubyLLM already has a tool loop built in. When you call chat.complete on a chat that has tools registered, RubyLLM sends the tool definitions to the model, executes the tools the model asks for, feeds the results back, and repeats until the model produces a normal text answer. Your job is only to write good tools and a good system prompt.

Step 1: the retrieval primitive

The agent is only as good as its search tool. Mine is a hybrid search on the Chunk model that combines vector similarity and full-text search, merges them with Reciprocal Rank Fusion, and reranks for diversity with MMR. This is the same method the old single-shot pipeline used, so the agent got all of it for free.

A quick word on the data model first. When a user uploads a document, a background job splits it into Chunk records. Each chunk holds a piece of text (content), its order in the document (position), and an embedding: a vector of numbers, produced by an embedding model, that represents the meaning of the text. Texts with similar meaning get vectors that point in similar directions, so “find chunks similar to this question” becomes “find the nearest vectors”, which pgvector can do with an index.

The model setup looks like this:

class Chunk < ApplicationRecord
  include PgSearch::Model

  belongs_to :document
  belongs_to :parent_chunk, class_name: "Chunk", optional: true

  has_neighbors :embedding

  pg_search_scope :text_search,
    against: :content,
    using: {
      tsearch: {
        prefix: true,
        dictionary: "italian",
        tsvector_column: "searchable"
      }
    }
end

Line by line:

has_neighbors :embedding comes from the neighbor gem. It tells Rails that the embedding column is a pgvector column and unlocks the nearest_neighbors query method, which I use later for the vector search.
belongs_to :parent_chunk is a self-reference: a chunk can point to a bigger chunk of the same document. This is the “small-to-big” pattern, I will explain it when we get to the fetch_document_section tool.
pg_search_scope :text_search defines classic full-text search on content. The three options matter: dictionary: "italian" makes PostgreSQL stem Italian words (“contributi” matches “contributo”), prefix: true matches partial words, and tsvector_column: "searchable" points to a precomputed column so PostgreSQL does not re-tokenize the text on every query.

And the heart of the search (simplified, I removed the metadata tracking to keep it readable):

def self.hybrid_search(query_text, user:, limit: 10, category: nil, use_mmr: true)
  # Phase 1: embed the query (cached, embeddings are deterministic per model+text)
  embedding_model = RubyLLM.config.default_embedding_model
  cache_key = "qemb:#{embedding_model}:#{Digest::SHA256.hexdigest(query_text)}"
  query_embedding = Rails.cache.fetch(cache_key, expires_in: 1.hour) do
    RubyLLM.embed(query_text).vectors
  end

  # Phase 2: vector search (pgvector via the neighbor gem)
  vector_results = vector_search(query_embedding, user: user, limit: limit * 3, category: category)

  # Phase 3: full-text search (pg_search, Italian dictionary)
  text_results = fulltext_search(query_text, user: user, limit: limit * 3, category: category)

  # Phase 4: merge the two rankings with Reciprocal Rank Fusion
  merged = rrf_merge(vector_results, text_results)

  # Phase 5: drop results below a similarity threshold
  filtered = apply_score_threshold(merged, query_embedding, similarity_cache)

  # Phase 6: MMR rerank for diversity
  use_mmr ? mmr_rerank(filtered, query_embedding, similarity_cache, limit: limit)
          : apply_document_cap(filtered, limit: limit)
end

Let me walk through the phases, because each one fixes a specific failure mode:

Phase 1 turns the query text into a vector with RubyLLM.embed(query_text).vectors. That is one API call to the embedding model, the only external call in the whole search.
Phases 2 and 3 run two searches in parallel conceptually: vector search finds chunks that mean the same thing even with different words, full-text search finds chunks that contain the exact words (crucial for law numbers, acronyms, article references that embeddings handle badly). Both over-fetch with limit * 3, because the next phases will filter and we want enough candidates to survive.
Phase 4 merges the two ranked lists into one (more on RRF below).
Phase 5 drops chunks whose cosine similarity to the query is too low. Without this, a question about something that is simply not in the knowledge base would still return the “least bad” chunks, and the model would try to answer from garbage. Returning nothing is better: the agent sees “No results” and can reformulate or say it does not know.
Phase 6 is MMR (Maximal Marginal Relevance). Plain similarity ranking tends to return five almost identical chunks from the same page. MMR picks results one at a time, each time choosing the candidate that is relevant to the query and different from what was already picked. You get coverage instead of repetition.

Two details worth copying:

Cache the query embedding. Embeddings are deterministic for the same model and text, so a one-hour Rails cache entry saves an API call every time the agent (or two users) repeats a query. This matters a lot in the agentic version, because the agent often searches several times per question.

RRF is dead simple and works. You do not need to normalize scores between vector search and full-text search. You only use the rank of each result in each list:

def self.rrf_merge(vector_results, text_results)
  scores = Hash.new(0.0)

  vector_results.each_with_index do |chunk, rank|
    scores[chunk.id] += 1.0 / (RRF_K + rank + 1)   # RRF_K = 60
  end

  text_results.each_with_index do |chunk, rank|
    scores[chunk.id] += 1.0 / (RRF_K + rank + 1)
  end

  all_chunks = (vector_results + text_results).uniq(&:id)
  all_chunks.sort_by { |chunk| -scores[chunk.id] }
end

The idea: a chunk ranked 1st in a list gets 1/61 points, ranked 2nd gets 1/62, and so on. A chunk that appears in both lists collects points from both, so it floats to the top of the merged ranking. The constant RRF_K = 60 flattens the curve so that being 1st versus 3rd is not a huge difference; it is the standard value from the original RRF paper and I never needed to tune it. Hash.new(0.0) just makes a hash that returns zero for missing keys, so += works without initialization.

One more thing that pays off for Italian content: I expand domain acronyms before the full-text search, because “TFR” and “trattamento di fine rapporto” need to match the same chunks:

DOMAIN_SYNONYMS = {
  "TFR"   => "trattamento di fine rapporto TFR",
  "NASpI" => "Nuova Assicurazione Sociale per l'Impiego NASpI",
  # ...
}.freeze

Step 2: write the tools

In RubyLLM a tool is a class that inherits from RubyLLM::Tool. You declare a description and parameters with a small DSL, and you implement execute. Three class-level calls define the contract with the model:

description is the text the model reads to decide when to call the tool. It is not documentation for humans, so write it as an instruction (“use this BEFORE answering”, “call it multiple times for compound questions”).
param :query, desc: "...", required: true declares one parameter. RubyLLM turns these declarations into the JSON schema that gets sent to the provider with every request.
execute(query:, category: nil) is the method RubyLLM calls when the model asks for the tool. The keyword arguments match the declared params, already parsed from the model’s JSON. Whatever string you return becomes the tool result message the model reads next.

My agent has four tools. Here is the main one, the search tool:

module Rag
  module Tools
    class SearchKnowledgeBase < RubyLLM::Tool
      MAX_TOOL_CALLS = 6
      SEARCH_LIMIT = 15

      description <<~DESC
        Search the knowledge base (Italian regulatory documents, circulars, pension regulations).
        Use this tool BEFORE answering. For compound questions or cross-references,
        call it multiple times with different, targeted queries. Returns relevant passages with their source document.
      DESC
      param :query, desc: "The search query, specific and targeted", required: true
      param :category, desc: "Optional category name to narrow the search", required: false

      def initialize(user:, category:, collector:)
        @user = user
        @category = category
        @collector = collector
        super()
      end

      def name
        "search_knowledge_base"
      end

      def execute(query:, category: nil)
        return limit_message if @collector.limit_reached?(MAX_TOOL_CALLS)

        @collector.increment_tool_call

        cat = resolve_category(category)
        chunks = Chunk.hybrid_search(query, user: @user, limit: SEARCH_LIMIT, category: cat, use_mmr: true)
        @collector.record(chunks)

        return "No results for: #{query}" if chunks.empty?

        format_sources(chunks)
      rescue StandardError => e
        Rails.logger.warn "[Agent] search_knowledge_base failed: #{e.message}"
        "Search failed for '#{query}'. Try again with different terms."
      end

      private

      def format_sources(chunks)
        chunks.map.with_index do |c, i|
          loc = c.page_number ? " (p.#{c.page_number})" : ""
          "[S#{i + 1}] #{c.document.title}#{loc} [doc:#{c.document_id}]\n#{c.content.to_s.truncate(700)}"
        end.join("\n\n")
      end

      def limit_message
        "Search limit reached. Answer now with the information you already collected, citing the sources."
      end
    end
  end
end

Reading execute top to bottom: first it checks the budget and refuses politely if the agent already searched six times. Then it counts this call, resolves the optional category name to a Category record, and delegates the actual searching to Chunk.hybrid_search from Step 1. The chunks go into the collector (so the UI can show references later), and format_sources turns them into the text the model will read. The output looks like this:

[S1] Title of the first document (p.12) [doc:42]
...first 700 characters of the chunk...

[S2] Title of the second document (p.3) [doc:17]
...first 700 characters of the chunk...

Each entry has a stable label ([S1]), the document title, the page when available, the [doc:N] tag, and the chunk text truncated to 700 characters.

Several decisions in this class come from real problems, so let me explain them.

Tools take collaborators in the constructor. RubyLLM only cares about execute, so the constructor is yours. I pass the current user (search must be scoped to documents the user can access), the chat category, and a collector object that accumulates state across all tool calls of one run. More on the collector below.

The name method is overridden on purpose. By default RubyLLM derives the tool name from the class name, including the namespace, so Rag::Tools::SearchKnowledgeBase would become something like rag--tools--search_knowledge_base. The model sees this name and the system prompt references it, so I pin it to a clean "search_knowledge_base" that matches the prompt exactly.

There is a hard budget on tool calls. MAX_TOOL_CALLS = 6. Without a budget, a model that is not finding what it wants can keep searching forever and you pay for every round trip. The important detail is how the limit is enforced: the tool does not raise, it returns a message telling the model to stop and answer with what it has. The model reads tool results, so the tool result is your channel to steer it.

Tools never raise. Every execute rescues StandardError and returns a short message in plain language. If a tool raises, the whole agent run dies and the user gets nothing. If it returns “Search failed, try again with different terms”, the model just tries a different query. Errors become something the agent can recover from.

Results carry machine-readable handles. Each result is tagged [doc:#{c.document_id}]. That is not for the user, it is for the model: the next tool (fetch_document_section) takes a document_id parameter, and the model copies the id from the tag. This is how tools compose.

The follow-the-reference tool

Italian regulations love cross references: “art. 3 refers to art. 7”. A single search will give you art. 3 but not art. 7. So the agent has a second tool that reads more context around a position in a specific document:

module Rag
  module Tools
    class FetchDocumentSection < RubyLLM::Tool
      NEIGHBOR_RADIUS = 1

      description <<~DESC
        Read more context from a specific document: useful to follow a reference
        (e.g. "art. 3 refers to art. 7") or to dig deeper into a search result.
        Provide the document_id (from search_knowledge_base results, tag [doc:N]) and the position to read around.
      DESC
      param :document_id, type: "integer", desc: "ID of the document (from the [doc:N] tag)", required: true
      param :around_position, type: "integer", desc: "Position of the chunk to read around", required: false

      def execute(document_id:, around_position: nil)
        doc = Document.for_user(@user).where(status: "completed").find_by(id: document_id)
        return "Document not accessible or does not exist." unless doc

        chunks = section_chunks(doc, around_position)
        return "No content found in the document." if chunks.empty?

        @collector.record(chunks)
        chunks.map { |c| c.content.to_s.truncate(1200) }.join("\n\n")
      end

      private

      def section_chunks(doc, position)
        scope = doc.chunks
        if position
          center = scope.find_by(position: position)
          parent = center&.parent_chunk
          return [ parent ] if parent

          range = ((position - NEIGHBOR_RADIUS)..(position + NEIGHBOR_RADIUS))
          scope.where(is_parent: false, position: range).order(:position).to_a
        else
          scope.where(is_parent: false).order(:position).limit(3).to_a
        end
      end
    end
  end
end

Note the security check on the first line of execute. The model supplies the document_id, and a model parameter is user input as far as authorization is concerned. The tool re-scopes the lookup with Document.for_user(@user), so the agent can never read a document the current user could not open in the UI. Never trust an id just because it came from your own agent: the model copies ids from text, and text can be wrong or manipulated.

The interesting logic is in section_chunks, which decides what “more context” means:

If the model gave a position, find the chunk at that position and check if it has a parent_chunk. If yes, return the parent: this is the “small-to-big” pattern. Documents are chunked twice, into small searchable chunks and larger parent chunks. Search runs on small chunks (better precision, the embedding is focused), but when the agent wants context it gets the whole parent section (better recall).
If there is no parent (short documents are not double-chunked), return the neighbor chunks instead: positions from position - 1 to position + 1, in document order. That gives the model the text right before and after the match.
If the model gave no position at all, return the first three chunks of the document, which is usually the title and the opening definitions.

Notice that is_parent: false appears in every query on small chunks. Parent chunks must never show up as search results on their own, they exist only as expansion targets, so they are excluded everywhere except the explicit parent lookup.

The orientation tool

The third tool is trivial but useful, it lists the available documents so the agent knows what sources exist before searching:

def execute(category: nil)
  scope = Document.for_user(@user).where(status: "completed", hidden: false)
  if category.present? && (cat = Category.find_by("LOWER(name) = ?", category.downcase))
    scope = scope.where(id: cat.document_ids)
  end

  docs = scope.order(:title).limit(LIMIT).to_a
  return "No documents available." if docs.empty?

  docs.map { |d| "- #{d.title} [doc:#{d.id}]" }.join("\n")
end

The note tool: make the agent think out loud

The fourth tool surprised me with how useful it is. It does almost nothing:

module Rag
  module Tools
    class Note < RubyLLM::Tool
      MAX_NOTES = 12

      description <<~DESC
        Write down a short reasoning or evaluation: why you are about to run a search,
        how you judge the results, which document is worth digging into and why, or the
        conclusion you are reaching. Use it to make your reasoning explicit.
        It is NOT the answer to the user: it only serves to trace the decision process.
      DESC
      param :thought, desc: "The reasoning/evaluation, one or two sentences", required: true

      def execute(thought:)
        return "You have noted enough; proceed with a search or with the answer." if @collector.reasoning_count >= MAX_NOTES

        @collector.add_reasoning(thought)
        "Noted."
      end
    end
  end
end

The model calls note to explain why it is about to search something, how it judges the results, and what conclusion it is reaching. The notes never reach the final answer, they go into a debug trail stored with the message. When a user reports a bad answer, I open the message and I can read the agent’s own reasoning: which searches it ran, what it thought of the results, why it picked one document over another. Debugging an agent without this is guesswork.

It also does not count against the search budget, so the model never has to choose between thinking and searching.

Step 3: the collector, shared state across tool calls

The tools are independent objects but they need to share state during one run: which chunks were retrieved (to render the references panel under the answer), the step trail (for the UI), and the tool-call counter (for the budget). I use one plain Ruby object passed to every tool:

module Rag
  class RetrievalCollector
    attr_reader :cited_chunk_ids, :steps, :tool_calls

    def initialize
      @cited_chunk_ids = []
      @seen = {}
      @steps = []
      @tool_calls = 0
      @searched = false
    end

    def record(chunks)
      @searched = true
      chunks.each do |chunk|
        next if @seen[chunk.id]
        @seen[chunk.id] = true
        @cited_chunk_ids << chunk.id
      end
    end

    def add_step(kind:, label:, detail: nil, results: nil, ms: nil, timings: nil)
      step = { kind: kind, label: label.to_s.truncate(160), detail: detail }
      step[:results] = results if results.present?
      step[:ms] = ms if ms
      @steps << step
    end

    def increment_tool_call
      @tool_calls += 1
    end

    def limit_reached?(max)
      @tool_calls >= max
    end

    def searched?
      @searched
    end
  end
end

Nothing clever here, and that is the point. The agent run is a single job execution, so a plain object with arrays is all the state management you need. No global state, no thread locals, easy to test.

A few details worth pointing out:

record deduplicates with the @seen hash but keeps first-appearance order. The same chunk often comes back from two different searches, and I want it cited once, in the position where the agent first found it.
searched? answers one question: did the agent retrieve anything during this run? I store it in the message metadata, so when an answer has no citations I can immediately tell apart “the agent never searched” (a prompt problem) from “it searched and found nothing” (a knowledge base problem).
The counter plus limit_reached? is the whole budget mechanism. The collector does not enforce anything by itself, each tool decides what to do when the limit is hit. That keeps the policy next to the tool it belongs to.

Step 4: the agent class

Now the piece that ties it together. Remember, RubyLLM owns the loop. This class only wires things up: it registers the tools, sets the system prompt, subscribes to events, runs complete, and returns what the collector accumulated.

module Rag
  class Agent
    AGENT_TEMPERATURE = 0.2

    def run(chat:, user:, query:, category:, on_content:, on_step:)
      collector = RetrievalCollector.new
      tools = build_tools(user: user, category: category, collector: collector)

      llm = chat
      llm.with_instructions(AgentPromptBuilder.build(category: category), replace: true)
      llm.with_tools(*tools, replace: true)
      llm.on_tool_call { |tool_call| on_step.call(step_for(tool_call)) }

      llm.with_temperature(AGENT_TEMPERATURE).complete do |chunk|
        on_content.call(chunk.content) if chunk.respond_to?(:content) && chunk.content.present?
      end

      {
        cited_chunk_ids: collector.cited_chunk_ids,
        steps: collector.steps,
        searched: collector.searched?
      }
    end

    private

    def build_tools(user:, category:, collector:)
      [
        Tools::SearchKnowledgeBase.new(user: user, category: category, collector: collector),
        Tools::FetchDocumentSection.new(user: user, collector: collector),
        Tools::ListDocuments.new(user: user, collector: collector),
        Tools::Note.new(collector: collector)
      ]
    end

    def step_for(tool_call)
      label =
        case tool_call.name
        when "search_knowledge_base" then "🔍 Searching: #{tool_call.arguments['query']}"
        when "fetch_document_section" then "📄 Reading a document"
        when "list_documents" then "📚 Available documents"
        when "note" then "💭 #{tool_call.arguments['thought'].to_s.truncate(140)}"
        else tool_call.name
        end
      { kind: tool_call.name, label: label }
    end
  end
end

Follow run from top to bottom. It creates one fresh collector per run and builds the four tool instances around it, so every tool writes into the same state. Then it configures the chat object: system prompt, tools, a tool-call listener, temperature, and finally complete, which blocks until the whole agent loop is done. When complete returns, the collector holds everything that happened, and run packages it into a plain hash for the caller.

The RubyLLM API surface used here is small:

with_instructions(prompt, replace: true) sets the system prompt. replace: true swaps any previous system message instead of appending a second one.
with_tools(*tools, replace: true) registers the tool instances. From here on, every request to the provider includes their JSON schemas.
on_tool_call { |tc| ... } fires every time the model decides to call a tool, before the tool runs. The tool_call object gives you the tool name and the parsed arguments hash, which is what step_for uses to build a human-readable label like “🔍 Searching: contribution rates 2024”. I push that to the UI so the user sees what the agent is doing during the long wait.
complete { |chunk| ... } runs the whole tool loop. The block receives streaming text chunks as the model generates them. Tool-call rounds produce little or no text, so in practice the block fires mostly while the final answer is being written, and the UI streams it word by word.

Note what this class does not do: it never loops, never parses tool arguments, never decides which tool to run. RubyLLM does all of that. The class builds objects, subscribes to two events, and reads the collector at the end.

The two callbacks (on_content, on_step) are injected by the caller as lambdas, so the agent class knows nothing about Turbo Streams or messages. You could run the same agent from a rake task with on_step: ->(s) { puts s[:label] } and nothing would change. That keeps it testable.

Step 5: the system prompt

The prompt is where you turn a chat model into an agent that behaves. In the real app it is written in Italian (the model answers Italian consultants, and prompting in the answer language works better); here is the English translation, the structure is what matters:

module Rag
  class AgentPromptBuilder
    def self.build(category:)
      scope = category ? "Questions are about the category: #{category.name}.\n" : ""
      <<~PROMPT
        ## Role
        You are an expert agent for Italian pension, tax and regulatory consulting ( , INPS, professional pension funds, circulars, regulations). #{scope}

        ## Available tools
        - `search_knowledge_base(query, category?)`: searches for relevant passages. ALWAYS USE IT before answering.
        - `fetch_document_section(document_id, around_position?)`: read more context from a document, to follow references (e.g. cross-references between articles) or dig deeper.
        - `list_documents(category?)`: lists the available sources.
        - `note(thought)`: write down your reasoning in one sentence.

        ## Mandatory procedure
        1. **Search before answering.** Never answer without calling `search_knowledge_base` at least once.
        2. **Note your reasoning** with `note` at the key moments.
        3. **Decompose** compound questions: run separate, targeted searches for each sub-question.
        4. **Follow references**: if a passage points to another article/document, use `fetch_document_section` to read it.
        5. **Always cite sources** with the format `[[Document name]]` for every claim.
        6. **If after searching the context is insufficient**, say so clearly. Do NOT invent data, deadlines, rates or regulatory references.
        7. Answer in **Italian**, with appropriate technical terminology.
      PROMPT
    end
  end
end

The rules that earn their place:

“Search before answering, always.” Without rule 1, the model sometimes answers from its own training data, which for pension regulations is a liability.
“Decompose compound questions.” This is the whole reason to go agentic. “What are the rates and how do they compare with INPS gestione separata?” should produce two targeted searches, not one mushy one.
“Follow references.” This tells the model that fetch_document_section exists for the cross-reference case, mirroring the tool description.
“Say when you do not know.” Combined with the no-results message from the search tool, this gives users an honest “I do not have enough information in the available documents” instead of an invented deadline.

Step 6: run it in a background job and stream to the browser

The agent loop takes anywhere from a few seconds to half a minute, so it runs in a job. The controller only creates the user message and enqueues:

def create
  @message = @chat.messages.create!(role: "user", content: content)
  ChatResponseJob.perform_later(@chat.id, content)
end

The job decides between the old pipeline and the agent with a runtime setting, which gave me a safe rollout and an instant kill switch:

def perform(chat_id, content)
  chat = Chat.find(chat_id)
  user = chat.user

  if Setting.agentic_rag_enabled?
    run_agentic(chat, user, content)
  else
    run_single_shot(chat, user, content)
  end
end

And the agentic path:

def run_agentic(chat, user, content)
  # Eager assistant message so step/content broadcasts have a target.
  assistant_message = chat.messages.create!(role: "assistant", content: "")
  accumulated = +""

  on_content = lambda do |text|
    accumulated << text
    assistant_message.broadcast_streaming_content(accumulated)
  end
  on_step = ->(step) { assistant_message.broadcast_agent_step(step) }

  llm = build_agent_chat(chat)

  result = { cited_chunk_ids: [], steps: [], searched: false }
  begin
    result = Rag::Agent.new.run(
      chat: llm, user: user, query: content, category: chat.category,
      on_content: on_content, on_step: on_step
    )
  rescue StandardError => e
    Rails.logger.error "[Agent] Run failed: #{e.class} - #{e.message}"
    accumulated << if accumulated.present?
      "\n\n---\n*[Answer interrupted. Retry for a complete answer.]*"
    else
      "An error occurred while processing. Please retry."
    end
  end

  assistant_message.update_columns(
    content: accumulated.presence || assistant_message.content,
    chunk_ids: result[:cited_chunk_ids],
    rag_metadata: {
      agentic: true,
      model: Rails.configuration.x.llm.agent_model,
      searched: result[:searched],
      steps: result[:steps]
    }
  )
  assistant_message.reload
  assistant_message.broadcast_final_content
  broadcast_references(assistant_message, user) if result[:cited_chunk_ids].any?
end

This method does a lot, so let me unpack the order of operations:

Create an empty assistant message first. Turbo Stream broadcasts need a DOM target. By creating the message before the agent starts, the message bubble (with its message_<id>_steps and message_<id>_content divs) is already on the page, and every broadcast that follows has somewhere to land. The +"" is a small Ruby detail: string literals can be frozen, and +"" guarantees a mutable string we can append to with <<.
Define the two callbacks as lambdas. on_content appends each text chunk to accumulated and re-broadcasts the full content so far (full content, not the delta, so a missed frame does not corrupt the message). on_step forwards each agent step to the step trail in the UI.
Run the agent inside a begin/rescue. If the run blows up, the user still gets something honest: either the partial answer with an “interrupted” notice appended, or a plain error message if nothing was generated yet. result is initialized to an empty-but-valid hash before the begin, so the code after the rescue can use it without nil checks.
Persist with update_columns. It writes content, cited chunk ids, and the whole debug trail (rag_metadata) in one statement, skipping callbacks. That matters here because Message has an after_create_commit broadcast and we are managing broadcasts by hand; we do not want persistence to trigger more UI updates than we ask for.
Final broadcasts. broadcast_final_content replaces the streamed plain text with the markdown-rendered version, and broadcast_references fills the references panel from the cited chunks.

The trap: do not run the agent on the persisted chat

This was the biggest gotcha. My Chat model uses RubyLLM’s Rails integration:

class Chat < ApplicationRecord
  acts_as_chat messages_foreign_key: :chat_id
  belongs_to :user
  belongs_to :category, optional: true
end

acts_as_chat is great for the single-shot pipeline: it persists every message automatically. But an agent run produces many intermediate messages: an assistant message per tool round, plus a tool-result message per tool call. If you run the agent loop directly on the persisted chat, all of those get persisted and broadcast, and your UI fills up with empty assistant bubbles.

The fix: run the agent on a non-persisted RubyLLM chat, seeded by hand with the visible conversation history, and manage a single assistant message yourself:

def build_agent_chat(chat)
  # assume_model_exists lets us use models newer than the bundled ruby_llm
  # registry; provider must be given for that to work.
  llm = RubyLLM.chat(
    model: Rails.configuration.x.llm.agent_model,
    provider: Rails.configuration.x.llm.agent_provider.to_sym,
    assume_model_exists: true
  )
  history = chat.messages.where(role: %w[user assistant], hidden: false)
                .where.not(content: [ nil, "" ]).order(:created_at)
  history.each { |m| llm.add_message(role: m.role.to_sym, content: m.content.to_s) }
  llm
end

Two things to note here:

assume_model_exists: true with an explicit provider: lets you call a model that is newer than the model registry bundled with your RubyLLM version. Without it, RubyLLM rejects model names it does not know.
The history excludes hidden messages and empty ones (including the empty assistant placeholder we just created for streaming).

Streaming the step trail

The user should not stare at a spinner for twenty seconds. Every tool call broadcasts a step into the message via Turbo Streams:

# app/models/message.rb
def broadcast_agent_step(step)
  broadcast_append_to "chat_#{chat_id}",
    target: "message_#{id}_steps",
    partial: "messages/agent_steps",
    locals: { step: step }
end

def broadcast_streaming_content(full_content)
  broadcast_update_to "chat_#{chat_id}",
    target: "message_#{id}_content",
    partial: "messages/content",
    locals: { content: full_content }
end

The difference between the two methods is the Turbo action. Steps use broadcast_append_to: each step is a new line added to the message_<id>_steps container, so the trail grows. Content uses broadcast_update_to: every broadcast replaces the content of message_<id>_content with the accumulated text so far. Both go over the same chat_<id> stream that the page subscribed to when it rendered, so this works with zero custom JavaScript: the message partial just needs divs with those ids.

So while the agent works, the user sees lines appear in real time:

💭 The question is about   contribution rates, I search the contribution regulation
🔍 Searching:   contribution rates quota A quota B
📄 Reading a document
🔍 Searching:   quota B reduced rate requirements

and then the answer streams in below. This changed how the feature feels more than anything else: the same latency reads as “it is working hard” instead of “it is stuck”.

Step 7: configuration

All the model choices live in one initializer, driven by ENV:

Rails.application.configure do
  config.x.llm.fast_model = ENV.fetch("LLM_FAST_MODEL", "gpt-4o-mini")
  # Model that drives the agentic tool loop, needs reliable function-calling.
  config.x.llm.agent_model = ENV.fetch("LLM_AGENT_MODEL", "gpt-5.5").presence || "gpt-5.5"
  config.x.llm.agent_provider = ENV.fetch("LLM_AGENT_PROVIDER", "openai").presence || "openai"
end

RubyLLM.configure do |config|
  config.openai_api_key = ENV["OPENAI_API_KEY"]
  config.openrouter_api_key = ENV["OPENROUTER_API_KEY"]
  config.default_model = ENV.fetch("LLM_DEFAULT_MODEL", "moonshotai/kimi-k2.5").presence || "moonshotai/kimi-k2.5"
  config.default_embedding_model = ENV.fetch("LLM_EMBEDDING_MODEL", "text-embedding-3-small").presence || "text-embedding-3-small"
  config.request_timeout = 120
  config.max_retries = 3
  config.use_new_acts_as = true
end

The agent model is a separate setting from the default chat model on purpose. The agent loop lives or dies on function-calling quality: a model that is fine at chatting can be unreliable at deciding when to call tools and at producing well-formed arguments. Pick the agent model for tool use, pick the rest for cost.

The agent also runs at low temperature (0.2). You want determinism in a loop: creative tool-calling is not a feature.

What I learned

Reuse your single-shot retrieval as the tool. The agent’s search tool is one line calling Chunk.hybrid_search, the same method the old pipeline used. Every retrieval improvement I made before (hybrid search, RRF, MMR, parent chunks, embedding cache) carried over for free. Build good retrieval first, make it agentic second.

Tool results are prompts. The text your tool returns is read by the model, so design it like a prompt: stable structure, machine-readable ids ([doc:42]), explicit instructions on failure (“Try again with different terms”), and an explicit instruction when the budget runs out (“Answer now with the information you already collected”).

Budget everything. Max search calls, max notes, truncated chunk content (700 chars in search results, 1200 in section reads), a chunk limit per search. The model adapts to whatever limits you set; without limits it adapts to your wallet.

Keep an audit trail. Persisting the step trail and the agent’s notes in rag_metadata on the message turned “the answer is wrong, no idea why” into “the agent searched the wrong term on step 2”. This is the single best investment for operating the thing in production.

Keep a kill switch. One runtime setting flips between agentic and single-shot. The day the agent misbehaves or the provider has issues, you fall back without a deploy.

The complete flow, end to end: controller creates the user message and enqueues a job, the job builds a fresh RubyLLM chat with history, the agent registers four tools and a system prompt, RubyLLM runs the tool loop while we stream steps and content over Turbo Streams, and the job persists the answer with citations and a full debug trail. All of it is plain Ruby objects you can read in one sitting, which, after looking at some agent frameworks, I consider the main feature.