Why I Built LlmLogs — Anton Kopylov

Most Rails apps that use LLMs start with something simple: call a model, get a response, save the result.

That works until it doesn’t.

At some point I wanted to answer basic questions and realized the answers were scattered across logs, database rows, console history, and memory:

Which prompt produced this result?
Which model did it use?
How many tokens did it cost?
Did the failure happen in the main request or inside a nested tool call?
Did the prompt change since the last good run?
Can I replay the reasoning path without reading raw application logs?

So I built LlmLogs: a mountable Rails engine for LLM tracing and prompt management.

It is not a hosted observability product. It is closer to the thing I wanted inside my own Rails apps: install a gem, mount an engine, and get enough visibility to debug real LLM workflows.

The core idea: traces and spans

The main abstraction is familiar if you have used application tracing before.

A trace is one logical operation:

LlmLogs.trace("strategy_analysis", metadata: { strategy_id: 42 }) do
  chat = RubyLLM.chat(model: "anthropic/claude-sonnet-4")
  chat.ask("Analyze this strategy...")
end

Every LLM call inside that block becomes a span. Nested work can become nested spans:

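chat = RubyLLM.chat(model: "anthropic/claude-sonnet-4")

LlmLogs.trace("full_pipeline") do
  chat.ask("Step 1...")

  # A trace opened inside another trace is recorded as nested work.
  LlmLogs.trace("risk_assessment") do
    chat.ask("What are the risks?")
  end
end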
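
To make the nesting idea concrete, here is a minimal sketch of how nested trace blocks can be tracked with a per-thread stack. It is an illustration under stated assumptions, not the LlmLogs implementation: MiniTrace, Span, and the field names are made up for this example.

```ruby
# Hypothetical sketch: MiniTrace is not part of LlmLogs.
module MiniTrace
  Span = Struct.new(:name, :parent, :children, :duration, :error, keyword_init: true)

  # Stack of spans that are currently open on this thread.
  def self.stack
    Thread.current[:mini_trace_stack] ||= []
  end

  # Opens a span under the innermost open span, runs the block,
  # and records the duration and any error before closing it.
  def self.trace(name)
    span = Span.new(name: name, parent: stack.last, children: [])
    span.parent.children << span if span.parent
    stack.push(span)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      yield span
    rescue StandardError => e
      span.error = e.message
      raise
    ensure
      span.duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      stack.pop
    end
  end
end

root = MiniTrace.trace("full_pipeline") do |outer|
  MiniTrace.trace("risk_assessment") { |inner| inner }
  outer
end
```

Because the parent is taken from the top of the stack, the inner block does not need to know which trace it is running inside.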

The engine also auto-instruments ruby_llm, so many calls are captured without any manual wrapping.
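
One common way to do this kind of instrumentation in Ruby is Module#prepend. The sketch below shows the general technique only and makes no claim about how the gem hooks into ruby_llm. FakeChat, AskInstrumentation, and CAPTURED_CALLS are hypothetical names.

```ruby
# Hypothetical stand-in for a chat client; no real ruby_llm class is used here.
class FakeChat
  def ask(message)
    "echo: #{message}"
  end
end

# Collected call records. A real engine would persist these as spans.
CAPTURED_CALLS = []

# Wraps #ask: times the call, delegates to the original via super,
# then records input, output, and duration.
module AskInstrumentation
  def ask(message)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = super
    CAPTURED_CALLS << {
      input: message,
      output: response,
      duration: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    }
    response
  end
end

FakeChat.prepend(AskInstrumentation)

FakeChat.new.ask("Step 1...")
```

The calling code is unchanged. The prepended module sits in front of the original method, which is why no wrapper is needed at each call site.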

Each span records the pieces I usually need when debugging:

provider and model
input messages
output content
input, output, and cached token counts
duration
errors
custom metadata
tool calls as child spans

The goal is not to collect everything forever. The goal is to make a failing or surprising AI workflow inspectable while I still remember what I was trying to build.

Prompt management belongs next to traces

Tracing alone was not enough. In LLM apps, the prompt is part of the code.

If a result changes, the first question is often not “did the code change?” but “did the prompt change?”

LlmLogs has a prompt model with versioned content. Prompts can be stored as Mustache templates, rendered with variables, and rolled back when needed:

prompt = LlmLogs::Prompt.load("strategy-analysis")
params = prompt.build(
  app_name: "Tradebot",
  strategy_name: "Momentum Alpha",
  timeframe: "4h"
)

Every save creates a new version. Previous versions stay intact. That means a trace can point back to the prompt version that produced it, instead of just showing “whatever the prompt happens to be today.”
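
The append-only behavior can be sketched in a few lines of plain Ruby. This is an in-memory illustration, not the gem's model: VersionedPrompt and its methods are hypothetical, and the {{name}} substitution is a simplified stand-in for real Mustache rendering.

```ruby
# Illustrative only: an in-memory, append-only prompt store.
class VersionedPrompt
  Version = Struct.new(:number, :content)

  attr_reader :versions

  def initialize
    @versions = []
  end

  # Saving never edits an existing version; it appends a new one.
  def save(content)
    version = Version.new(@versions.size + 1, content.dup.freeze)
    @versions << version
    version
  end

  def current
    @versions.last
  end

  # Rolling back re-saves old content as a new version, so history is kept.
  def rollback_to(number)
    old = @versions.find { |v| v.number == number }
    raise ArgumentError, "unknown version #{number}" unless old

    save(old.content)
  end

  # Simplified {{name}} substitution. Raises KeyError if a variable is missing.
  def render(variables, version = current)
    version.content.gsub(/\{\{\s*(\w+)\s*\}\}/) do
      variables.fetch(Regexp.last_match(1).to_sym).to_s
    end
  end
end

prompt = VersionedPrompt.new
prompt.save("Analyze {{strategy_name}} on the {{timeframe}} timeframe.")
prompt.save("Review {{strategy_name}} ({{timeframe}}) for {{app_name}}.")
prompt.rollback_to(1)

text = prompt.render({ strategy_name: "Momentum Alpha", timeframe: "4h" })
```

Storing the version number alongside a trace is then enough to recover the exact text that was sent.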

There is also a sync task for file-based prompts:

bin/rails llm_logs:prompts:sync

That lets me keep prompts in Markdown files under the app, review them in Git, and still browse and use them through the Rails UI.

Batch jobs need observability too

The newer part of LlmLogs is batch support.

OpenAI’s Responses Batch API is useful when latency does not matter and cost does. But batch workflows introduce a different debugging problem: work is submitted now, completed later, handled by background jobs, and often routed back into application records.

LlmLogs persists each batch request, groups pending requests by purpose and model, submits provider batches, polls for results, and records a trace per request.
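
The grouping step is simple to picture. Here is a rough sketch, assuming a hypothetical PendingRequest record; the gem's real model and column names may differ, and "title_generation" is an invented purpose.

```ruby
# Hypothetical request record, for illustration only.
PendingRequest = Struct.new(:id, :purpose, :model, :input, keyword_init: true)

pending = [
  PendingRequest.new(id: 1, purpose: "chat_summary", model: "gpt-4.1-mini", input: "first conversation"),
  PendingRequest.new(id: 2, purpose: "chat_summary", model: "gpt-4.1", input: "second conversation"),
  PendingRequest.new(id: 3, purpose: "chat_summary", model: "gpt-4.1-mini", input: "third conversation"),
  PendingRequest.new(id: 4, purpose: "title_generation", model: "gpt-4.1-mini", input: "fourth conversation")
]

# One provider batch per (purpose, model) pair.
batches = pending.group_by { |request| [request.purpose, request.model] }

batches.each do |(purpose, model), requests|
  puts "#{purpose} / #{model}: #{requests.map(&:id).inspect}"
end
```

Keying on both fields keeps every provider batch on one model and means each batch maps to exactly one handler.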

A simplified enqueue call looks like this:

LlmLogs::Batch.enqueue(
  purpose: "chat_summary",
  model: "gpt-4.1-mini",
  instructions: "Summarize the conversation in two sentences.",
  input: conversation_text,
  routing: { conversation_id: 42 }
)

Then the app registers a handler for that purpose:

LlmLogs.register_batch_handler("chat_summary", ChatSummaryHandler.new)

The important detail: a request is only marked succeeded after the handler finishes. If the handler fails, the failure remains visible instead of disappearing into a background job log.
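
That ordering can be sketched as follows. The handler interface (a #call taking the request and the output), the status symbols, and every name here are assumptions made for the example, not the gem's API.

```ruby
# Hypothetical sketch of "succeeded only after the handler finishes".
BatchRequest = Struct.new(:id, :purpose, :status, :error, keyword_init: true)

HANDLERS = {}

def register_handler(purpose, handler)
  HANDLERS[purpose] = handler
end

# Runs the handler first. The status flips to :succeeded only if the
# handler returns without raising; otherwise the failure is stored on
# the request itself.
def process_result(request, output)
  handler = HANDLERS.fetch(request.purpose)
  handler.call(request, output)
  request.status = :succeeded
rescue StandardError => e
  request.status = :failed
  request.error = "#{e.class}: #{e.message}"
end

register_handler("chat_summary", lambda { |_request, output|
  raise "conversation not found" if output.empty?
})

ok = BatchRequest.new(id: 1, purpose: "chat_summary", status: :pending)
bad = BatchRequest.new(id: 2, purpose: "chat_summary", status: :pending)

process_result(ok, "Two-sentence summary.")
process_result(bad, "")
```

A missing handler raises KeyError inside the same method, so an unregistered purpose also ends up as a visible failed request.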

That is the kind of boring reliability detail that matters once an LLM feature starts doing useful work.

Why a Rails engine?

I could have made this a separate service. But for my use case, a mountable Rails engine had better tradeoffs.

Most of the data already belongs near the app:

application record ids
prompt versions
background job state
user-facing results
model metadata

A Rails engine can use the host app’s database, auth boundary, deployment, backups, and background jobs. No extra service to run. No separate dashboard to secure. No SaaS integration required just to inspect my own prompts.

Installation is intentionally Rails-native:

# Gemfile
gem "llm_logs"

# Then, from the app root (after bundle install):
bin/rails generate llm_logs:install
bin/rails db:migrate

Then mount it in config/routes.rb:

mount LlmLogs::Engine, at: "/llm_logs"

What I learned building it

The main lesson: LLM observability should be designed around workflow state, not just API calls.

A single model request is rarely the whole story. Real features involve prompts, versions, retries, tools, nested operations, background jobs, and application-specific routing.

The useful dashboard is the one that answers:

“What happened in this product workflow, and why did it produce that result?”

That is what I am trying to make LlmLogs good at.

It is still a small gem, but it already solves a real problem in my own projects: it makes AI features feel less like magic and more like software I can operate.