batman: An MCP Server for Incident Investigation

2026-06-12

Every incident starts the same way. Slack fires. You open Kibana, switch the time range, remember you're in IST and Kibana expects UTC, adjust. Then APM, same dance. Then the database slow query logs because APM says DB latency spiked but won't tell you which query. Then kubectl to check if something restarted. By the time you have a theory, you've spent 10 minutes just loading context.

The investigation itself is 5 minutes. The scaffolding around it is 10. That ratio bothers me.

The Setup

We run on EKS. Application logs go to Elasticsearch, APM traces go to Elastic APM, MongoDB Atlas exports diagnostic and audit logs to S3, where they're queryable via Athena, and Kubernetes is Kubernetes. Each of these is queryable. None of them is queryable from the same place, and none of them knows about the others.

That's the problem batman solves.

MCP as Infrastructure Glue

MCP (Model Context Protocol) is Anthropic's open standard for exposing external tools to models like Claude. You define a server, declare what tools it exposes, and Claude decides when to call them based on what you ask. The model handles the routing; you just ask a question in plain English.

For a shared team tool, you want SSE transport: deploy the server once, everyone connects to the same endpoint. One JSON snippet in your Claude Code config and you're in.

batman exposes five tools: application logs, APM transactions, MongoDB diagnostic logs, MongoDB audit logs, and Kubernetes. Each one wraps a single data source and returns structured data that Claude can reason over.
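To make the shape concrete, here is a minimal sketch of how one such tool could be declared with the Python MCP SDK's FastMCP helper. The tool name, its parameters, and the placeholder return value are assumptions for illustration, not batman's actual code. The SDK import is deferred so the plain function stays usable without the `mcp` package installed.

```python
from typing import Any


def search_application_logs(
    service: str, minutes: int = 60, level: str = "error"
) -> dict[str, Any]:
    """Search application logs for one service over a recent time window."""
    # Placeholder body (assumption): a real tool would query Elasticsearch here
    # and return the matching documents under "hits".
    return {
        "source": "application_logs",
        "service": service,
        "minutes": minutes,
        "level": level,
        "hits": [],
    }


def build_server():
    """Register the tool on an MCP server. Requires the third-party `mcp` package."""
    from mcp.server.fastmcp import FastMCP  # imported lazily on purpose

    server = FastMCP("batman")
    # FastMCP derives the tool's input schema from the type hints and its
    # description from the docstring.
    server.tool()(search_application_logs)
    return server


# To serve over SSE: build_server().run(transport="sse")
```

The structured return value is the point: Claude gets named fields to reason over rather than a blob of text.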

What It Looks Like in Practice

Here's a real investigation from this morning:

"What's the error rate on platform-api right now and what's causing it?"

Claude pulled the APM data for the last hour. The baseline looked normal, with hundreds of thousands of transactions and a sub-50ms p95, but two distinct error clusters were buried in the breakdown: two specific endpoints with elevated failure rates, both timing out or returning 5xx.

"Pull the actual error logs."

Claude queried the application logs filtered to errors only, same time window. Over a thousand error entries. Every single one was an outbound call to a third-party API returning a 401: the authorization header was empty. Either the API key expired or the secret isn't mounted. Root cause in the logs, symptoms in APM.

Two questions. No tab switching. Maybe 90 seconds.

What's interesting is what Claude did between those two questions without being asked: it cross-referenced the error timestamps against the APM span data and noted the errors started spiking 4 minutes before APM flagged the elevated error rate, because APM aggregates over windows and the logs are per-event. That kind of cross-system correlation takes 10 minutes manually and happens automatically here.
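The same comparison can be sketched by hand. This is an illustration only, assuming one-minute buckets and a fixed threshold for what counts as a spike; the function names and the threshold are hypothetical, and Claude did this by reasoning over tool output, not by running this code.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional


def spike_onset(timestamps: list[datetime], threshold: int) -> Optional[datetime]:
    """Return the start of the first one-minute bucket with >= threshold errors."""
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    for minute in sorted(per_minute):
        if per_minute[minute] >= threshold:
            return minute
    return None


def lead_time(
    log_errors: list[datetime], apm_flagged_at: datetime, threshold: int
) -> Optional[timedelta]:
    """How far the per-event logs ran ahead of the windowed APM signal."""
    onset = spike_onset(log_errors, threshold)
    return None if onset is None else apm_flagged_at - onset
```

Feed it the per-event error timestamps from the logs and the time APM first flagged the elevated rate, and the difference is the lag introduced by windowed aggregation.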

Building It: The Unglamorous Part

The tools themselves are straightforward wrappers. The hard part was getting the tools to work reliably across years of inconsistent observability choices baked into the data.

We have services written at different points in the company's history, by teams with different logging libraries, different field conventions, and different opinions on what a log entry should look like. Some normalize everything through a shared wrapper, some don't. The log shipping pipeline evolved as we added services. The result is data that looks uniform on the surface (same index naming pattern, same pipeline) but isn't. Field types differ across service families. Field names that look identical mean different things. Service identifiers in the documents don't match the identifiers in the index names because the two naming schemes diverged years ago and nobody reconciled them.

Building a tool that queries across all of it without blowing up on any subset is iterative. You map the actual schema, sample real documents, run queries that change one variable at a time, and work backward from failures until you understand what the data actually guarantees. That process took longer than writing the tool. It's also the part that most "build an MCP in 10 minutes" tutorials skip entirely: they query clean toy data with a predictable schema. Production data isn't like that.
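What comes out of that process tends to be a normalization layer. The sketch below shows the general idea, assuming flattened documents. Every field name and the service-to-index mapping are made-up examples, not our real schema.

```python
from typing import Any, Optional

# Hypothetical mapping from in-document service identifiers to the
# identifiers used in index names, for the cases where the two diverged.
SERVICE_TO_INDEX = {"platform-api": "platformapi"}

# Hypothetical field names that different service families use for the same thing.
LEVEL_FIELDS = ("level", "log.level", "severity")
STATUS_FIELDS = ("status", "http.status_code", "statusCode")


def first_present(doc: dict[str, Any], names: tuple[str, ...]) -> Any:
    """Return the first non-null value found under any of the candidate names."""
    for name in names:
        if doc.get(name) is not None:
            return doc[name]
    return None


def to_int(value: Any) -> Optional[int]:
    """Coerce a value to int, tolerating strings and garbage."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def normalize(doc: dict[str, Any]) -> dict[str, Any]:
    """Map one raw log document onto a single predictable shape."""
    level = first_present(doc, LEVEL_FIELDS)
    service = doc.get("service")
    return {
        "service": service,
        "index_service": SERVICE_TO_INDEX.get(service, service),
        "level": str(level).lower() if level is not None else None,
        "status": to_int(first_present(doc, STATUS_FIELDS)),
        "message": doc.get("message") or doc.get("msg"),
    }
```

Each entry in those tuples and that mapping corresponds to a failure found by sampling real documents, which is why the list can't be written up front.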

Once you have a reliable query layer, the AI side is almost trivial. Claude knows what each tool returns. You ask a question, it picks the right tools, calls them with the right parameters, and reasons over the combined output. The model is doing the hard cognitive work, holding context across multiple data sources simultaneously and drawing connections between them, which is exactly the thing humans are bad at under incident pressure.
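For a sense of what "the right parameters" turn into, here is a sketch of the query body an application-logs tool might build for "errors only, last hour", including the IST-to-UTC conversion that trips people up in Kibana. The field names (`service.name`, `log.level`, `@timestamp`) follow common Elastic conventions and are assumptions, as are the function names.

```python
from datetime import datetime, timedelta, timezone
from typing import Any

# IST is a fixed UTC+05:30 offset, so no timezone database is needed.
IST = timezone(timedelta(hours=5, minutes=30))


def to_utc_iso(moment: datetime) -> str:
    """Convert a timezone-aware datetime to an ISO 8601 UTC string."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def error_log_query(
    service: str, end: datetime, minutes: int = 60, size: int = 100
) -> dict[str, Any]:
    """Build an Elasticsearch query body for one service's errors in a window."""
    start = end - timedelta(minutes=minutes)
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service.name": service}},
                    {"term": {"log.level": "error"}},
                    {
                        "range": {
                            "@timestamp": {
                                "gte": to_utc_iso(start),
                                "lte": to_utc_iso(end),
                            }
                        }
                    },
                ]
            }
        },
    }
```

Rejecting naive datetimes outright is deliberate: a silent guess about the timezone is exactly the mistake this is meant to remove.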

MongoDB: Two Logs, Two Different Questions

The MongoDB setup is worth explaining because the two log types answer fundamentally different questions and it's not obvious which to use when.

The diagnostic logs are the database server's own output: slow queries, query plans, whether a query did a full collection scan or used an index, how much data it read, how long it burned CPU. This is where you find the queries that are hurting you. The audit logs are the authorization layer: every operation by collection, user, and command type. This is where you find out which service is hammering a specific collection at 3am.

The first time I queried the diagnostic logs broadly, there was an entry that would never have appeared in APM: a background batch job doing a full scan of a large collection, reading gigabytes of data and burning seconds of CPU per execution, running on a schedule with no web transaction attached to it. APM only sees request-scoped traces. This was invisible until I looked at the database logs directly. Now it shows up in a natural language question.
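A sketch of the kind of filter that surfaces an entry like that, assuming the diagnostic logs are in MongoDB's structured JSON format (4.4 and later), one object per line, with the plan and timings under `attr`. The function name is hypothetical, and the same fields could equally be filtered in an Athena query.

```python
import json
from typing import Any, Iterable


def collection_scans(lines: Iterable[str], min_millis: int = 0) -> list[dict[str, Any]]:
    """Return slow operations whose plan was a full collection scan, slowest first."""
    found = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        attr = entry.get("attr")
        if not isinstance(attr, dict):
            continue
        # planSummary is "COLLSCAN" for full scans, "IXSCAN { ... }" for index scans.
        if not str(attr.get("planSummary", "")).startswith("COLLSCAN"):
            continue
        duration = attr.get("durationMillis", 0)
        if duration < min_millis:
            continue
        found.append(
            {
                "ns": attr.get("ns"),
                "duration_ms": duration,
                "docs_examined": attr.get("docsExamined"),
            }
        )
    return sorted(found, key=lambda item: item["duration_ms"], reverse=True)
```

Nothing here depends on a request trace existing, which is why it catches scheduled jobs that APM never sees.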

Deployment

batman is a stateless Python server deployed as a single pod behind an internal load balancer. AWS credentials for the Athena queries come via IRSA (IAM Roles for Service Accounts): the pod's service account carries an IAM role and boto3 picks it up automatically, so there are no secrets to manage or rotate.
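A minimal sketch of the Athena side under that setup. The helper name, its parameters, and the polling defaults are assumptions. The client is passed in so the function itself never touches credentials, and in the pod it would simply be `boto3.client("athena")`.

```python
import time
from typing import Any


def run_athena_query(
    client: Any,
    sql: str,
    database: str,
    output_location: str,
    poll_seconds: float = 1.0,
    timeout_seconds: float = 60.0,
) -> list[dict[str, Any]]:
    """Run one Athena query and return the first page of rows as dicts."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    deadline = time.monotonic() + timeout_seconds

    # Athena is asynchronous: poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "")
            raise RuntimeError(f"Athena query {state}: {reason}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"Athena query {query_id} did not finish in time")
        time.sleep(poll_seconds)

    # Only the first page is read here; larger results need NextToken paging.
    rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue", "") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

No access keys appear anywhere in the code or the pod spec, which is what keeps the server stateless and the deployment boring.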

For anyone on the team, setup is one command:

claude mcp add batman --transport sse https://batman.internal.yourdomain.com/sse

Or manually as a JSON snippet in ~/.claude/mcp.json:

{
  "mcpServers": {
    "batman": {
      "type": "sse",
      "url": "https://batman.internal.yourdomain.com/sse"
    }
  }
}

Restart Claude Code or run /mcp. All five tools show up. Needs VPN.

What's Next

A few tools I want to add:

Prometheus: infrastructure metrics. Already exists in the codebase from an earlier version; just needs wiring back into the MCP surface.
CloudWatch: managed service metrics for things not on Kubernetes. Same situation.
Investigation report: a tool that serializes the conversation's findings to a markdown file. Postmortem in one command.
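The report tool doesn't exist yet, so the following is only a guess at its core: a function that takes a title and a list of (heading, detail) findings and renders markdown. The name, signature, and layout are all hypothetical.

```python
from datetime import date
from typing import Optional


def render_report(
    title: str, findings: list[tuple[str, str]], on: Optional[date] = None
) -> str:
    """Render investigation findings as a small markdown document."""
    lines = [f"# {title}", "", f"Date: {(on or date.today()).isoformat()}", ""]
    for heading, detail in findings:
        lines += [f"## {heading}", "", detail, ""]
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"
```

Writing the returned string to a file is all that remains for the tool to do.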

The pattern generalizes to any system with a queryable API. The real investment is understanding your data well enough to wrap it reliably. After that, the AI handles the rest.

Not Just for Engineers

The framing so far has been incident investigation, which is an engineering concern. But the underlying capability, asking a question in plain English and getting an answer drawn from multiple data systems simultaneously, is useful to anyone who needs to understand what's happening in production.

A product manager asking "why did retention drop this week for users who signed up via mobile?" is the same class of problem. It needs application logs, maybe APM for funnel latency, maybe database query patterns. The data exists. The bottleneck is the tooling knowledge required to get to it. batman removes that bottleneck.

A business stakeholder asking "what was the impact of last night's incident in terms of failed transactions?" is the same thing again. The answer is in the logs and APM. Historically that answer required an engineer to write a query. Now it doesn't.

And the tools themselves don't require a specialist to write anymore. The pattern is simple: wrap a queryable API, return structured data. With an AI assistant, someone who understands the data source and can describe what they want to query can build a new tool without needing to know the SDK deeply. The schema archaeology still takes domain knowledge, but the implementation work is largely solved by the AI. That changes who can contribute to a shared observability layer, not just who can use it. Kailash wrote about this shift well: code is cheap now; what's expensive is knowing what to build and being able to articulate it clearly enough for the machine to follow.

The Actual Insight

The hard part of incident investigation isn't analysis. It's context assembly. By the time you've finished the Kibana query, you've half-forgotten what the APM data said. By the time you've written the database query, you've lost the thread of the logs. You're not thinking about the problem; you're managing the tooling around it.

What batman does is offload the mechanical parts entirely. Claude holds all the context across all the data sources simultaneously and reasons over it in one place. You stay in the problem instead of the tooling.

I've said a version of this before with kubeguard: shift the friction. Make the mechanical part automatic so the thinking part gets more of your attention. batman is the same idea applied to incident investigation.

Context switching is not free. And most on-call time is context switching.