Case study · Feb 4, 2026
AI DevOps Copilot
A planned CLI + web copilot that triages production logs, drafts incident summaries, and proposes runbooks — grounded in retrieved prior incidents.
Status — Concept. This is a design study, not a shipped product. No public repository or live demo exists yet. The architecture and decisions below describe what I'd build.
One-line summary
A copilot that turns production noise into a short, sourced summary and a proposed next step — grounded in the team's prior incidents rather than generic advice.
Problem
When something breaks at 2am, the first 10 minutes are spent finding the problem, not solving it. Engineers grep through Loki, scan Grafana, and try to remember if "this happened before". The institutional knowledge usually lives in Slack threads nobody can search.
Solution
A small Next.js app and CLI that:
- Pulls a window of logs and metrics for a service.
- Retrieves the top-K similar past incidents from a `pgvector` index.
- Asks Claude for a one-paragraph summary, a likely cause, and the next command to run, with citations back to specific log lines.
The model only ever sees redacted, scoped data, and every suggestion links back to the source it came from.
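The redaction step could be sketched as a pure function applied to every log line before it reaches the prompt. This is a minimal sketch: the rule list, the placeholder labels, and the function names `redactLine` and `redactWindow` are assumptions for illustration, not a complete secret or PII catalogue.

```typescript
// Hypothetical redaction pass. The patterns and placeholder labels are
// illustrative assumptions, not an exhaustive secret/PII list.
type RedactionRule = { label: string; pattern: RegExp };

const RULES: RedactionRule[] = [
  // Bearer tokens, e.g. from logged Authorization headers
  { label: "TOKEN", pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  // Email addresses
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // IPv4 addresses
  { label: "IP", pattern: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g },
];

// Replace every match of every rule with a stable placeholder such as [EMAIL].
function redactLine(line: string): string {
  return RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, `[${rule.label}]`),
    line
  );
}

// Redact a whole log window before it is handed to the prompt builder.
function redactWindow(lines: string[]): string[] {
  return lines.map(redactLine);
}
```

Stable placeholders keep the line readable for the model (it can still see that "an email" or "an IP" was involved) without exposing the value itself.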
My role
If built, this would be a solo end-to-end project: ingest pipeline, embeddings store, prompt contracts, the CLI, and the Next.js review UI.
Tech stack
- App — Next.js (App Router), TypeScript, Tailwind, shadcn/ui
- AI — Claude API with prompt caching and tool use
- Retrieval — PostgreSQL + pgvector, hybrid lexical/dense search
- Backend — Node.js workers, BullMQ on Redis
- Infra — Docker Compose for local dev; staging on Fly.io
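The stack lists hybrid lexical/dense search without naming a fusion method. One common option is reciprocal rank fusion (RRF), sketched below as an assumed choice; the function name, the `k = 60` constant, and the incident IDs are illustrative.

```typescript
// Reciprocal rank fusion (RRF): an assumed way to merge a lexical ranking and
// a dense (embedding) ranking. The source only says "hybrid".
type RankedIds = string[]; // incident IDs, best match first

function reciprocalRankFusion(
  rankings: RankedIds[],
  k = 60, // damping constant; 60 is a commonly used default
  topK = 5
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Ranks are 1-based: the best hit in a list contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, topK)
    .map(([id]) => id);
}

// Usage with two hypothetical rankings for the same query:
const lexical = ["inc-7", "inc-2", "inc-9"];
const dense = ["inc-2", "inc-4", "inc-7"];
const fused = reciprocalRankFusion([lexical, dense], 60, 3);
```

RRF only needs ranks, not comparable scores, which is convenient when one list comes from full-text search and the other from vector distance.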
Architecture diagram
logs/metrics ──▶ ingest worker ──▶ chunks ──▶ embeddings ──▶ pgvector
                                                                │
                                                                ▼
                                         incident query ──▶ retriever ──▶ prompt builder ──▶ Claude API ──▶ UI
                                                                                                 │
                                                                                                 ▼
                                                                                           cited summary
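The query path in the diagram (prompt builder, Claude API, cited summary) could be held together by labelling every line the model sees and then rejecting claims that do not point back at one of those labels. A minimal sketch, assuming hypothetical `LogLine` and `Claim` shapes and `[L1]`-style labels that are not specified in the design:

```typescript
// Hypothetical shapes: the field names and the [L1]-style labels are
// assumptions made for this sketch.
type LogLine = { id: string; text: string }; // e.g. id "L1"
type Claim = { text: string; citations: string[] }; // IDs the model cited

// Prompt builder: prefix every line with its ID so the model can cite it.
function buildContext(lines: LogLine[]): string {
  return lines.map((line) => `[${line.id}] ${line.text}`).join("\n");
}

// Keep a claim only if it cites at least one ID and every cited ID exists in
// the context the model was actually given.
function keepCitedClaims(claims: Claim[], lines: LogLine[]): Claim[] {
  const known = new Set(lines.map((line) => line.id));
  return claims.filter(
    (claim) =>
      claim.citations.length > 0 &&
      claim.citations.every((id) => known.has(id))
  );
}
```

Checking citations against the exact context that was sent also catches the case where the model cites a plausible-looking ID that was never in the prompt.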
Key features (planned)
- Citations are first-class. Every claim in the UI links to the log line that produced it. No citation, no claim.
- Prompt caching for the stable prompt prefix (system instructions, tool definitions) to keep latency and cost predictable.
- Tool use so the model runs scoped, allowlisted commands (`describe`, `tail`, `metric`) instead of inventing answers.
- Local-first dev: `docker compose up` boots Postgres, Redis, and the worker; no cloud accounts required.
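The allowlist for tool use could be enforced by a small gate that runs before any command the model requests. This is a sketch under assumptions: the three command names come from the feature list, while the `ToolRequest` shape, the per-service scoping rule, and the function names are hypothetical.

```typescript
// Hypothetical tool gate. Only the three command names come from the feature
// list; the request shape and the scoping rule are assumptions.
const ALLOWED_TOOLS = ["describe", "tail", "metric"] as const;
type ToolName = (typeof ALLOWED_TOOLS)[number];

type ToolRequest = { name: string; service: string };
type GateResult =
  | { ok: true; name: ToolName; service: string }
  | { ok: false; reason: string };

// Type guard: narrows an arbitrary string to one of the allowlisted names.
function isAllowedTool(name: string): name is ToolName {
  return ALLOWED_TOOLS.some((tool) => tool === name);
}

// Reject anything outside the allowlist or outside the incident's service.
function gateToolRequest(req: ToolRequest, scopedService: string): GateResult {
  if (!isAllowedTool(req.name)) {
    return { ok: false, reason: `tool "${req.name}" is not allowlisted` };
  }
  if (req.service !== scopedService) {
    return { ok: false, reason: `service "${req.service}" is out of scope` };
  }
  return { ok: true, name: req.name, service: req.service };
}
```

Returning a reason instead of throwing lets the rejection be passed back to the model as a tool result, so it can pick an allowed command on the next turn.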
Anticipated challenges
- Hallucination control. A first prototype will be confident and wrong; forcing every claim to cite a chunk is the lever.
- Embedding cost. Logs are noisy; deduplicating near-identical lines before embedding is essential to keep cost reasonable.
- Streaming UX. Server Actions plus React Suspense streaming should keep the UI responsive while the model is still generating, without a WebSocket layer.
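The deduplication idea under embedding cost could look like the sketch below: mask the variable parts of each line (IDs, numbers) to get a template, then embed each template once. The masking rules and the names `toTemplate` and `dedupeLines` are assumptions for illustration; real logs would need patterns tuned to their format.

```typescript
// Hypothetical normalizer: the masking rules are illustrative assumptions.
function toTemplate(line: string): string {
  return line
    .replace(
      /\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi,
      "<uuid>"
    )
    .replace(/\b0x[0-9a-f]+\b/gi, "<hex>")
    .replace(/\d+/g, "<n>") // any remaining run of digits
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}

type LineGroup = { template: string; count: number; example: string };

// Collapse lines that share a template so each template is embedded once.
// The count and first example are kept so the summary can still cite a line.
function dedupeLines(lines: string[]): LineGroup[] {
  const seen = new Map<string, LineGroup>();
  for (const line of lines) {
    const template = toTemplate(line);
    const entry = seen.get(template);
    if (entry) {
      entry.count += 1;
    } else {
      seen.set(template, { template, count: 1, example: line });
    }
  }
  return Array.from(seen.values());
}
```

Keeping `count` alongside the template preserves a useful signal (how often a line repeated) that would otherwise be lost when duplicates are dropped.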
What I expect to learn
- A small, sharp retrieval set beats a fat, generic context window.
- Citations are an accuracy lever, not a UX nicety.
- Latency budgets matter more than benchmarks once humans are in the loop.
Improvements planned beyond v1
- First-class Loki and Grafana data sources.
- Eval suite with golden incidents and regression scoring.
- A "drafts to PR" mode that opens a runbook PR instead of just suggesting one.