AI DevOps Copilot

Status — Concept. This is a design study, not a shipped product. No public repository or live demo exists yet. The architecture and decisions below describe what I'd build.

One-line summary

A copilot that turns production noise into a short, sourced summary and a proposed next step — grounded in the team's prior incidents rather than generic advice.

Problem

When something breaks at 2am, the first 10 minutes go to finding the problem, not solving it. Engineers grep through Loki, scan Grafana dashboards, and try to remember whether "this happened before". The institutional knowledge that could answer that usually lives in Slack threads nobody can search.

Solution

A small Next.js app and CLI that:

Pulls a window of logs and metrics for a service.
Retrieves the top-K similar past incidents from a pgvector index.
Asks Claude for a one-paragraph summary, a likely cause, and the next command to run — with citations back to specific log lines.

The model only ever sees redacted, scoped data, and every suggestion links back to the source it came from.
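
The citation requirement can be enforced in code rather than left to the prompt. The sketch below is one way I might do it, assuming the model's answer is parsed into structured claims before it reaches the UI. Every type and function name here (LogLine, Claim, CopilotAnswer, enforceCitations) is hypothetical and exists only for illustration.

```typescript
// Hypothetical shapes; nothing here comes from an existing codebase.
interface LogLine {
  id: string;   // stable identifier for one redacted log line
  text: string;
}

interface Claim {
  text: string;
  citations: string[]; // ids of the log lines that support this claim
}

interface CopilotAnswer {
  summary: Claim[];
  likelyCause: Claim | null;
  nextCommand: Claim | null;
}

// A claim is grounded only if it cites at least one line and every cited id
// refers to a line that was actually sent to the model.
function isGrounded(claim: Claim, knownIds: Set<string>): boolean {
  return (
    claim.citations.length > 0 &&
    claim.citations.every((id) => knownIds.has(id))
  );
}

// Drop every claim that is not grounded in the supplied context.
function enforceCitations(
  answer: CopilotAnswer,
  context: LogLine[]
): CopilotAnswer {
  const knownIds = new Set(context.map((line) => line.id));
  const keep = (claim: Claim | null): Claim | null =>
    claim !== null && isGrounded(claim, knownIds) ? claim : null;
  return {
    summary: answer.summary.filter((claim) => isGrounded(claim, knownIds)),
    likelyCause: keep(answer.likelyCause),
    nextCommand: keep(answer.nextCommand),
  };
}
```

Checking that each cited id was part of the context also catches a model that invents a plausible-looking reference, which a plain "has at least one citation" test would let through.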

My role

If built, this would be a solo end-to-end project: ingest pipeline, embeddings store, prompt contracts, the CLI, and the Next.js review UI.

Tech stack

App — Next.js (App Router), TypeScript, Tailwind, shadcn/ui
AI — Claude API with prompt caching and tool use
Retrieval — PostgreSQL + pgvector, hybrid lexical/dense search
Backend — Node.js workers, BullMQ on Redis
Infra — Docker Compose for local dev; staging on Fly.io

Architecture diagram

 logs/metrics ──▶ ingest worker ──▶ chunks ──▶ embeddings ──▶ pgvector
                                                                  │
                                                                  ▼
 incident query ──▶ retriever ──▶ prompt builder ──▶ Claude API ──▶ UI
                                                                  │
                                                                  ▼
                                                           cited summary
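
The retriever step is where the lexical and dense results have to be merged into one top-K list. The design above does not commit to a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one reasonable candidate. It assumes each search returns incident ids ordered best-first, for example one list from a Postgres full-text query and one from a pgvector nearest-neighbour query. The function name and the defaults are my own assumptions.

```typescript
// Reciprocal rank fusion: score(id) = sum over rankings of 1 / (k + rank),
// where rank starts at 1. k = 60 is a commonly used default, not a tuned value.
function fuseRankings(rankings: string[][], k = 60, topK = 5): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first, then truncate to the requested size.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}

// Usage with made-up incident ids:
const lexicalHits = ["a", "b", "c"];
const denseHits = ["b", "d", "a"];
const fused = fuseRankings([lexicalHits, denseHits], 60, 3);
```

RRF only needs ranks, so it avoids having to normalize a full-text relevance score against a vector distance, which are not on comparable scales.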

Key features (planned)

Citations are first-class. Every claim in the UI links to the log line that produced it. No citation, no claim.
Prompt caching on the stable prompt prefix (system prompt and tool definitions) to cut repeated input cost and keep latency predictable.
Tool use so the model requests scoped, allowlisted commands (describe, tail, metric) that the app executes, instead of inventing answers.
Local-first dev — docker compose up boots Postgres, Redis, and the worker; no cloud accounts required.
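
The allowlist idea can be sketched as a small gate that sits between the model's tool request and anything that touches infrastructure. This is a minimal illustration, not a final design: the tool names come from the list above, while ToolCall, ToolResult, runTool, and the injected handlers are hypothetical. Real handlers would call the cluster or metrics backend, which is out of scope here.

```typescript
// The three tool names are taken from the feature list; everything else
// in this block is a hypothetical shape for illustration.
const ALLOWED_TOOLS = ["describe", "tail", "metric"] as const;
type ToolName = (typeof ALLOWED_TOOLS)[number];

interface ToolCall {
  name: string;    // whatever the model asked for, untrusted
  service: string; // target service, untrusted
}

interface ToolResult {
  ok: boolean;
  output?: string;
  error?: string;
}

type ToolHandler = (service: string) => string;

function isAllowedTool(name: string): name is ToolName {
  return (ALLOWED_TOOLS as readonly string[]).indexOf(name) !== -1;
}

// Reject anything off the allowlist or outside the incident's service scope
// before a handler is ever invoked.
function runTool(
  call: ToolCall,
  scope: ReadonlySet<string>,
  handlers: Record<ToolName, ToolHandler>
): ToolResult {
  if (!isAllowedTool(call.name)) {
    return { ok: false, error: `tool not allowlisted: ${call.name}` };
  }
  if (!scope.has(call.service)) {
    return { ok: false, error: `service out of scope: ${call.service}` };
  }
  return { ok: true, output: handlers[call.name](call.service) };
}
```

Returning a structured error instead of throwing lets the refusal be passed back to the model as a tool result, so it can correct the request rather than fail the whole run.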

Anticipated challenges

Hallucination control. A first prototype will be confident and wrong; forcing every claim to cite a chunk is the lever.
Embedding cost. Logs are noisy; deduplicating near-identical lines before embedding is essential to keep cost reasonable.
Streaming UX. Server Actions plus Suspense streaming should keep the UI live while the model is still generating, without a WebSocket layer.
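
For the embedding-cost point, one cheap approach is to mask the volatile parts of each line (timestamps, ids, numbers) and embed one representative per resulting template. The sketch below does exactly that. It is masking-based exact matching after normalization, not fuzzy similarity, and the regular expressions are assumptions about what typical log lines contain rather than a tested rule set. All names are hypothetical.

```typescript
// Replace volatile tokens so lines that differ only in timestamps, ids, or
// numbers collapse to the same template. Order matters: timestamps and
// UUIDs are masked before the generic digit rule.
function normalizeLogLine(line: string): string {
  return line
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?/g, "<ts>")
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<uuid>")
    .replace(/\b0x[0-9a-f]+\b/gi, "<hex>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

interface LogTemplate {
  template: string; // normalized form, used as the dedup key
  count: number;    // how many raw lines collapsed into it
  example: string;  // first raw line seen, kept for citations
}

// Group raw lines by template; only one entry per template needs embedding.
function dedupeForEmbedding(lines: string[]): LogTemplate[] {
  const byTemplate = new Map<string, LogTemplate>();
  for (const line of lines) {
    const template = normalizeLogLine(line);
    const existing = byTemplate.get(template);
    if (existing) {
      existing.count += 1;
    } else {
      byTemplate.set(template, { template, count: 1, example: line });
    }
  }
  return Array.from(byTemplate.values());
}
```

Keeping the count and one raw example per template means the summary can still say how often a line occurred and cite a real log line, even though only the template was embedded.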

What I expect to learn

A small, sharp retrieval set beats a fat, generic context window.
Citations are an accuracy lever, not a UX nicety.
Latency budgets matter more than benchmarks once humans are in the loop.

Improvements planned beyond v1

First-class Loki and Grafana data sources.
Eval suite with golden incidents and regression scoring.
A "drafts to PR" mode that opens a pull request against the runbook instead of only suggesting the change.