Case study · Nov 18, 2025
Self-hosted Infra Monitor
A production-style infrastructure monitoring lab — Docker Compose, Prometheus, Grafana, Loki, Nginx + TLS, alerting, backups, and runbooks — provisioned with Terraform on a single VPS.
Status — Completed lab. Built, configured, and run locally as a deliberate learning exercise. The Terraform module, Compose stack, dashboards, alert rules, and runbook are kept for reuse the next time I need to stand observability up from scratch. Repository is private while it's being prepared for publication.
One-sentence summary
A self-hosted observability stack — metrics, logs, alerts — running on a single VPS, fully reproducible from a Terraform module.
Problem
I wanted to actually understand Prometheus, Loki, and Grafana before defaulting to managed offerings. Reading docs only gets you so far; you need a system that breaks, alerts, and recovers in real life.
Goal
A reproducible observability lab that:
- Stands up cleanly from a single `terraform apply` + `make up`.
- Surfaces real failures (disk pressure, scrape misses, probe regressions) and stays quiet on noise (CPU spikes, transient memory blips).
- Can be torn down and rebuilt without ClickOps in any console.
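The "stays quiet on noise" goal usually comes down to requiring a condition to hold for a while before it alerts, which is the job of the `for:` clause in a Prometheus alerting rule. The Python sketch below models that idea under simplifying assumptions. `PendingAlert`, `hold_evals`, and the thresholds are hypothetical names and values for illustration, not the lab's actual rules, and it counts evaluations rather than wall-clock time.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PendingAlert:
    """Tracks how long a condition has been continuously true."""

    threshold: float  # a sample above this value counts as a breach
    hold_evals: int   # consecutive breaching evaluations required (assumed)
    _streak: int = 0  # current run of breaching evaluations

    def observe(self, sample: float) -> bool:
        """Feed one evaluation; return True once the streak is long enough."""
        if sample > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any healthy sample resets the pending state
        return self._streak >= self.hold_evals


def firing_states(samples: Iterable[float], threshold: float, hold_evals: int) -> List[bool]:
    """Return, per evaluation, whether the alert would be firing."""
    alert = PendingAlert(threshold=threshold, hold_evals=hold_evals)
    return [alert.observe(sample) for sample in samples]
```

With a threshold of 90 and `hold_evals=3`, a series of isolated spikes such as `[20, 95, 30, 25, 96, 20]` never fires, while a sustained climb such as `[80, 91, 92, 93, 94]` starts firing on the third consecutive breach.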
Architecture
User / Browser → Nginx reverse proxy with Let's Encrypt TLS → Application services → Prometheus (metrics) and Loki (logs via Promtail) → Grafana dashboards and Alertmanager. Alertmanager fans out to Discord; backups ship to S3 via Restic on a nightly cron.
A polished card-based diagram lives in the Artifacts section; the ASCII fallback below is what I sketched while wiring it.
┌──────────────┐      ┌────────────┐      ┌────────────┐
│ Promtail     │─────▶│ Loki       │◀─────│ Grafana    │
└──────────────┘      └────────────┘      └────┬───────┘
                                               │
┌──────────────┐      ┌────────────┐           │
│ node_exporter│─────▶│ Prometheus │───────────┘
└──────────────┘      └─────┬──────┘
                            │ alerts
                            ▼
                      ┌────────────┐      ┌────────────┐
                      │ Alertmgr   │─────▶│ Discord    │
                      └────────────┘      └────────────┘
What I built
- VPS provisioning with a Terraform module compatible with DigitalOcean and Hetzner providers, using remote state and tagged outputs for SSH and DNS.
- Docker Compose stack for Prometheus, Loki, Promtail, Grafana, Alertmanager, and node_exporter, with named volumes, healthchecks, and a single-network topology.
- Nginx reverse proxy terminating TLS via Let's Encrypt, with automatic renewal and a reload-on-cert-change hook.
- Grafana dashboards backed by Prometheus recording rules that pre-aggregate hot series, so panel queries stay sub-second under load.
- Alert rules for disk pressure, scrape failures, and probe regressions — wired through Alertmanager to a Discord webhook.
- Backups with Restic to S3, nightly, including the Grafana DB and dashboard JSON.
- Runbook covering bring-up, common failure modes, cert renewal, and clean tear-down.
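To make the pre-aggregation idea in the list above concrete, here is a small Python model of what a recording rule shaped like `sum by (service) (...)` does to a set of series. It is an illustration only: the `labels` and `sum_by` helpers, the label names, and the values are all hypothetical and are not taken from the lab's rule files.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]


def labels(**pairs: str) -> Labels:
    """Build a hashable label set, e.g. labels(service="api", path="/a")."""
    return frozenset(pairs.items())


def sum_by(series: Dict[Labels, float], keep: Tuple[str, ...]) -> Dict[Labels, float]:
    """Collapse series onto the `keep` labels, summing values per group."""
    grouped: Dict[Labels, float] = defaultdict(float)
    for label_set, value in series.items():
        key = frozenset((name, val) for name, val in label_set if name in keep)
        grouped[key] += value
    return dict(grouped)


# Hypothetical per-path, per-status request rates (not data from the lab).
raw = {
    labels(service="api", path="/a", status="200"): 5.0,
    labels(service="api", path="/b", status="200"): 3.0,
    labels(service="api", path="/a", status="500"): 1.0,
    labels(service="web", path="/", status="200"): 7.0,
}

# The pre-aggregated view a dashboard panel would read instead of `raw`.
per_service = sum_by(raw, keep=("service",))
```

Here four raw series collapse into two. A panel that reads the pre-aggregated result touches far fewer series per query, which is the effect the recording rules are after.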
Tools used
- IaC — Terraform (DigitalOcean / Hetzner provider compatible)
- Runtime — Docker Compose with healthchecks and named volumes
- Observability — Prometheus, Loki, Promtail, Grafana, Alertmanager
- Edge — Nginx reverse proxy, TLS via Let's Encrypt
- Backups — Restic to S3
- Linux — Ubuntu LTS, systemd unit files, shell scripts
Key DevOps / Cloud concepts demonstrated
- Infrastructure as code with Terraform — no console clicks.
- Observability as a design discipline (cardinality, recording rules, alert relevance), not a bolt-on.
- Edge-layer reliability (TLS, reverse proxy, certificate rotation).
- Deliberate failure modes — alerts fire on things I'd actually act on.
- Reproducible environments: `terraform destroy` followed by `terraform apply` rebuilds the lab in minutes.
Lessons learned
- Cardinality discipline. First-pass dashboards were beautiful and expensive. Trimming labels substantially cut the stored series count, and moving panels onto recording rules reduced the series each query has to scan.
- Promtail's positions file was wiped during a sloppy Compose change; I added a named volume and a healthcheck that asserts the file exists.
- Cert renewal in containers. Mounting `/etc/letsencrypt` correctly and reloading Nginx on cert change took a couple of broken Sundays.
- Compose is the right primitive for a single-node lab; Kubernetes here would be honking complexity for no payoff.
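The positions-file healthcheck from the Promtail lesson can be sketched as a tiny check that maps "file exists" onto the exit codes a Docker healthcheck expects (0 for healthy, 1 for unhealthy). This is an assumption-laden illustration: the function names are hypothetical, the real path depends on the `positions.filename` setting in the Promtail config, and in practice a one-line shell test inside the container would likely do the same job.

```python
import os


def positions_file_ok(path: str) -> bool:
    """Return True if `path` exists and is a regular file (not a directory)."""
    return os.path.isfile(path)


def healthcheck_exit_code(path: str) -> int:
    """Map the check onto 0 (healthy) or 1 (unhealthy)."""
    return 0 if positions_file_ok(path) else 1
```

A wrapper script would call `sys.exit(healthcheck_exit_code(path))` with whatever path the named volume is mounted at. If that volume is missing, Promtail's positions file disappears whenever the container is recreated, so the check starts failing instead of the state being lost silently.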
Next improvements
- A status-page front end fed by the Prometheus query API.
- OpenTelemetry traces for the status-page service.
- A chaos-day script that randomly stops one container and asserts the alerts fire and recover within SLO.
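As a rough sketch of how the planned status page could read from the Prometheus query API: an instant query goes to `/api/v1/query`, and the JSON response carries a `vector` of samples whose values are encoded as strings. The helper names, the base URL, and the choice of the `up` metric grouped by `job` are assumptions for illustration; nothing here is built yet.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_up_vector(body: str) -> Dict[str, bool]:
    """Turn the response to an `up` query into {job: all_targets_up}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")

    status: Dict[str, bool] = {}
    for sample in data["result"]:
        job = sample["metric"].get("job", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        is_up = float(value) == 1.0
        # A job counts as healthy only if every one of its targets is up.
        status[job] = status.get(job, True) and is_up
    return status
```

A status page would fetch `instant_query_url("http://prometheus:9090", "up")` (hostname assumed), pass the response body to `parse_up_vector`, and render one row per job.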
Artifacts
Screenshot to be added: Architecture diagram
Polished card-based diagram of the request and metrics flow.
Screenshot to be added: Grafana dashboard
Service health, latency, and error budget over a 7-day window.
Screenshot to be added: Prometheus targets
Scrape-target health for Prometheus, node_exporter, and the app.
Screenshot to be added: Loki logs
Promtail-shipped logs with label filters and a saved query.
Screenshot to be added: Nginx + TLS
Reverse-proxy config and Let's Encrypt cert renewal flow.