Case study · Nov 18, 2025

Completed lab

Self-hosted Infra Monitor

A production-style infrastructure monitoring lab — Docker Compose, Prometheus, Grafana, Loki, Nginx + TLS, alerting, backups, and runbooks — provisioned with Terraform on a single VPS.

Docker
Prometheus
Grafana
Loki
Nginx
Terraform
Linux
Runbooks

Status — Completed lab. Built, configured, and run locally as a deliberate learning exercise. The Terraform module, Compose stack, dashboards, alert rules, and runbook are kept for reuse the next time I need to stand observability up from scratch. Repository is private while it's being prepared for publication.

One-sentence summary

A self-hosted observability stack — metrics, logs, alerts — running on a single VPS, fully reproducible from a Terraform module.

Problem

I wanted to actually understand Prometheus, Loki, and Grafana before defaulting to managed offerings. Reading docs only gets you so far; you need a system that breaks, alerts, and recovers in real life.

Goal

A reproducible observability lab that:

  • Stands up cleanly from a single terraform apply + make up.
  • Surfaces real failures (disk pressure, scrape misses, probe regressions) and stays quiet on noise (CPU spikes, transient memory blips).
  • Can be torn down and rebuilt without ClickOps in any console.
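One common way to get the "quiet on noise" behaviour is to require a condition to hold continuously for some time before alerting, which is what the `for` clause of a Prometheus alerting rule does. The sketch below models that idea in plain Python. The thresholds, the sample data, and the function name are illustrative assumptions, not values taken from the lab.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def first_firing_time(
    samples: Iterable[Sample], threshold: float, hold_seconds: float
) -> Optional[float]:
    """Return the timestamp at which an alert would fire, or None.

    `value > threshold` must hold on every consecutive sample for at least
    `hold_seconds`; any sample at or below the threshold resets the timer.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= hold_seconds:
                return ts  # held long enough: fire
        else:
            pending_since = None  # condition cleared: back to inactive
    return None


# Hypothetical data, one sample per minute (values are percentages).
cpu_spike = [(0, 20), (60, 97), (120, 25), (180, 22), (240, 21)]
disk_fill = [(0, 80), (60, 91), (120, 92), (180, 93), (240, 94), (300, 95), (360, 96)]

print(first_firing_time(cpu_spike, threshold=90, hold_seconds=300))  # None
print(first_firing_time(disk_fill, threshold=90, hold_seconds=300))  # 360
```

The one-minute CPU spike never alerts, while the steadily filling disk alerts once it has stayed above the threshold for five minutes.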

Architecture

User / Browser → Nginx reverse proxy with Let's Encrypt TLS → Application services → Prometheus (metrics) and Loki (logs via Promtail) → Grafana dashboards and Alertmanager. Alertmanager fans out to Discord; backups ship to S3 via Restic on a nightly cron.

A polished card-based diagram lives in the Artifacts section; the ASCII fallback below is what I sketched while wiring it.

 ┌──────────────┐      ┌────────────┐      ┌────────────┐
 │  Promtail    │─────▶│   Loki     │◀─────│  Grafana   │
 └──────────────┘      └────────────┘      └────┬───────┘
                                                │
 ┌──────────────┐      ┌────────────┐           │
 │ node_exporter│─────▶│ Prometheus │───────────┘
 └──────────────┘      └─────┬──────┘
                             │ alerts
                             ▼
                       ┌────────────┐      ┌────────────┐
                       │ Alertmgr   │─────▶│  Discord   │
                       └────────────┘      └────────────┘

What I built

  • VPS provisioning with a Terraform module compatible with DigitalOcean and Hetzner providers, using remote state and tagged outputs for SSH and DNS.
  • Docker Compose stack for Prometheus, Loki, Promtail, Grafana, Alertmanager, and node_exporter, with named volumes, healthchecks, and a single-network topology.
  • Nginx reverse proxy terminating TLS via Let's Encrypt, with automatic renewal and a reload-on-cert-change hook.
  • Grafana dashboards backed by Prometheus recording rules that pre-aggregate hot series, so panel queries stay sub-second under load.
  • Alert rules for disk pressure, scrape failures, and probe regressions — wired through Alertmanager to a Discord webhook.
  • Backups with Restic to S3, nightly, including the Grafana DB and dashboard JSON.
  • Runbook covering bring-up, common failure modes, cert renewal, and clean tear-down.
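To make the Alertmanager-to-Discord hop concrete, here is a minimal sketch of the translation a small relay would perform between Alertmanager's generic webhook payload and the JSON body a Discord webhook accepts. This is an assumption about one possible wiring: recent Alertmanager releases can also post to Discord natively, and the write-up does not say which route was used. The alert name, instance, and summary text are hypothetical.

```python
import json
from typing import Any, Dict

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects message content longer than this


def alertmanager_to_discord(payload: Dict[str, Any]) -> Dict[str, str]:
    """Convert an Alertmanager webhook payload into a Discord webhook body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        instance = labels.get("instance", "n/a")
        line = f"[{status}] {name} on {instance}"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    content = "\n".join(lines) or "Alertmanager notification with no alerts"
    # Truncate rather than fail if many alerts are grouped into one message.
    return {"content": content[:DISCORD_CONTENT_LIMIT]}


# Hypothetical payload in the shape Alertmanager sends to webhook receivers.
sample = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DiskPressure", "instance": "vps-1:9100"},
            "annotations": {"summary": "root filesystem above 90%"},
        }
    ],
}

print(json.dumps(alertmanager_to_discord(sample)))
```

Keeping the translation a pure function means it can be checked against recorded payloads without sending anything to a real channel.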

Tools used

  • IaC — Terraform (DigitalOcean / Hetzner provider compatible)
  • Runtime — Docker Compose with healthchecks and named volumes
  • Observability — Prometheus, Loki, Promtail, Grafana, Alertmanager
  • Edge — Nginx reverse proxy, TLS via Let's Encrypt
  • Backups — Restic to S3
  • Linux — Ubuntu LTS, systemd unit files, shell scripts

Key DevOps / Cloud concepts demonstrated

  • Infrastructure as code with Terraform — no console clicks.
  • Observability as a design discipline (cardinality, recording rules, alert relevance), not a bolt-on.
  • Edge-layer reliability (TLS, reverse proxy, certificate rotation).
  • Deliberate failure modes — alerts fire on things I'd actually act on.
  • Reproducible environments — terraform destroy followed by terraform apply rebuilds the lab in minutes.

Lessons learned

  • Cardinality discipline. First-pass dashboards were beautiful and expensive. Trimming labels and switching to recording rules cut series count substantially.
  • Promtail positions. The positions file got wiped during a sloppy Compose change; I added a named volume and a healthcheck that asserts the file exists.
  • Cert renewal in containers. Mounting /etc/letsencrypt correctly and reloading Nginx on cert change took a couple of broken Sundays.
  • Compose is the right primitive for a single-node lab — Kubernetes here would be honking complexity for no payoff.
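The cardinality lesson above is easier to act on with a quick ranking of which metric names own the most series. The sketch below parses the response of a Prometheus instant query and sorts metrics by series count. The PromQL expression is a widely used cardinality check rather than something quoted from the lab, and the base URL and sample numbers are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Tuple

# Counts active series per metric name.
CARDINALITY_QUERY = 'count by (__name__)({__name__=~".+"})'


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def series_per_metric(response: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Turn an instant-vector response into (metric, series_count), largest first."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    rows = []
    for item in response["data"]["result"]:
        name = item["metric"].get("__name__", "<unnamed>")
        _timestamp, value = item["value"]  # value arrives as a string
        rows.append((name, int(float(value))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_cardinality(base_url: str) -> List[Tuple[str, int]]:
    """Run the query against a live Prometheus (not called in this example)."""
    url = build_query_url(base_url, CARDINALITY_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return series_per_metric(json.load(resp))


# Hypothetical response, shaped like the API's instant-vector result.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up"}, "value": [1700000000, "6"]},
            {"metric": {"__name__": "node_cpu_seconds_total"}, "value": [1700000000, "64"]},
            {"metric": {"__name__": "node_filesystem_avail_bytes"}, "value": [1700000000, "12"]},
        ],
    },
}

for name, count in series_per_metric(sample_response):
    print(f"{count:>6}  {name}")
```

Running this before and after trimming labels gives a concrete number for how much a change saved. On a large Prometheus the query itself is expensive, so it is better suited to a small lab like this one.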

Next improvements

  • A status-page front end fed by the Prometheus query API.
  • OpenTelemetry traces for the status-page service.
  • A chaos-day script that randomly stops one container and asserts the alerts fire and recover within SLO.
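The chaos-day idea can be sketched as a small loop: pick a container, stop it, wait for the alert to fire, restart it, then wait for the alert to clear. This is a hypothetical outline, not the planned script. The container actions and the alert check are injected as callables so the logic can be exercised without Docker; in a real run they would presumably wrap `docker stop` / `docker start` and a query against Alertmanager's alerts API.

```python
import random
import time
from typing import Any, Callable, Sequence


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def chaos_round(
    containers: Sequence[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    alert_firing: Callable[[str], bool],
    slo_seconds: float,
    rng: Any = random,
    **wait_kwargs: Any,
) -> str:
    """Stop one random container; assert its alert fires and then recovers."""
    victim = rng.choice(list(containers))
    stop(victim)
    try:
        fired = wait_until(lambda: alert_firing(victim), slo_seconds, **wait_kwargs)
    finally:
        start(victim)  # always bring the container back, even on failure
    if not fired:
        raise AssertionError(f"no alert for {victim} within {slo_seconds}s")
    recovered = wait_until(lambda: not alert_firing(victim), slo_seconds, **wait_kwargs)
    if not recovered:
        raise AssertionError(f"alert for {victim} did not clear within {slo_seconds}s")
    return victim


# In-memory stand-ins so the sketch runs anywhere (all names hypothetical).
down = set()
victim = chaos_round(
    ["loki", "grafana", "promtail"],
    stop=down.add,
    start=down.discard,
    alert_firing=lambda name: name in down,
    slo_seconds=1.0,
    rng=random.Random(0),
    interval_s=0.0,
)
print(f"chaos round passed for {victim}")
```

One design point worth keeping in any real version: Prometheus and Alertmanager should be excluded from the candidate list, since stopping either one removes the very path the alert has to travel.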

Artifacts