Self-hosted Infra Monitor

Status — Completed lab. Built, configured, and run locally as a deliberate learning exercise. The Terraform module, Compose stack, dashboards, alert rules, and runbook are kept for reuse the next time I need to stand observability up from scratch. Repository is private while it's being prepared for publication.

One-sentence summary

A self-hosted observability stack — metrics, logs, alerts — running on a single VPS, fully reproducible from a Terraform module.

Problem

I wanted to actually understand Prometheus, Loki, and Grafana before defaulting to managed offerings. Reading docs only gets you so far; you need a system that breaks, alerts, and recovers in real life.

Goal

A reproducible observability lab that:

Stands up cleanly from a single terraform apply + make up.
Surfaces real failures (disk pressure, scrape misses, probe regressions) and stays quiet on noise (CPU spikes, transient memory blips).
Can be torn down and rebuilt without ClickOps in any console.
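
One way to picture the split between real failures and noise: Prometheus alert rules accept a `for` duration, so a condition has to hold continuously before an alert moves from pending to firing. The Python sketch below simulates that behaviour over a list of samples. The threshold, hold window, and sample values are illustrative assumptions, not the lab's actual rules, and `HoldRule` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class HoldRule:
    """Simplified model of an alert rule with a `for`-style hold window."""
    threshold: float     # breach when the sampled value drops below this
    hold_seconds: float  # breach must persist this long before firing


def evaluate(rule: HoldRule, samples: Iterable[Tuple[float, float]]) -> List[str]:
    """Return the alert state after each (timestamp_seconds, value) sample."""
    states: List[str] = []
    breach_started: Optional[float] = None
    for ts, value in samples:
        if value < rule.threshold:
            if breach_started is None:
                breach_started = ts  # first breaching sample starts the clock
            held = ts - breach_started
            states.append("firing" if held >= rule.hold_seconds else "pending")
        else:
            breach_started = None  # any healthy sample resets the hold window
            states.append("inactive")
    return states


# Hypothetical disk-pressure rule: less than 10% free for 5 minutes.
rule = HoldRule(threshold=0.10, hold_seconds=300)

# A one-sample blip never gets past "pending".
blip = [(0, 0.50), (60, 0.08), (120, 0.40)]

# A sustained breach is promoted to "firing" once the window has elapsed.
sustained = [(0, 0.09), (60, 0.08), (120, 0.08), (180, 0.07), (240, 0.07), (300, 0.06)]
```

Under these assumptions, `evaluate(rule, blip)` stays out of the firing state entirely, while `evaluate(rule, sustained)` ends in `firing` on the last sample.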

Architecture

User / Browser → Nginx reverse proxy with Let's Encrypt TLS → Application services → Prometheus (metrics) and Loki (logs via Promtail) → Grafana dashboards and Alertmanager. Alertmanager fans out to Discord; backups ship to S3 via Restic on a nightly cron.

A polished card-based diagram lives in the artifacts above; the ASCII fallback below is what I sketched while wiring it.

 ┌──────────────┐      ┌────────────┐      ┌────────────┐
 │  Promtail    │─────▶│   Loki     │◀─────│  Grafana   │
 └──────────────┘      └────────────┘      └────┬───────┘
                                                │
 ┌──────────────┐      ┌────────────┐           │
 │ node_exporter│─────▶│ Prometheus │───────────┘
 └──────────────┘      └─────┬──────┘
                             │ alerts
                             ▼
                       ┌────────────┐      ┌────────────┐
                       │ Alertmgr   │─────▶│  Discord   │
                       └────────────┘      └────────────┘

What I built

VPS provisioning with a Terraform module compatible with DigitalOcean and Hetzner providers, using remote state and tagged outputs for SSH and DNS.
Docker Compose stack for Prometheus, Loki, Promtail, Grafana, Alertmanager, and node_exporter, with named volumes, healthchecks, and a single-network topology.
Nginx reverse proxy terminating TLS via Let's Encrypt, with automatic renewal and a reload-on-cert-change hook.
Grafana dashboards backed by Prometheus recording rules that pre-aggregate hot series, so panel queries stay sub-second under load.
Alert rules for disk pressure, scrape failures, and probe regressions — wired through Alertmanager to a Discord webhook.
Backups with Restic to S3, nightly, including the Grafana DB and dashboard JSON.
Runbook covering bring-up, common failure modes, cert renewal, and clean tear-down.
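
To make the Alertmanager-to-Discord hop less abstract, the sketch below converts an Alertmanager-style webhook payload into the JSON body a Discord webhook accepts, where the `content` field is capped at 2000 characters. This is an illustration of what crosses the wire, not a description of how the lab is wired. The label and annotation names (`alertname`, `instance`, `summary`) follow common convention and are assumptions about the lab's rules, and `to_discord_body` is a hypothetical name.

```python
from typing import Any, Dict, List

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects longer `content` values


def to_discord_body(payload: Dict[str, Any]) -> Dict[str, str]:
    """Render an Alertmanager-style webhook payload as a Discord message body."""
    lines: List[str] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = str(alert.get("status", "unknown")).upper()
        name = labels.get("alertname", "unnamed")      # assumed label name
        instance = labels.get("instance", "n/a")       # assumed label name
        summary = annotations.get("summary", "")       # assumed annotation name
        line = f"[{status}] {name} on {instance}"
        if summary:
            line += f": {summary}"
        lines.append(line)

    content = "\n".join(lines) or "Alertmanager sent an empty notification"
    if len(content) > DISCORD_CONTENT_LIMIT:
        # Truncate rather than fail, so a burst of alerts still gets through.
        content = content[: DISCORD_CONTENT_LIMIT - 3] + "..."
    return {"content": content}


# Hypothetical payload with a single firing disk-pressure alert.
example = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DiskPressure", "instance": "vps-1"},
            "annotations": {"summary": "root filesystem below 10% free"},
        }
    ],
}
```

Keeping the formatter a pure function means it can be checked against recorded payloads before any HTTP client is involved.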

Tools used

IaC — Terraform (DigitalOcean / Hetzner provider compatible)
Runtime — Docker Compose with healthchecks and named volumes
Observability — Prometheus, Loki, Promtail, Grafana, Alertmanager
Edge — Nginx reverse proxy, TLS via Let's Encrypt
Backups — Restic to S3
Linux — Ubuntu LTS, systemd unit files, shell scripts

Key DevOps / Cloud concepts demonstrated

Infrastructure as code with Terraform — no console clicks.
Observability as a design discipline (cardinality, recording rules, alert relevance), not a bolt-on.
Edge-layer reliability (TLS, reverse proxy, certificate rotation).
Deliberate failure modes — alerts fire on things I'd actually act on.
Reproducible environments — terraform destroy followed by terraform apply rebuilds the lab in minutes.
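
The cardinality point above comes down to arithmetic: in the worst case a metric produces one series per combination of label values, so the series count is the product of the distinct values per label. The sketch below uses made-up label counts to show why dropping one high-cardinality label, or aggregating it away in a recording rule, shrinks the total so sharply. None of these numbers come from the lab, and the function names are hypothetical.

```python
from math import prod
from typing import Dict, Set


def worst_case_series(label_values: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def without_labels(label_values: Dict[str, int], dropped: Set[str]) -> Dict[str, int]:
    """Model dropping labels, e.g. via relabelling or a `sum without (...)` rule."""
    return {name: count for name, count in label_values.items() if name not in dropped}


# Illustrative label cardinalities for a single request metric.
before = {"instance": 1, "handler": 40, "status_code": 8, "request_id": 5000}
after = without_labels(before, {"request_id"})

print(worst_case_series(before))  # 1 * 40 * 8 * 5000 = 1600000
print(worst_case_series(after))   # 1 * 40 * 8 = 320
```

The bound is pessimistic, since not every label combination occurs in practice, but it is usually enough to spot which label is doing the damage.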

Lessons learned

Cardinality discipline. First-pass dashboards were beautiful and expensive. Trimming labels and switching to recording rules cut series count substantially.
Promtail positions. A sloppy Compose change wiped Promtail's file positions; I added a named volume and a healthcheck that asserts the positions file exists.
Cert renewal in containers. Mounting /etc/letsencrypt correctly and reloading Nginx on cert change took a couple of broken Sundays.
Compose is the right primitive for a single-node lab — Kubernetes here would be honking complexity for no payoff.
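
The positions-file healthcheck from the Promtail lesson could look roughly like the sketch below. It is not the lab's actual check: the path `/var/lib/promtail/positions.yaml` is an assumed mount point inside the named volume, and the function names are hypothetical. The idea is only that a missing file yields a non-zero exit code, which a Compose healthcheck treats as unhealthy.

```python
import os
import sys
from typing import List

# Assumed location of the positions file inside the named volume.
DEFAULT_POSITIONS_PATH = "/var/lib/promtail/positions.yaml"


def positions_file_ok(path: str) -> bool:
    """Return True when the positions file exists as a regular file."""
    return os.path.isfile(path)


def main(argv: List[str]) -> int:
    """Exit-code style entry point: 0 is healthy, 1 is unhealthy."""
    path = argv[1] if len(argv) > 1 else DEFAULT_POSITIONS_PATH
    if positions_file_ok(path):
        return 0
    print(f"positions file missing: {path}", file=sys.stderr)
    return 1
```

Ending the file with `sys.exit(main(sys.argv))` would turn it into a script. If the container image does not ship Python, the same assertion is a one-line `test -f` in shell.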

Next improvements

A status-page front end fed by the Prometheus query API.
OpenTelemetry traces for the status-page service.
A chaos-day script that randomly stops one container and asserts the alerts fire and recover within SLO.
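
The chaos-day idea can be sketched without committing to any tooling. In the version below, stopping a container, starting it again, and checking whether its alert is firing are all passed in as callables, so the control flow can be exercised with fakes before it is wired to Docker and the Prometheus or Alertmanager API. The SLO values and every function name are hypothetical.

```python
import random
import time
from typing import Callable, Optional, Sequence


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poll_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def chaos_round(
    containers: Sequence[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    alert_firing: Callable[[str], bool],
    fire_slo_s: float,
    recover_slo_s: float,
    poll_s: float = 5.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Stop one random container, then assert its alert fires and recovers in time."""
    victim = (rng or random.Random()).choice(list(containers))
    stop(victim)
    try:
        fired = wait_until(lambda: alert_firing(victim), fire_slo_s, poll_s, sleep, clock)
    finally:
        start(victim)  # always restore the container, even if the check blows up
    if not fired:
        raise AssertionError(f"alert for {victim} did not fire within {fire_slo_s}s")
    recovered = wait_until(lambda: not alert_firing(victim), recover_slo_s, poll_s, sleep, clock)
    if not recovered:
        raise AssertionError(f"alert for {victim} did not resolve within {recover_slo_s}s")
    return victim
```

One design note: Prometheus and Alertmanager themselves are probably poor victims for this loop, since stopping them removes the thing that is supposed to raise the alert.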