Architecture

The 3am Problem: Building Systems That Don't Need You Awake

Monitoring, alerting, and the automation mindset that changed how I work

3:17am. Phone buzzes. ntfy notification. Unlock it, squint at the brightness, read: "ZFS pool degraded — disk 3 checksum errors increasing, scrub recommended." Look at it for four seconds, put the phone down, go back to sleep.

That's the whole story. Ran a scrub after coffee the next morning. Pool was fine. Checksum errors cleared. The disk went on the weekly watchlist for replacement planning. The only reason that story is boring is because the system was built to make it boring.

A year earlier, there would have been no warning until the pool entered a critical DEGRADED state — a very different notification to receive at 3am. The gap between those two outcomes isn't hardware. It isn't luck. It's the alert hierarchy and the self-healing scripts built in the time between.

The Alert Hierarchy That Saves Sleep

Most homelab monitoring failures aren't monitoring failures — they're calibration failures. The system is watching everything. It's alerting on everything. And because everything pages you, nothing actually registers. You start ignoring alerts. Then you miss the one that mattered.

The question that fixes calibration: if I ignore this for 8 hours, what's the worst case? That single question sorts almost every alert into one of three buckets.

L1 — Page now. Data loss risk: ZFS pool entering DEGRADED from multiple disk failure. Security breach indicators: unexpected outbound connections, failed auth from unusual IPs. Services with real-world exposure: scheduled automation pipelines going dark, anything with an external dependency that compounds overnight. These get ntfy at high priority. They wake you up. There are maybe five of them.

L2 — Tell me when I'm awake. Degraded but stable: one disk with elevated checksum errors, a single container crash that auto-restarted. Error rate increases that aren't yet service-affecting. Disk health trending in the wrong direction. These get ntfy at normal priority. They don't wake anyone up. Check them with coffee.

L3 — Log it, don't alert. Everything else. Backup completions. Cron job outcomes. Health check successes. Routine scrub results. These go to logs. Never to a push notification. Applying this framework typically deletes roughly 60% of existing alerts. Response quality goes up. Alert fatigue goes down. Sleep quality follows.

A Monitoring Stack (Self-Hosted, $0 Operating Cost)

Uptime Kuma for service availability — checks every service on a schedule, alerts on downtime with enough context to evaluate severity without logging in. Netdata for system metrics — CPU, memory, ZFS pool health, network throughput in real time. Scrutiny for disk health — SMART data aggregated and trended so degradation shows before failure. ntfy for push alerts — mobile notifications that include the alert level, the affected service, and a one-line recommended action. The whole stack runs on self-hosted infrastructure. No cloud dependencies, no SaaS fees.

"Most homelab monitoring failures aren't monitoring failures. They're calibration failures."

Self-Healing Scripts: The 80% Solution

The automation mindset shift that changes everything: stop asking "how do I fix this when it breaks?" and start asking "how do I make it fix itself when it breaks?" The answer to the second question is almost always a 20-line bash script that checks one condition and takes one action. Written once. Running on cron forever.

Here's a representative set of what ends up running on a well-tuned home server:

The Self-Healing Pattern

Every script follows the same structure: check the condition, take exactly one action if the condition is met, log what was done with timestamp and result, notify if the action failed or if this is the Nth recurrence in a window. The key constraint is "exactly one action." Scripts that try to be smart and take multiple recovery paths become unpredictable. One check, one action, escalate if it doesn't resolve. That's it.

One of the most valuable scripts in a setup like this watches a core application process, running every 5 minutes. It checks whether the process is alive and serving requests on its health endpoint. If not, it runs a dependency check, restarts the process, and logs the recovery. A script like this can silently recover a service from crashes a dozen times over a few months, with zero of those crashes noticed in real time. The service just stays up. That's the only stat that matters.

A watchdog for the NAS's own core system daemon is a different kind of script. It runs every minute and checks that daemon process. The key difference: it can't be auto-recovered — restarting it causes worse problems than the original failure. So the script doesn't try to fix it. Instead, it detects the failure and sends an L1 alert immediately. For a process you can't safely auto-recover, fast detection is the entire self-healing strategy. Two minutes from failure to alert is the goal; a well-tuned version of this consistently hits under 90 seconds.

Automation Readiness Check
You've categorized your alerts into L1/L2/L3 and deleted everything that's actually L3 noise
Container crash → auto-restart is automated (the most common homelab failure mode)
Your alerts include enough context (what broke, severity, recommended action) to make a go/no-go decision without logging in
What Most Homelab Monitoring Guides Miss

They tell you how to install Prometheus and Grafana. They don't tell you that the hardest part isn't instrumentation — it's calibration. The calibration work is a week of paying attention to which alerts you actually act on versus which ones you dismiss without reading. That work is worth more than better dashboards. An uncalibrated alerting system sends too many alerts, you learn to ignore them, and you end up worse off than having no alerts at all. Do the calibration first. Install more dashboards second.

Sleep Is the Metric

Measure the quality of homelab ops by one number: how many times you got out of bed for it this month. A well-run setup holds at zero for months at a stretch, including through routine events like the ZFS checksum scenario that opened this post. Every non-zero month is a data point about where the system broke down: either an L2 event got miscategorized as L1, or a self-healing script failed to fire, or it was a genuine L1 event that should have been caught and automated away during the day.

Build the automation first. Tune the calibration second. Automation without alerting is blind. Alerting without automation is just a pager that wakes you up to do things a bash script could do. Both together, calibrated correctly, is a system that runs itself and tells you exactly what you need to know. Sleep is the metric.