The 3am Problem: Building Systems That Don't Need You Awake
Monitoring, alerting, and the automation mindset that changed how I work
3:17am. Phone buzzes. ntfy notification. Unlock it, squint at the brightness, read: "ZFS pool degraded — disk 3 checksum errors increasing, scrub recommended." Look at it for four seconds, put the phone down, go back to sleep.
That's the whole story. Ran a scrub after coffee the next morning. Pool was fine. Checksum errors cleared. The disk went on the weekly watchlist for replacement planning. The only reason that story is boring is that the system was built to make it boring.
A year earlier, there would have been no warning until the pool entered a critical DEGRADED state — a very different notification to receive at 3am. The gap between those two outcomes isn't hardware. It isn't luck. It's the alert hierarchy and the self-healing scripts built in the time between.
The Alert Hierarchy That Saves Sleep
Most homelab monitoring failures aren't monitoring failures — they're calibration failures. The system is watching everything. It's alerting on everything. And because everything pages you, nothing actually registers. You start ignoring alerts. Then you miss the one that mattered.
The question that fixes calibration: if I ignore this for 8 hours, what's the worst case? That single question sorts almost every alert into one of three buckets.
L1: Page now. Data loss risk: ZFS pool entering DEGRADED from multiple disk failure. Security breach indicators: unexpected outbound connections, failed auth from unusual IPs. Services with real-world exposure: scheduled automation pipelines going dark, anything with an external dependency that compounds overnight. These get ntfy at high priority. They wake you up. There are maybe five of them.
L2: Tell me when I'm awake. Degraded but stable: one disk with elevated checksum errors, a single container crash that auto-restarted. Error rate increases that aren't yet service-affecting. Disk health trending in the wrong direction. These get ntfy at normal priority. They don't wake anyone up. Check them with coffee.
L3: Log it, don't alert. Everything else. Backup completions. Cron job outcomes. Health check successes. Routine scrub results. These go to logs. Never to a push notification.
Applying this framework typically deletes roughly 60% of existing alerts. Response quality goes up. Alert fatigue goes down. Sleep quality follows.
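The sorting question can be written down as a small lookup. This is a minimal sketch in Python rather than the author's actual tooling: the `WorstCase` categories and the `classify` function are hypothetical names for the three buckets, and the delivery values are ntfy priority names ("default" is ntfy's name for normal priority).

```python
from enum import Enum


class WorstCase(Enum):
    """Hypothetical answers to: if I ignore this for 8 hours, what's the worst case?"""
    DATA_LOSS_OR_BREACH = "data loss, breach, or an outage that compounds overnight"
    DEGRADED_BUT_STABLE = "degraded but stable"
    NOTHING = "nothing happens"


# Alert level -> ntfy priority. None means the event is only written to the log.
DELIVERY = {
    "L1": "high",
    "L2": "default",
    "L3": None,
}


def classify(worst_case: WorstCase) -> str:
    """Map the 8-hour worst case to an alert level."""
    if worst_case is WorstCase.DATA_LOSS_OR_BREACH:
        return "L1"
    if worst_case is WorstCase.DEGRADED_BUT_STABLE:
        return "L2"
    return "L3"


if __name__ == "__main__":
    # One disk with elevated checksum errors: degraded but stable.
    level = classify(WorstCase.DEGRADED_BUT_STABLE)
    print(level, DELIVERY[level])
```

Keeping the mapping in one table makes the calibration visible: demoting an alert is a one-line change, not a hunt through scripts.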
The stack behind this hierarchy is four tools:
- Uptime Kuma for service availability: checks every service on a schedule and alerts on downtime with enough context to evaluate severity without logging in.
- Netdata for system metrics: CPU, memory, ZFS pool health, and network throughput in real time.
- Scrutiny for disk health: SMART data aggregated and trended so degradation shows before failure.
- ntfy for push alerts: mobile notifications that include the alert level, the affected service, and a one-line recommended action.
The whole stack runs on self-hosted infrastructure. No cloud dependencies, no SaaS fees.
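As an illustration of what a notification carrying "the alert level, the affected service, and a one-line recommended action" might look like, here is a sketch that builds an ntfy publish request in Python. ntfy accepts a plain HTTP POST to a topic URL with `Title` and `Priority` headers. The topic URL, the function names, and the message layout are assumptions for this example.

```python
import urllib.request

# Hypothetical self-hosted ntfy topic URL; replace with your own.
NTFY_TOPIC_URL = "https://ntfy.example.internal/homelab"

# ntfy priority names for the two levels that actually send a push notification.
PRIORITY = {"L1": "high", "L2": "default"}


def build_alert(level: str, service: str, summary: str, action: str) -> urllib.request.Request:
    """Build (but do not send) an ntfy publish request for an L1 or L2 alert."""
    if level not in PRIORITY:
        raise ValueError(f"{level} is log-only and should not be pushed")
    body = f"{summary}. Recommended: {action}"
    return urllib.request.Request(
        NTFY_TOPIC_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Title": f"[{level}] {service}",
            "Priority": PRIORITY[level],
        },
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    req = build_alert("L2", "zfs/tank", "disk 3 checksum errors increasing", "run a scrub")
    print(req.get_header("Title"), "|", req.data.decode("utf-8"))
```

Refusing to build a push for anything below L2 enforces the "log it, don't alert" rule in code instead of relying on discipline.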
Self-Healing Scripts: The 80% Solution
The automation mindset shift that changes everything: stop asking "how do I fix this when it breaks?" and start asking "how do I make it fix itself when it breaks?" The answer to the second question is almost always a 20-line bash script that checks one condition and takes one action. Written once. Running on cron forever.
Here's a representative set of what ends up running on a well-tuned home server:
- ZFS health check (every 6 hours). Reads checksum error counts from `zpool status`. If errors exceed threshold, triggers a scrub immediately and sends an L2 notification. On scrub completion, logs the result and sends a follow-up: cleared errors get logged, persistent errors escalate to L1. Manual scrub initiation becomes rare.
- Container health check (every 5 minutes). Checks every Docker container for `unhealthy` or `exited` state. If found, restarts with exponential backoff: 1 minute wait, then 5 minutes, then 15. After the third consecutive failure it escalates to L1 and stops retrying. This prevents transient crash-loops from waking anyone up while still catching genuine failures.
- Inference-service health check (every 10 minutes). Sends a minimal test prompt to the local inference endpoint. Three consecutive failures trigger a container restart, followed by an endpoint health verification before marking it recovered. Once a local model powers a meaningful share of your automation, silent failures stop being acceptable.
- Workflow queue check (every 10 minutes). If the automation workflow queue depth stays above threshold for more than 10 minutes, it restarts the worker process and logs the trigger event. Workflow backlogs that nobody notices are how automation pipelines silently rot.
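The container check's backoff-then-escalate behavior is the least obvious item in this list, so here is a sketch of just that decision logic in Python. It assumes one reading of the description: three restart attempts with waits of 1, 5, and 15 minutes, then an L1 escalation if the container is still unhealthy. `RestartState` and `next_step` are hypothetical names, and a real cron-driven script would have to persist the state between runs.

```python
from dataclasses import dataclass

# Wait before each restart attempt, in minutes, as described in the list above.
BACKOFF_MINUTES = (1, 5, 15)


@dataclass
class RestartState:
    """Per-container bookkeeping (hypothetical; persisted between runs in practice)."""
    consecutive_failures: int = 0
    gave_up: bool = False


def next_step(state: RestartState, healthy: bool) -> tuple[str, int]:
    """Decide what one run of the check should do.

    Returns (decision, wait_minutes), where decision is one of
    "ok", "restart", "escalate_l1", or "stopped".
    """
    if healthy:
        # Any healthy observation resets the failure streak.
        state.consecutive_failures = 0
        state.gave_up = False
        return ("ok", 0)
    if state.gave_up:
        # Already escalated: stay quiet and do not retry.
        return ("stopped", 0)
    if state.consecutive_failures >= len(BACKOFF_MINUTES):
        # All three restart attempts failed: page once, then stop retrying.
        state.gave_up = True
        return ("escalate_l1", 0)
    wait = BACKOFF_MINUTES[state.consecutive_failures]
    state.consecutive_failures += 1
    return ("restart", wait)


if __name__ == "__main__":
    state = RestartState()
    for observed_healthy in (False, False, False, False, False, True):
        print(next_step(state, observed_healthy))
```

Separating the decision from the restart command keeps the script at one action per run and makes the escalation path easy to test without touching Docker.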
Every script follows the same structure: check the condition, take exactly one action if the condition is met, log what was done with timestamp and result, notify if the action failed or if this is the Nth recurrence in a window. The key constraint is "exactly one action." Scripts that try to be smart and take multiple recovery paths become unpredictable. One check, one action, escalate if it doesn't resolve. That's it.
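The shared structure can be sketched as one small class. This is an illustration in Python, not the author's scripts: `OneActionCheck`, its parameters, and the default recurrence threshold and window are all assumptions, and the recurrence history is kept in memory here, whereas a cron job would store it in a file.

```python
import time
from collections import deque
from typing import Callable


class OneActionCheck:
    """One condition, one action, one log line, and a notification when warranted."""

    def __init__(
        self,
        name: str,
        condition: Callable[[], bool],    # returns True when something is wrong
        action: Callable[[], bool],       # the single recovery step; True on success
        log: Callable[[str], None],
        notify: Callable[[str], None],
        max_recurrences: int = 3,         # hypothetical "Nth recurrence" threshold
        window_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.name = name
        self.condition = condition
        self.action = action
        self.log = log
        self.notify = notify
        self.max_recurrences = max_recurrences
        self.window_seconds = window_seconds
        self.clock = clock
        self._fired_at: deque = deque()

    def run(self) -> str:
        if not self.condition():
            return "ok"
        now = self.clock()
        # Forget firings that have aged out of the recurrence window.
        while self._fired_at and now - self._fired_at[0] > self.window_seconds:
            self._fired_at.popleft()
        self._fired_at.append(now)
        succeeded = self.action()  # exactly one action, no fallback paths
        self.log(f"{now:.0f} {self.name} action={'ok' if succeeded else 'failed'}")
        if not succeeded:
            self.notify(f"{self.name}: action failed")
            return "action_failed"
        if len(self._fired_at) >= self.max_recurrences:
            self.notify(f"{self.name}: fired {len(self._fired_at)} times in window")
            return "recurring"
        return "acted"
```

There is deliberately no second recovery path: if the one action fails, the only remaining move is to tell a human.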
One of the most valuable scripts in a setup like this watches a core application process, running every 5 minutes. It checks whether the process is alive and serving requests on its health endpoint. If not, it runs a dependency check, restarts the process, and logs the recovery. A script like this can silently recover a service from crashes a dozen times over a few months, with zero of those crashes noticed in real time. The service just stays up. That's the only stat that matters.
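A sketch of that watchdog's flow, in Python for readability. The function names and the injected callables are hypothetical. The post does not say what happens when the dependency check fails, so skipping the restart in that case is an assumption made here.

```python
import http.client
import urllib.request
from typing import Callable


def endpoint_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures, and HTTP errors
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx responses).
        return False


def watchdog_once(
    healthy: Callable[[], bool],
    dependencies_ok: Callable[[], bool],
    restart: Callable[[], None],
    log: Callable[[str], None],
) -> str:
    """One cron run: probe, check dependencies, restart once, log the outcome."""
    if healthy():
        return "healthy"
    if not dependencies_ok():
        # Assumption: restarting while a dependency is down would not help.
        log("process down, dependencies unavailable, restart skipped")
        return "blocked"
    restart()
    recovered = healthy()
    log("process restarted, recovered" if recovered else "process restarted, still down")
    return "recovered" if recovered else "still_down"
```

In use, `healthy` would wrap `endpoint_healthy` with the service's health URL, and `restart` would call whatever service manager runs the process.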
A watchdog for the NAS's own core system daemon is a different kind of script. It runs every minute and checks that daemon process. The key difference: it can't be auto-recovered — restarting it causes worse problems than the original failure. So the script doesn't try to fix it. Instead, it detects the failure and sends an L1 alert immediately. For a process you can't safely auto-recover, fast detection is the entire self-healing strategy. Two minutes from failure to alert is the goal; a well-tuned version of this consistently hits under 90 seconds.
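The one-minute schedule is what makes the two-minute goal reachable, and the arithmetic is worth writing out. In this sketch the probe timeout and notification delivery time are illustrative assumptions, not measured values. Only the 60-second interval and the 120-second goal come from the text above.

```python
def worst_case_alert_latency(check_interval_s: float,
                             probe_timeout_s: float,
                             delivery_s: float) -> float:
    """Upper bound on seconds from daemon failure to the alert arriving.

    Worst case: the daemon dies right after a check, so a full interval
    passes before the next run, the probe takes its whole timeout, and
    then the notification has to be delivered.
    """
    return check_interval_s + probe_timeout_s + delivery_s


GOAL_S = 120  # two minutes from failure to alert

if __name__ == "__main__":
    every_minute = worst_case_alert_latency(60, probe_timeout_s=20, delivery_s=5)
    every_five = worst_case_alert_latency(300, probe_timeout_s=20, delivery_s=5)
    print(every_minute, every_minute <= GOAL_S)  # 85 True
    print(every_five, every_five <= GOAL_S)      # 325 False
```

With the same probe and delivery assumptions, a five-minute schedule misses the goal no matter how fast the script itself is, because the check interval dominates the budget.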
Most monitoring tutorials tell you how to install Prometheus and Grafana. They don't tell you that the hardest part isn't instrumentation. It's calibration. The calibration work is a week of paying attention to which alerts you actually act on versus which ones you dismiss without reading. That work is worth more than better dashboards. An uncalibrated alerting system sends too many alerts, you learn to ignore them, and you end up worse off than having no alerts at all. Do the calibration first. Install more dashboards second.
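That week of paying attention can be as simple as a tally. The sketch below, in Python, assumes you jot down each alert and whether you acted on it. The function name and both thresholds (`min_seen`, `max_act_rate`) are hypothetical starting points, not recommendations from any tool.

```python
from collections import Counter
from typing import Iterable


def demotion_candidates(events: Iterable[tuple[str, bool]],
                        min_seen: int = 3,
                        max_act_rate: float = 0.2) -> list[str]:
    """Find alerts that fire often but are rarely acted on.

    events: (alert_name, acted) pairs from a week of notes.
    Returns the names seen at least min_seen times and acted on in at most
    max_act_rate of those occurrences, sorted alphabetically.
    """
    seen = Counter()
    acted = Counter()
    for name, was_acted_on in events:
        seen[name] += 1
        if was_acted_on:
            acted[name] += 1
    return sorted(
        name for name, count in seen.items()
        if count >= min_seen and acted[name] / count <= max_act_rate
    )


if __name__ == "__main__":
    week = (
        [("backup completed", False)] * 7
        + [("disk checksum errors", True), ("disk checksum errors", True),
           ("disk checksum errors", False)]
        + [("container restarted", False)] * 2
    )
    print(demotion_candidates(week))  # ['backup completed']
```

Anything the function returns is a candidate for moving down a level, typically from a push notification to a log line.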
Sleep Is the Metric
Measure the quality of homelab ops by one number: how many times you got out of bed for it this month. A well-run setup holds at zero for months at a stretch, including through routine events like the ZFS checksum scenario that opened this post. Every non-zero month is a data point about where the system broke down: either an L2 event got miscategorized as L1, or a self-healing script failed to fire, or it was a genuine L1 event that should have been caught and automated away during the day.
Build the automation first. Tune the calibration second. Automation without alerting is blind. Alerting without automation is just a pager that wakes you up to do things a bash script could do. Both together, calibrated correctly, is a system that runs itself and tells you exactly what you need to know. Sleep is the metric.