I Built a Cron Job Monitor Because Silence Kills Production

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1

    Three months ago, my client's daily database backup hadn't run in 11 days. The cron job was still scheduled. No errors in the logs. The monitoring dashboard was green. Everything looked fine.


    Until someone tried to restore from a backup that didn't exist.


    That's when I learned the hard way: traditional monitoring is terrible at catching things that don't happen.


    The Problem with Traditional Monitoring

    Most monitoring tools are great at telling you when something breaks:
    • Server is down? Alert.
    • API returns 500? Alert.
    • Disk is full? Alert.


    But what about when your nightly backup job silently stops running? Or your data sync task fails to start? Or your cleanup script never executes?


    Silence.


    Traditional monitoring watches for events. Cron jobs that don't run don't generate events. They just... don't happen. And you don't find out until it's too late.


    The "Dead Man's Switch" Approach

    After the backup incident, I started thinking differently about monitoring scheduled tasks. Instead of watching for failures, what if we watched for missing successes?


    The concept is simple:

    1. Your cron job pings an endpoint when it runs successfully
    2. If the ping doesn't arrive within the expected window, you get alerted
    3. Silence = failure


    It's like a dead man's switch on a train. If the operator stops pressing the button (the "heartbeat"), the train stops. If your job stops checking in, you get alerted.


    Building CronGuard

    I built CronGuard to solve this for myself and my clients. The core idea is dead simple:


    Every monitor gets a unique ping URL. Your job hits that URL when it completes. If we don't get a ping within the expected schedule, we alert you.


    Here's what a basic integration looks like:






    #!/bin/bash
    # Exit on the first failed command so the ping below only fires on success
    set -euo pipefail

    # Your backup script
    pg_dump mydb > backup.sql
    archive="backup-$(date +%Y%m%d).tar.gz"
    tar -czf "$archive" backup.sql
    aws s3 cp "$archive" s3://my-backups/

    # Ping CronGuard when done
    curl -fsS https://cronguard.app/api/ping/your-monitor-id







    That's it. As long as the script exits on the first failing command (for example with set -e), a failed backup means the ping doesn't happen. If the cron job stops running, the ping doesn't happen. If the server dies, the ping doesn't happen.


    In all cases: you get alerted.
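
    If you'd rather not touch the job script itself, one option is a small wrapper that runs any command and pings only when it exits with status 0. This is a minimal sketch, not CronGuard's own tooling: the function names run_and_ping and send_ping are made up for illustration, and PING_URL is a placeholder for your monitor's ping URL.

```shell
#!/bin/bash
# Hypothetical wrapper: run a job and ping only when it exits with status 0.
# PING_URL is a placeholder; substitute your own monitor's ping URL.
PING_URL="${PING_URL:-https://cronguard.app/api/ping/your-monitor-id}"

send_ping() {
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors,
    # -m 10: give up after 10 seconds, --retry 3: retry transient failures
    curl -fsS -m 10 --retry 3 "$PING_URL" > /dev/null
}

run_and_ping() {
    if "$@"; then
        # Job succeeded; a failed ping is logged but does not fail the job
        send_ping || echo "job succeeded but ping failed" >&2
    else
        local status=$?
        echo "job exited with status $status; no ping sent" >&2
        return "$status"
    fi
}
```

    You'd call it as run_and_ping followed by the job's command line. The timeout and retry flags are there so a slow network can't hang the job, and a single dropped request is less likely to trigger a false alert.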


    The Technical Choices That Mattered

    1. Keep the Ping Endpoint Stupid Simple

    The ping endpoint is just an HTTP GET. No authentication required (the URL itself is the secret). No JSON body. No custom headers. Just:






    curl https://cronguard.app/api/ping/abc123







    Why? Because I wanted it to work from anywhere:
    • Bash scripts
    • Python scripts
    • Inside Docker containers
    • Lambda functions
    • GitHub Actions
    • Cron jobs on a Raspberry Pi


    If you can make an HTTP request, you can monitor your job. No SDK. No dependencies. No authentication dance.


    2. Grace Periods Are Critical

    Here's a mistake I made early: treating cron schedules as exact.


    If a job is scheduled for 03:00, and runs at 03:02, that's not a failure. Servers reboot. Tasks queue. Execution time varies.


    CronGuard uses grace periods:
    • Daily job at 03:00? Alert if no ping by 04:00.
    • Hourly job? Alert if no ping after 70 minutes.
    • Every 5 minutes? Alert after 7 minutes.


    This eliminated false positives and made the system actually useful.
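
    To make the arithmetic concrete, here is a rough sketch of a grace-period check. It is a simplification and not CronGuard's actual code: it measures from the last ping rather than from the cron schedule, and the function name is_overdue and its arguments are assumptions for illustration. All values are in seconds.

```shell
#!/bin/bash
# is_overdue LAST_PING NOW INTERVAL GRACE
# LAST_PING and NOW are epoch timestamps; INTERVAL and GRACE are in seconds.
# Succeeds (exit 0) only when the deadline has passed without a new ping.
is_overdue() {
    local last_ping=$1 now=$2 interval=$3 grace=$4
    local deadline=$(( last_ping + interval + grace ))
    [ "$now" -gt "$deadline" ]
}

# Hourly job with a 10-minute grace period: overdue after 70 minutes.
# 5-minute job with a 2-minute grace period: overdue after 7 minutes.
```

    With the current time from date +%s, a call like is_overdue "$last_ping" "$(date +%s)" 3600 600 reproduces the "hourly job, alert after 70 minutes" rule.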


    3. The First Ping Problem

    When you create a new monitor, you haven't sent your first ping yet. Should the system immediately alert that the job is "down"?


    No. That's annoying.


    Solution: monitors are in a "waiting" state until they receive their first ping. After that, the clock starts ticking.
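
    As a sketch of how the "waiting" state can sit alongside the normal check (again, an illustration under my own naming, not the real implementation), an empty last-ping value can stand for "never pinged":

```shell
#!/bin/bash
# monitor_status LAST_PING NOW ALLOWED
# LAST_PING is an epoch timestamp, or empty if no ping has ever arrived.
# ALLOWED is the interval plus grace period, in seconds.
monitor_status() {
    local last_ping=$1 now=$2 allowed=$3
    if [ -z "$last_ping" ]; then
        echo "waiting"   # no first ping yet, so the clock has not started
    elif [ $(( now - last_ping )) -gt "$allowed" ]; then
        echo "down"      # silence has lasted longer than allowed
    else
        echo "up"
    fi
}
```

    A brand-new monitor reports "waiting" no matter how much time passes; only after the first ping can it become "down".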


    4. Recovery Notifications Matter

    Early version: alert when job stops checking in. Done.


    Reality: you also want to know when it starts working again.


    Now CronGuard sends recovery notifications too:
    • "Backup job missed check-in (expected by 04:00)"
    • "Backup job recovered (checked in at 04:15)"


    This helps you confirm your fix actually worked.
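
    One way to think about this is as notifications on state transitions rather than on states. A minimal sketch, with a hypothetical function name and simplified message wording:

```shell
#!/bin/bash
# notification_for PREVIOUS_STATE CURRENT_STATE JOB_NAME
# Prints a message only when the state change is worth telling someone about.
notification_for() {
    local prev=$1 curr=$2 job=$3
    case "$prev->$curr" in
        "up->down") echo "$job missed check-in" ;;
        "down->up") echo "$job recovered" ;;
        *) ;;  # unchanged state, or first ping after waiting: stay quiet
    esac
}
```

    Keying on the transition also means a job that stays down produces one alert, not one per check.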


    Lessons from Running It in Production

    Lesson 1: Cron Jobs Fail More Than You Think

    After deploying CronGuard for my own infrastructure and a handful of clients, I learned that cron jobs are fragile.


    Things I've seen cause silent failures:
    • Server rebooted, cron daemon didn't restart properly
    • Environment variables missing in cron context
    • Disk full, job can't write temp files
    • Database credentials rotated, job can't connect
    • Dependencies updated, script breaks
    • Path issues (/usr/local/bin not in cron's PATH)


    None of these would trigger traditional monitoring. All of them stopped critical jobs from running.
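
    Several of these can be caught at the top of the job script before any real work starts. The preamble below is a sketch under assumptions: require_cmd and require_env are helper names I made up, the PATH entries are typical defaults that may differ on your system, and the indirect expansion in require_env is bash-specific.

```shell
#!/bin/bash
set -euo pipefail

# cron usually starts jobs with a minimal PATH, so add the usual locations
export PATH="/usr/local/bin:/usr/bin:/bin:$PATH"

# Fail early (before the ping) if a required tool is not available
require_cmd() {
    command -v "$1" > /dev/null 2>&1 || {
        echo "missing command: $1" >&2
        return 1
    }
}

# Fail early if a required environment variable is unset or empty
require_env() {
    [ -n "${!1:-}" ] || {
        echo "missing environment variable: $1" >&2
        return 1
    }
}
```

    Calling require_cmd for each tool the job needs, and require_env for each variable, makes the script exit non-zero before the ping line, so a broken environment shows up as a missed check-in.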


    Lesson 2: Most People Don't Monitor Their Cron Jobs At All

    I thought everyone had sophisticated monitoring setups. Turns out, most developers and small teams just... hope their cron jobs work.


    They schedule a job once, see it run once, and assume it'll run forever. Until it doesn't.


    Lesson 3: The Real Value Is Peace of Mind

    The best feedback I got wasn't "this caught a bug." It was "I sleep better knowing I'll find out if something stops working."


    That's the real value: confidence that silence won't kill production.


    When Dead Man's Switch Monitoring Makes Sense

    This approach isn't for everything.

    Good fit:
    • Scheduled tasks: backups, data syncs, cleanup jobs, report generation
    • Async workers: if you expect jobs to complete regularly
    • Periodic data ingestion: RSS feeds, API polling, scraping

    Poor fit:
    • Real-time services: use traditional uptime monitoring instead
    • Event-driven systems: if execution is unpredictable, there is no expected window to check against


    Alternative Approaches (and Why I Didn't Use Them)

    Option 1: Log parsing


    Parse cron output for errors. Problem: a job that never runs produces no output to parse, so no output = no detection.


    Option 2: Process monitoring


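    Check if the process is running. Problem: cron jobs are short-lived. Once the process has exited, "not running" looks the same whether the job succeeded, failed, or never started.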


    Option 3: File timestamps


    Check modification time on output files. Problem: requires filesystem access, brittle.


    Option 4: Traditional uptime monitoring


    Ping the endpoint yourself. Problem: doesn't tell you if the job ran, just if the endpoint responds.


    Of these, dead man's switch monitoring is the only approach that directly answers: "Did my job complete successfully?"


    Try It Yourself

    I run CronGuard as a free service for basic monitoring (5 monitors, 5-minute checks). If you've got cron jobs, backups, or scheduled tasks you care about, give it a shot:





    You can literally start monitoring in 30 seconds:

    1. Create a monitor
    2. Copy the ping URL
    3. Add curl -fsS followed by your ping URL to the end of your script


    That's it. Now you'll know if it stops working.


    The Bottom Line

    Traditional monitoring watches for things that happen. But some of the most critical failures are things that don't happen.


    If you've got scheduled tasks keeping your infrastructure alive, you need to monitor for silence.


    Because in production, silence kills.





    Questions? Running into silent failures with your own infrastructure? Drop a comment. I'd love to hear your war stories about cron jobs that stopped working and how long it took to notice.


    Built something similar? Using a different approach? Let me know. I'm always curious how other teams solve this.



