How to detect deploy regressions in production (without an on-call team)
CI passed. Deploy succeeded. Prod is broken. Here's the indie-dev playbook for catching it in under a minute — without an SRE team.
The 30-second window between deploy and pain
You shipped at 11:47 AM. The Vercel build went green at 11:48. By 11:49, every customer hitting /api/checkout is getting a 500. By 11:54, when you finally see the Slack message, $237 of orders have failed.
That's the deploy regression problem. It's the most common production outage for indie SaaS and the easiest one to prevent — but only if you set up the safety net before you need it.
Why CI tests don't catch this
Your test suite runs against fake environment variables, mocked external APIs, and a fresh DB seeded with synthetic data. Production runs against:
- Real env vars (which you may have just renamed in code).
- Real third-party API tokens (which may have expired since you last looked).
- A real database with real schemas (which a migration may have subtly broken).
- Real connection pools, real cold-start behaviour, real CDN cache state.
CI tests catch logic bugs. External monitoring catches environment bugs. They're different tools for different problems, and most indie projects underinvest in the second one because nobody tells you to.
The minimal indie setup: 2 monitors, 60 seconds
Monitor 1: homepage with keyword match
Probe your production homepage every minute. Configure keyword matching on a phrase that should always appear when the page is healthy — your tagline, your CTA copy, anything that wouldn't accidentally disappear.
URL: https://your-app.com/
Method: GET
Interval: 1 minute
Keyword: "Get started for free"
(must be present)
Expected: 200

This catches: CDN issues, build failures that ship a placeholder page, ISR pages that 200-but-show-the-wrong-content, accidental noindex, removed content.
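Under the hood, a keyword monitor's pass/fail decision is simple: the status must match and the body must contain the phrase. A minimal sketch of that logic (`evaluateProbe` is an illustrative name, not any monitoring product's API):

```typescript
// Sketch of the decision a keyword monitor makes on each probe.
// evaluateProbe is a hypothetical helper, shown for illustration only.
interface ProbeResult {
  ok: boolean;
  reason?: string;
}

function evaluateProbe(
  status: number,
  body: string,
  keyword: string,
  expectedStatus = 200,
): ProbeResult {
  if (status !== expectedStatus) {
    return { ok: false, reason: `expected ${expectedStatus}, got ${status}` };
  }
  // The keyword check is what catches 200-but-wrong-content pages:
  // stale ISR, placeholder builds, a CDN serving an error page as 200.
  if (!body.includes(keyword)) {
    return { ok: false, reason: `keyword "${keyword}" not found` };
  }
  return { ok: true };
}
```

The keyword check is the important half: a plain status check would happily report a placeholder page or a cached error page as healthy.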
Monitor 2: /api/health
Build a public endpoint that exercises your most fragile dependencies without doing anything customer-facing. For a typical Next.js + Supabase app, it looks like this:
// src/app/api/health/route.ts
import { NextResponse } from "next/server";
import { createSupabaseServiceRole } from "@/lib/supabase/server";

export const runtime = "nodejs";
export const dynamic = "force-dynamic";

export async function GET() {
  try {
    const supabase = createSupabaseServiceRole();
    const { error } = await supabase
      .from("profiles")
      .select("id", { count: "exact", head: true })
      .limit(1);
    if (error) throw error;
    return NextResponse.json({ ok: true, ts: Date.now() });
  } catch (e) {
    return NextResponse.json(
      { ok: false, error: (e as Error).message },
      { status: 503 },
    );
  }
}

Then probe it the same way as the homepage:
URL: https://your-app.com/api/health
Method: GET
Interval: 1 minute
Keyword: "ok":true
(must be present)
Expected: 200

This catches: missing env vars, DB pool exhaustion, expired tokens, broken Supabase RLS policies, network issues between your function and your database.
The smoke-test patterns that catch the rest
Pattern 1: per-route /api health
For routes that exercise a specific dependency (Stripe, OpenAI, S3), consider per-route health endpoints. Example: /api/health/stripe does stripe.balance.retrieve() and returns 200 if it succeeds, 503 otherwise. Probe each. When Stripe is degraded, you find out without a customer telling you.
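If you build several of these, the try/catch boilerplate repeats; a small wrapper keeps each route to one line of real logic. This is a framework-agnostic sketch (the `healthCheck` helper is hypothetical) that you'd adapt to return a `NextResponse` in a real route:

```typescript
// Sketch: shared wrapper for per-route health endpoints.
// healthCheck is a hypothetical helper, not a library function.
type HealthOutcome = { status: number; body: { ok: boolean; error?: string } };

async function healthCheck(
  probe: () => Promise<unknown>, // e.g. () => stripe.balance.retrieve()
): Promise<HealthOutcome> {
  try {
    await probe();
    return { status: 200, body: { ok: true } };
  } catch (e) {
    // 503 (not 500) tells the monitor "dependency unavailable",
    // which keeps these failures distinct from app bugs.
    return { status: 503, body: { ok: false, error: (e as Error).message } };
  }
}
```

In a real `/api/health/stripe` route handler you'd pass `() => stripe.balance.retrieve()` as the probe and return `NextResponse.json(outcome.body, { status: outcome.status })`.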
Pattern 2: synthetic transaction monitoring
Beyond "does the page render", monitor whether a key flow completes. Example: a probe-only signup endpoint that creates a test user, runs the welcome email path with a no-op flag, and rolls back. Heavier to build but catches the worst class of regression — the one where the page loads but the action silently fails.
Pattern 3: the canary phrase
Embed a comment in your homepage HTML that you always update on deploy:
<!-- deploy: 2026-04-30T11:48:00Z -->
Use keyword monitoring to assert it's present. If it disappears, the deploy didn't finish. If the timestamp is older than expected, your build pipeline is stuck.
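Checking the canary can go beyond "is it present": parse the timestamp and compare it to when you expect the deploy to have landed. A minimal sketch, assuming the comment format above (`parseCanary` and `canaryIsStale` are illustrative names):

```typescript
// Sketch: extract the deploy canary from homepage HTML and decide
// whether the deploy is missing or stale. Names are illustrative.
function parseCanary(html: string): Date | null {
  const m = html.match(/<!-- deploy: ([0-9TZ:.-]+) -->/);
  if (!m) return null; // canary missing: the deploy didn't finish
  const ts = new Date(m[1]);
  return isNaN(ts.getTime()) ? null : ts;
}

function canaryIsStale(html: string, now: Date, maxAgeMs: number): boolean {
  const deployedAt = parseCanary(html);
  // A missing or unparsable canary counts as stale: either the deploy
  // never completed or the build pipeline stopped updating it.
  if (deployedAt === null) return true;
  return now.getTime() - deployedAt.getTime() > maxAgeMs;
}
```

A plain keyword monitor covers the "missing" case; the staleness check is the upgrade for "build pipeline stuck", and would live in a small scheduled script of your own.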
What to do when the alert fires
The minimal incident playbook for an indie:
- Don't panic-debug. If the alert fires within 5 minutes of a deploy, the deploy is the suspect. Roll back first, investigate after.
- Roll back via your platform. Vercel: Deployments → previous → "Promote to Production". Netlify and Render have similar one-click rollbacks. This takes 30 seconds.
- Verify the alert clears. The next probe (within 60 seconds) should show green. If not, the bug isn't in your code — it's in your infra.
- Investigate calmly. Diff your last green deploy against the broken one. 90% of the time it's an env var, a migration, or a renamed file.
- Fix forward, then add a regression test. Even if you can't add a unit test for env-var typos, you can extend your /health endpoint to check the specific dependency that broke.
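For env-var regressions specifically, "extend your /health endpoint" can be as small as asserting the variables exist before the real checks run. A sketch, with example variable names (adjust to your stack):

```typescript
// Sketch: a /health sub-check that fails fast when a required env var
// is missing, empty, or was renamed in code but not in the dashboard.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined> = process.env,
): string[] {
  return required.filter((name) => !env[name] || env[name]!.trim() === "");
}

// Inside the health handler (variable names are examples only):
// const missing = missingEnvVars(["DATABASE_URL", "STRIPE_SECRET_KEY"]);
// if (missing.length > 0) throw new Error(`missing env: ${missing.join(", ")}`);
```

This turns the most common class of deploy regression into a named 503 within one probe interval, instead of an anonymous 500 on a customer-facing route.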
The math: why 1-minute interval is the right call here
With 5-minute probes, your average detection time is 2.5 minutes. Add 1-2 minutes to investigate and roll back, and you're at 3.5-4.5 minutes of customer impact per regression.
With 1-minute probes, average detection drops to 30 seconds. Total time-to-rollback drops to ~2 minutes. The difference per regression is 2-3 fewer minutes of customer pain — multiplied by however many regressions you have per quarter.
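The arithmetic above can be written down directly. Assuming failures land uniformly within a probe interval (so average detection lag is half the interval):

```typescript
// Sketch of the impact math above: average detection lag is half the
// probe interval, plus the time to investigate and roll back.
function expectedImpactMinutes(
  probeIntervalMin: number,
  rollbackMin: number, // investigate + one-click rollback
): number {
  return probeIntervalMin / 2 + rollbackMin;
}

// expectedImpactMinutes(5, 1.5) → 4   // 5-minute probes
// expectedImpactMinutes(1, 1.5) → 2   // 1-minute probes
```

The interval shows up halved, which is why cutting it from 5 minutes to 1 saves about 2 minutes of customer impact per regression, not 4.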
For a deeper take on the math, see why 1-minute uptime checks matter.
What this looks like in SitePulse
SitePulse's default settings already match this playbook:
- 1-minute checks on Pro, 5-min on Free.
- Keyword matching is in the UI for every monitor.
- Custom headers (for token-protected /health endpoints).
- Email alerts the moment a probe fails the threshold.
- Public status page included so customers see "yes, we're aware" without you having to send updates.
Set up two monitors — one on your homepage, one on /api/health — and the next deploy regression will email you within a minute, before a customer does.
Frequently asked questions
Why don't CI tests catch this?
CI tests run against a synthetic environment with mocked secrets. Production has real env vars, real DB connections, real third-party APIs, and real cold-start behaviour. The most common production regressions — misnamed env vars, expired tokens, schema migrations that succeeded but broke a query — never appear in CI. CI tests catch logic bugs; external monitors catch environment bugs.
How fast can I realistically catch a regression?
With a 1-minute external probe and a sensible /health endpoint, you'll catch a broken deploy within 60-90 seconds of it going live. That's faster than most users will report it, and fast enough to roll back before the regression compounds (e.g., before 100 customers hit the broken /checkout). 5-minute probes give you 5-7 minutes of detection lag — fine for low-stakes deploys, painful for high-stakes ones.
What should /health actually check?
Just enough to exercise the most fragile parts of your stack without being expensive. For most apps that's: (1) a single Postgres SELECT to confirm DB connectivity, (2) optional: a no-op call to a critical third-party (Stripe, Supabase auth) to catch token expiry. Don't do an LLM call, a webhook send, or anything customer-facing. The endpoint should respond in under 200ms 99% of the time.
Should I monitor /health or my real homepage?
Both, but for different reasons. /health catches stack-level issues (DB, auth, env vars). Homepage monitoring catches CDN, build, and content issues that /health can't see. The minimal indie setup is: 1 monitor on homepage with keyword check, 1 monitor on /health. Two monitors, ~30 seconds of setup, covers 90% of regression types.
What about Vercel / Netlify deploy hooks?
They tell you the deploy succeeded — they cannot tell you the deploy is healthy. A successful build can ship code that 500s on first request because of an env-var mismatch. Use deploy hooks for build status; use external monitoring for runtime health. They're complementary, not substitutes.
Won't I get false alarms during deploys?
Modern platforms (Vercel, Netlify, Render) do zero-downtime deploys, so you shouldn't see false alarms during normal rollouts. If you do, it's usually because your /health endpoint is hitting a connection pool that hasn't warmed up yet. Set a small failure threshold (alert after 2 consecutive failures, not 1) and the noise goes away.
What if I deploy 10 times a day?
Same playbook, just more important. With 10 deploys/day, the probability that one of them ships a regression is much higher than with 1 deploy/week. External monitoring becomes your safety net for fast iteration — without it, you're trusting that every deploy is perfect, which is a strong claim. The cost ($9/mo) is much smaller than the cost of one bad deploy you didn't notice for an hour.
Try SitePulse free
5 monitors, 5-minute checks, email alerts, public status page — free forever. No credit card.