API monitoring guide
Practical patterns for monitoring REST, GraphQL, and webhook APIs — status codes, keyword matching, auth headers, and rate-limit safety.
Why API monitoring is different from website monitoring
Monitoring a marketing page is straightforward: probe the URL, expect 200, look for some keyword in the body. Done.
APIs add three complications:
- Authentication: most useful endpoints require it.
- Status codes lie: GraphQL returns 200 even when the query failed; some APIs return 200 with { "error": "..." }.
- Rate limits: probing a rate-limited endpoint at 1 probe/minute can eat 25-50% of a strict limiter's budget.
This guide walks through the patterns that handle all three without building a SaaS observability stack.
The default API monitoring shape
Before getting clever, set up the boring version that catches 80% of outages:
URL: https://your-api.com/health
Method: GET
Interval: 1 minute (Pro) or 5 minutes (Free)
Expected: 200-299
Timeout: 15 seconds
Keyword: (optional) "ok":true must be present
That's the baseline. The rest of this doc is variations on it for cases where the baseline isn't enough.
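To make the baseline concrete, here is a minimal sketch of how a monitor might turn one finished probe into an up/down verdict. The `MonitorConfig` and `ProbeResult` shapes and the `isUp` function are hypothetical names invented for this example; they are not SitePulse's actual implementation.

```typescript
// Hypothetical shapes for illustration; not SitePulse's real data model.
interface MonitorConfig {
  url: string;
  method: "GET" | "POST" | "HEAD";
  expectedMin: number; // lowest status code that counts as up
  expectedMax: number; // highest status code that counts as up
  timeoutMs: number;
  keyword?: string; // optional substring that must appear in the body
}

interface ProbeResult {
  status: number;
  body: string;
  elapsedMs: number;
}

// A probe is "up" only if it finished in time, the status is in range,
// and (when a keyword is configured) the keyword appears in the body.
function isUp(config: MonitorConfig, result: ProbeResult): boolean {
  if (result.elapsedMs > config.timeoutMs) return false;
  if (result.status < config.expectedMin || result.status > config.expectedMax) {
    return false;
  }
  if (config.keyword !== undefined && !result.body.includes(config.keyword)) {
    return false;
  }
  return true;
}

// The baseline from this section, expressed in the hypothetical shape.
const baseline: MonitorConfig = {
  url: "https://your-api.com/health",
  method: "GET",
  expectedMin: 200,
  expectedMax: 299,
  timeoutMs: 15000,
  keyword: '"ok":true',
};
```

All three conditions must hold, so a 200 response whose body lacks the keyword still counts as down.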
Build a /health endpoint, don't probe random routes
The single most important pattern: have a dedicated /health endpoint that exercises your real dependencies but is cheap, no-auth, and idempotent.
A minimal Next.js + Postgres example:
// src/app/api/health/route.ts
import { NextResponse } from "next/server";
import { sql } from "@/lib/db";

export const runtime = "nodejs";
export const dynamic = "force-dynamic";

export async function GET() {
  try {
    // 1. Trivial DB roundtrip: catches connection-pool issues
    await sql`select 1`;
    // 2. (Optional) check a critical dep without doing real work,
    //    e.g., Stripe.balance.retrieve(), or a no-op Supabase auth call
    return NextResponse.json({ ok: true, ts: Date.now() });
  } catch (e) {
    return NextResponse.json(
      { ok: false, error: (e as Error).message },
      { status: 503 },
    );
  }
}
Why this works:
- No auth needed, so monitors don't need to carry tokens.
- Cheap: a single SELECT 1 costs the database almost nothing.
- Realistic: it actually exercises your DB connection pool, env vars, and runtime, the things most likely to break on a bad deploy.
- 503 on failure: a clear signal that the underlying stack, not the network, is the problem.
HEAD vs GET vs POST
For most monitoring, use GET. Reasons:
- Many APIs return 405 to HEAD, breaking your probe for no reason.
- Some frameworks skip middleware on HEAD, so HEAD wouldn't exercise the same path GET does.
- GET is what real clients send. Probe with what your users actually use.
Use POST only if you have a webhook receiver that requires it, and you've configured a test payload that the receiver can safely accept.
Status code expectations
SitePulse's default expected range is 200-399. That covers:
- 200-299: everything is fine.
- 301/302/307/308: redirects (e.g., HTTP→HTTPS, locale routing). The probe follows the redirect by default.
- 304: not modified — also "up" if your endpoint supports conditional GET.
Narrow this if you have specific assertions:
- 200-299 only: catch unintended redirects (e.g., accidental HTTPS→HTTP fallback).
- 401-401 only: useful for verifying that a protected endpoint is still correctly rejecting unauthenticated requests.
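The range notation above ("200-399", "401-401") can be checked with a few lines of code. This is a sketch only: `parseStatusRange` and `statusInRange` are hypothetical helpers, and the "min-max" string format is assumed from how the ranges are written in this guide.

```typescript
// Hypothetical helper: parse a range written as "200-399" or "401-401".
function parseStatusRange(range: string): { min: number; max: number } {
  const match = /^(\d{3})-(\d{3})$/.exec(range.trim());
  if (match === null) {
    throw new Error(`Invalid status range: ${range}`);
  }
  const min = Number(match[1]);
  const max = Number(match[2]);
  if (min > max) {
    throw new Error(`Range start exceeds end: ${range}`);
  }
  return { min, max };
}

// True when the status code falls inside the inclusive range.
function statusInRange(status: number, range: string): boolean {
  const { min, max } = parseStatusRange(range);
  return status >= min && status <= max;
}
```

With these helpers, a 302 passes "200-399" but fails "200-299", and a 200 fails "401-401", which is exactly the signal you want when a protected endpoint stops rejecting anonymous requests.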
Keyword matching: when status code isn't enough
The classic API failure: your endpoint returns 200 with { "error": "DB connection lost" }. Status-code monitoring sees green. Customers see red.
The fix is keyword matching. SitePulse supports two modes:
- Must be present: assert a phrase exists, e.g., "ok":true for a JSON heartbeat.
- Must NOT be present: assert a phrase is absent, e.g., "error" in the response body.
For most JSON APIs, "must be present" is the right call:
Keyword: "ok":true
Mode: Must be present
Result: If the response body doesn't contain "ok":true, the monitor flips to down regardless of status code.
GraphQL: assume status code is a lie
GraphQL endpoints almost always return 200 OK even when the query failed. The error is in the JSON body under errors.
Two approaches:
Option A: keyword-match the response body
POST a probe query (introspection or a trivial known-good query) and assert the response contains data and does not contain errors.
URL: https://your-api.com/graphql
Method: POST
Body: {"query": "{ __typename }"}
Headers: Content-Type: application/json
Keyword: "data"
Mode: Must be present
Option B: a sibling /health endpoint
Build an HTTP /api/health endpoint that runs the GraphQL introspection query server-side and returns 200/503 based on whether it succeeded. Then monitor /health with status-code monitoring like any boring REST endpoint.
Option B is usually cleaner — it abstracts the GraphQL quirks behind a normal HTTP shape.
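For Option B, the server-side check can inspect the GraphQL response structurally instead of by substring. The sketch below maps a raw response body to the status code a /health endpoint might return; `graphqlHealthStatus` is a hypothetical helper, and treating any non-empty `errors` array as a failure is a design choice of this example. A structural check matters because a failed GraphQL response can still carry a `data` key set to null, which a plain "data" keyword match would accept.

```typescript
// Hypothetical helper for a server-side health check: map a raw GraphQL
// response body to the HTTP status the /health endpoint should return.
function graphqlHealthStatus(rawBody: string): 200 | 503 {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return 503; // body is not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return 503;

  const response = parsed as { data?: unknown; errors?: unknown };
  // Any non-empty errors array counts as a failure in this sketch.
  const hasErrors = Array.isArray(response.errors) && response.errors.length > 0;
  // data must be present and non-null to count as a success.
  const hasData = response.data !== undefined && response.data !== null;
  return hasData && !hasErrors ? 200 : 503;
}
```

A route handler would run the probe query, pass the response text to this function, and return the resulting status, so the monitor only ever sees an ordinary 200 or 503.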
Auth: tokens vs no-auth /health
For protected endpoints, you have two options:
Option A: probe a no-auth /health endpoint
Build /api/health that does the same critical work as an auth-protected endpoint, without requiring a token. This is what most teams do, because it avoids token rotation pain.
Option B: include an auth header in the probe
Generate a long-lived API token specifically for monitoring (one that has read-only or limited permissions), and add it to your monitor's custom headers:
URL: https://your-api.com/api/me
Method: GET
Custom header: Authorization: Bearer <monitoring-token>
Expected: 200
SitePulse supports custom headers per monitor. The catch: tokens expire or get revoked, and rotating them is a manual chore. Pick Option A unless you have a specific reason.
Rate limits: don't probe what you can't spare
A 1-minute probe = 60 requests/hour from one source. For most APIs that's fine. For aggressive rate limiters (e.g., 100 requests/hour per IP), it's 60% of your budget.
How to handle it:
- Probe /health, which bypasses your rate limiter. Most rate limiters apply to authenticated routes only, leaving /health open by design.
- Allowlist the SitePulseBot user-agent so probes don't count against per-IP limits. SitePulse identifies itself as SitePulseBot/1.0 (+https://sitepulse.satosushi.co).
- Increase the probe interval for endpoints with tight limits. 5-min probes = 12 requests/hour, much friendlier.
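The budget arithmetic and the user-agent exemption above can be sketched as follows. All three function names are hypothetical, and the prefix check assumes the user-agent string always begins with `SitePulseBot/`, as in the example string quoted above.

```typescript
// Requests per hour generated by one monitor at a given interval.
function probesPerHour(intervalMinutes: number): number {
  return 60 / intervalMinutes;
}

// Fraction of an hourly rate limit consumed by the monitor alone.
function budgetShare(intervalMinutes: number, limitPerHour: number): number {
  return probesPerHour(intervalMinutes) / limitPerHour;
}

// Hypothetical exemption check for a rate-limiter middleware:
// skip counting requests whose user-agent starts with "SitePulseBot/".
function isMonitorProbe(userAgent: string | null): boolean {
  return userAgent !== null && userAgent.startsWith("SitePulseBot/");
}
```

Against a limit of 100 requests/hour, `budgetShare(1, 100)` is 0.6 and `budgetShare(5, 100)` is 0.12. Keep in mind that a user-agent header is trivial to spoof, so an exemption based on it alone is best limited to cheap routes such as /health.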
Webhooks: monitor the receiver, not the trigger
Webhook endpoints are designed to accept POST from a specific third party. You can't probe them with GET (most return 404/405) and you can't POST a fake payload (signature verification will reject it).
The clean pattern is a sibling /health endpoint:
POST /webhook/stripe          (real Stripe sends here)
GET  /webhook/stripe/health   (your monitor probes here)
The /health endpoint exercises the same downstream logic the webhook handler does — DB connection, internal queue, etc. — but without needing a real signed payload. If /health is up, the webhook receiver is almost certainly up too.
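One way to structure the sibling endpoint is to run the same dependency checks the webhook handler relies on and fold the results into a single response. The sketch below shows only the folding step; `DependencyCheck`, `HealthResponse`, and `buildHealthResponse` are hypothetical names, and how each check is performed (DB ping, queue ping) is left to your own code.

```typescript
// Hypothetical result of one check the webhook handler depends on.
interface DependencyCheck {
  name: string; // e.g. "db" or "queue"
  healthy: boolean;
}

interface HealthResponse {
  status: 200 | 503;
  body: string;
}

// Return 200 with {"ok":true} only when every dependency is healthy;
// otherwise return 503 and list the failed dependencies by name.
function buildHealthResponse(checks: DependencyCheck[]): HealthResponse {
  const failed = checks.filter((c) => !c.healthy).map((c) => c.name);
  const ok = failed.length === 0;
  return {
    status: ok ? 200 : 503,
    body: JSON.stringify(ok ? { ok: true } : { ok: false, failed }),
  };
}
```

Naming the failed dependency in the body makes the alert more actionable, and the compact `{"ok":true}` body still works with a "must be present" keyword check.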
Timeout calibration
Default 15 seconds works for most APIs. Specific cases:
- 5-10 sec: latency-sensitive APIs where slow responses are a regression you want flagged.
- 15-20 sec: typical REST endpoints with one or two DB queries.
- 30-60 sec: endpoints that legitimately do heavy work (image processing, batch jobs). Better still: split out a /health endpoint that doesn't do the heavy work and probe that.
What good API monitoring looks like (checklist)
- One /api/health endpoint per service.
- /health does a real DB query and (optionally) a critical-dep touch.
- /health is no-auth and rate-limit-exempt.
- Probe at 1-min interval (or 5-min on Free tiers).
- Status range 200-299, GET method.
- Keyword match on a known-good substring ("ok":true).
- 15-second timeout.
- Failure threshold 1-2 (1 alerts on the first failed probe; 2 tolerates single-blip noise).
- Email alerts (or webhooks if you have a Slack/Discord channel).
- Test the alerting path once — manually break /health, verify the email arrives.
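The failure-threshold item in the checklist can be illustrated with a short sketch. It assumes the threshold means "this many consecutive failed probes", which is a common convention but is not spelled out in this guide; `shouldAlert` is a hypothetical name.

```typescript
// history: probe outcomes, oldest first; true means the probe passed.
// Alert once the most recent run of failures reaches the threshold.
function shouldAlert(history: boolean[], failureThreshold: number): boolean {
  let consecutiveFailures = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i]) break; // a passing probe ends the failure streak
    consecutiveFailures++;
  }
  return consecutiveFailures >= failureThreshold;
}
```

Under this assumption, a history of pass, pass, fail triggers an alert at threshold 1 but not at threshold 2, so the second setting costs one extra probe interval of detection time in exchange for ignoring one-off blips.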
Setting this up in SitePulse
Every pattern in this guide works in SitePulse's monitor settings — keyword matching, custom headers, configurable status ranges, custom timeouts, failure thresholds.
Sign up free and the first monitor you add can be your /health endpoint. Free tier covers 5 monitors at 5-min intervals — enough to apply this playbook to a typical indie SaaS without paying anything.
Frequently asked questions
Should I use HEAD or GET for API monitoring?
GET, almost always. HEAD is cheaper but many APIs handle it differently — some return 405 Method Not Allowed, some return 200 with no body, some skip middleware that GET would exercise. GET is what real clients use, so it's what you should probe with. Save the bandwidth on a /health endpoint that returns 50 bytes of JSON instead of fighting HEAD inconsistencies.
What status codes should count as 'up'?
Default: 200-399. This covers OK, redirects, and 'created/accepted' codes. Status 4xx (client error) and 5xx (server error) count as down by default in SitePulse. The exception: if your endpoint legitimately returns a non-200 (e.g., a 401 endpoint that you want to verify still 401s correctly), narrow the expected range to that specific code in monitor settings.
How do I monitor an authenticated endpoint?
Two approaches. Easiest: build a public /api/health endpoint that does the same critical work behind the scenes (DB query, third-party call) but doesn't require auth. Monitor that. Alternative: generate a long-lived API token, store it in your monitor's custom headers (Authorization: Bearer <token>), and probe the real auth-protected endpoint. Token rotation becomes a chore, so most teams pick option 1.
Do GraphQL endpoints need different treatment?
Yes. GraphQL endpoints typically return 200 OK even when the query failed; the error is in the JSON body. Plain HTTP-status monitoring will miss real failures. Use keyword matching to assert the response body contains an expected field (e.g., "data") and does not contain "errors". Or build an HTTP /health endpoint that runs a GraphQL introspection query server-side and returns 200/503.
How do I avoid rate-limit issues?
Three things: (1) probe a /health endpoint that bypasses your rate limiter, not a customer-facing route; (2) if you must probe a rate-limited endpoint, ensure your monitor's request volume (60/hour at 1-min interval) is well under the limit; (3) some monitor services let you set custom request methods or paths to avoid your normal rate-limiter signature. SitePulse identifies itself with a SitePulseBot user-agent so you can allowlist it.
What about webhook endpoints?
Webhooks are tricky because most return 200 to a POST and 404/405 to a GET. You can't probe a webhook receiver without sending a real (or test) payload. Two options: (a) build a sibling /health endpoint that exercises the same downstream logic without needing a real webhook; (b) configure your webhook receiver to accept GET as a no-op-then-200 health probe. Option (a) is cleaner.
How long should the probe timeout be?
Default 15 seconds is reasonable for most APIs. Set it lower (5-10s) for endpoints that should always be fast, because a generous timeout hides latency degradation. Set it higher (30-60s) for endpoints that legitimately do heavy work (image processing, batch operations). The point of a timeout is 'this is taking longer than the user would tolerate', so calibrate it to your endpoint's normal SLA.
Try SitePulse free
5 monitors, 5-minute checks, email alerts, public status page — free forever. No credit card.