API Alerting Best Practices: When to Alert, Who to Notify, and How to Avoid Alert Fatigue

The first sign that your alerting is broken isn't that you miss an incident. It's when an engineer says "oh, that alert? I stopped looking at it months ago."

Alert fatigue is the slow-motion failure of on-call culture. It happens when alerts are too frequent, too noisy, or too low-signal. Engineers learn — rationally — that most alerts don't require action, and they start ignoring them. Then the one alert that did require action gets ignored too.

Building good API alerting is as much about not alerting as it is about alerting. This guide covers the principles and mechanics of API alerting that's actually useful.


The Fundamental Principle: Every Alert Must Be Actionable

Before you create any alert, ask: If this fires at 2am, what exactly would the on-call engineer do?

If the answer is "check the dashboard and probably go back to sleep," that's not an alert — it's a notification. Route it to a Slack channel for daytime review, not to PagerDuty.

If the answer is "restart service X" or "roll back the deployment from 6pm" or "contact the third-party vendor about their API outage," that's an actionable alert. It wakes someone up for a reason.

This single principle eliminates most alert fatigue. If you can't write a concrete runbook for an alert, it shouldn't page anyone.


Alert Severity: The Three Levels That Actually Work

Most alerting systems support five or more severity levels. In practice, teams work better with three:

Critical (P1) — Page immediately, any time

The service is down or severely degraded. Revenue is being lost. Users cannot complete core flows. Every second matters.

Routing: On-call engineer, immediate escalation path, team lead as backup.

API examples:

- The API returns 5xx on every request, or health checks fail from multiple regions
- Error rate above 5% on a revenue-critical endpoint (checkout, login, payments)
- An expired SSL certificate on a production endpoint

Response time expectation: Acknowledge within 5 minutes, first action within 15.

High (P2) — Page during business hours, night shift optional

Something is wrong and getting worse, but the service is functional. There's a time window before this becomes critical.

Routing: On-call engineer during business hours. Optional page during off-hours if trend is worsening.

API examples:

- p95 latency at 3x its normal baseline and climbing
- Error rate elevated but stable (say, 1-2% on a normally clean endpoint)
- A schema change detected in a third-party API your core flows depend on

Response time expectation: Acknowledge within 1 hour, resolve within 4 hours.

Low (P3) — Slack notification, review next business day

Something worth knowing, but not urgent. No user impact right now.

Routing: Team Slack channel. No paging.

API examples:

- An SSL certificate expiring within 30 days
- A new, unexpected field appearing in a third-party response
- Latency trending slowly upward over days, still well within thresholds

Response time expectation: Review within 1 business day.


What to Alert On for API Monitoring

1. Availability (Uptime)

The most fundamental check: is the API responding at all?

Don't alert on a single failure. Network blips, brief maintenance windows, and load spikes cause transient errors. Alert when failures persist across multiple consecutive checks.

A common pattern:

- 1 failed check: mark the endpoint as suspect and recheck immediately
- 2 consecutive failures: recheck, ideally from a second location
- 3 consecutive failures: fire the alert

This eliminates false positives from one-off network issues while still catching real outages within a few minutes.
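A rough sketch of this pattern in Python (is_healthy, page_on_call, and the thresholds are illustrative stand-ins, not any particular tool's API):

import time
import urllib.request

FAILURE_THRESHOLD = 3   # consecutive failed checks before paging
CHECK_INTERVAL = 60     # seconds between checks

def page_on_call(message):
    # Stand-in for a real paging integration (PagerDuty, Opsgenie, ...).
    print(f"PAGE: {message}")

def is_healthy(url):
    # "Healthy" means: answered with a 2xx within the timeout.
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def monitor(url):
    consecutive_failures = 0
    while True:
        if is_healthy(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # One or two failures are treated as possible blips;
            # only the third consecutive failure pages anyone.
            if consecutive_failures == FAILURE_THRESHOLD:
                page_on_call(f"{url} failed {FAILURE_THRESHOLD} consecutive checks")
        time.sleep(CHECK_INTERVAL)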

2. Response Time / Latency

Latency alerts are tricky because latency is naturally variable. Alert on meaningful degradation, not minor fluctuations.

Good approaches:

- Percentile thresholds: alert on p95 or p99, not the average, so a real shift fires and a few slow outliers don't
- Baseline-relative thresholds: alert when latency reaches some multiple (say, 3x) of that endpoint's normal level
- Sustained degradation: require the threshold to be breached for several consecutive minutes before alerting

Avoid: Alerting on absolute millisecond values without context. An endpoint that normally takes 800ms and is now taking 900ms is not an incident. An endpoint that normally takes 200ms and is now taking 1,800ms probably is.
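A sketch of a baseline-relative p95 check over a rolling window of samples (the class name, 3x factor, and window size are illustrative):

from collections import deque
from statistics import quantiles

class LatencyCheck:
    def __init__(self, baseline_p95_ms, factor=3.0, window=100):
        self.baseline = baseline_p95_ms      # learned from normal traffic
        self.factor = factor                 # how much degradation is "meaningful"
        self.samples = deque(maxlen=window)  # rolling window of recent latencies

    def record(self, latency_ms):
        # Returns True when the window's p95 breaches the threshold.
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False                     # not enough data to judge yet
        p95 = quantiles(self.samples, n=100)[94]   # 95th percentile of the window
        return p95 > self.baseline * self.factor

An endpoint with an 800ms baseline would need to sustain roughly 2,400ms at p95 before this fires; a 100ms regression stays quiet.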

3. HTTP Error Rates

Track the rate of 4xx and 5xx responses, not just their existence.

Set error rate thresholds relative to traffic volume. On a lightly used internal API, 3 errors in a row is significant. On a high-traffic endpoint, a 0.5% error rate might be acceptable while 2% is not.
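A sketch of a rate check that refuses to judge thin traffic (the 2% threshold and 200-request minimum are illustrative):

def error_rate_breached(errors, total, threshold=0.02, min_requests=200):
    # On thin traffic a handful of errors swings the percentage wildly,
    # so don't evaluate the rate threshold at all below the minimum.
    if total < min_requests:
        return False
    return errors / total > threshold

assert error_rate_breached(12, 400)        # 3% of 400 requests: alert
assert not error_rate_breached(3, 50)      # too little traffic to judge: stay quiet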

4. Schema / Structural Changes

For APIs that return structured data (JSON, GraphQL), structural changes are often the most impactful thing to alert on — and the thing most teams don't monitor at all.

When a third-party API removes a field your application depends on, there's no HTTP error. The API looks perfectly healthy. Your application is broken.

Alert on:

- Fields removed from the response
- Field type changes (a number becoming a string, a value becoming null)
- Structural changes to nesting your parser depends on
- New, unexpected fields (usually low severity, but worth knowing about)

Rumbliq monitors these structural changes automatically, diffing each response against a stored baseline and alerting on drift. This catches the class of API breakage that HTTP monitoring misses entirely.
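The core of that technique is easy to sketch. The following is an illustration of response diffing in Python, not Rumbliq's implementation:

def schema_of(value, prefix=""):
    # Flatten a decoded JSON value into {dotted.path: type_name} pairs.
    # Arrays are treated as leaves here, for brevity.
    if isinstance(value, dict):
        fields = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            fields.update(schema_of(child, path))
        return fields
    return {prefix: type(value).__name__}

def diff_schema(baseline, current):
    old, new = schema_of(baseline), schema_of(current)
    return {
        "removed_fields": sorted(old.keys() - new.keys()),
        "added_fields": sorted(new.keys() - old.keys()),
        "type_changes": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

Any non-empty removed_fields or type_changes on a response you depend on is worth at least a P2.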

5. SSL Certificate Expiration

SSL certificate expiry is entirely preventable, fully predictable, and still takes down production services with regularity.

Alert at:

- 30 days before expiry (P3: Slack, plan the renewal)
- 14 days before expiry (P2: someone owns it this week)
- 7 days before expiry (P1: renew it today)

Auto-renewal via Let's Encrypt handles most cases, but third-party APIs you integrate with can let their certs expire too. Monitor the SSL expiry on your critical upstream dependencies, not just your own endpoints.
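A sketch of an expiry check using only Python's standard library (the hostnames are placeholders; wire the output into however you route P1-P3):

import socket
import ssl
import time

def days_until_cert_expiry(host, port=443):
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

# Check your own endpoints and your critical upstream dependencies alike.
for host in ["api.example.com", "api.stripe.com"]:
    days = days_until_cert_expiry(host)
    if days < 7:
        severity = "P1"
    elif days < 14:
        severity = "P2"
    elif days < 30:
        severity = "P3"
    else:
        continue
    print(f"[{severity}] {host} certificate expires in {days:.0f} days")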


Alert Routing: Who Gets Notified and How

Match the channel to the urgency

Severity        Primary Channel               Backup
P1 Critical     PagerDuty / phone call        Manager escalation
P2 High         PagerDuty (business hours)    Slack (off-hours)
P3 Low          Slack channel                 Email digest
Informational   Slack / email

Don't route P3 alerts to PagerDuty. Don't route P1 alerts to Slack (people miss messages when they're asleep).

Route to the right team, not just the on-call person

A schema change in your Stripe integration should go to the payments team, not the generic on-call. A latency spike in the recommendations service should go to the ML team.

In Rumbliq, you can create separate alert destinations for different monitors — route your payment API monitors to the payments Slack channel and PagerDuty escalation policy, and your internal tooling monitors to a lower-urgency channel.

Use webhooks for custom routing logic

If your incident management system has routing rules more sophisticated than "send to this email," use webhooks. Most monitoring tools — including Rumbliq — support webhook alert destinations, where you POST alert data to your own endpoint and route from there. A schema-change payload, for example, might look like this:

{
  "monitor_name": "Stripe Payment Intent API",
  "alert_type": "schema_change",
  "severity": "high",
  "change": {
    "removed_fields": ["payment_method.card.three_d_secure"],
    "added_fields": [],
    "type_changes": []
  },
  "timestamp": "2026-04-01T14:23:11Z"
}

Your webhook receiver can route based on monitor_name, alert_type, time of day, or any other field.
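A minimal receiver sketch using Flask, where the routing table, channel names, and the post_to_slack / page_payments_oncall helpers are all hypothetical:

from flask import Flask, request

app = Flask(__name__)

# Hypothetical routing table: substring of monitor name -> Slack channel.
ROUTES = {
    "Stripe": "#payments-alerts",
    "Recommendations": "#ml-alerts",
}
DEFAULT_CHANNEL = "#api-alerts"

@app.route("/alerts", methods=["POST"])
def receive_alert():
    alert = request.get_json(force=True)
    channel = DEFAULT_CHANNEL
    for keyword, target in ROUTES.items():
        if keyword in alert.get("monitor_name", ""):
            channel = target
            break
    # A schema change on a payment API warrants a page, not just a message.
    if alert.get("alert_type") == "schema_change" and channel == "#payments-alerts":
        page_payments_oncall(alert)    # stand-in for your PagerDuty integration
    else:
        post_to_slack(channel, alert)  # stand-in for your Slack integration
    return "", 204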


Preventing Alert Fatigue: Practical Techniques

1. Require a runbook for every alert

Before an alert goes live, someone must write the runbook. Even if it's three bullet points: "1. Check status page. 2. Check error logs. 3. If ongoing >10min, contact vendor support."

Requiring a runbook forces the team to think about whether the alert is actually actionable. If you can't write a runbook, the alert shouldn't exist.

2. Audit your alerts quarterly

Schedule a 30-minute team review every quarter. For each alert: How many times did it fire in the past 90 days? How many of those were real incidents? What was the action taken?

Any alert that fired >20 times with zero action taken is either:

- mis-tuned (the threshold is wrong, so fix it), or
- not actually an alert (it's a notification, so downgrade it to Slack or delete it).

3. Use time windows and flap detection

An endpoint that alternates between healthy and failing every 30 seconds is "flapping." Flapping generates alert storms that train engineers to ignore alerts.

Use minimum duration requirements (must be unhealthy for 3+ consecutive minutes before alerting) and flap detection (suppress alerts if the state has changed more than 5 times in the past 10 minutes).
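A sketch of the flap-suppression half, using the numbers above; pair it with the consecutive-failure logic from the availability section:

import time
from collections import deque

class FlapDetector:
    def __init__(self, max_transitions=5, window_seconds=600):
        self.max_transitions = max_transitions   # more than 5 state changes...
        self.window = window_seconds             # ...in 10 minutes = flapping
        self.transitions = deque()
        self.last_state = None

    def should_alert(self, healthy):
        now = time.time()
        if self.last_state is not None and healthy != self.last_state:
            self.transitions.append(now)
        self.last_state = healthy
        # Forget state changes that fell out of the window.
        while self.transitions and now - self.transitions[0] > self.window:
            self.transitions.popleft()
        if len(self.transitions) > self.max_transitions:
            return False    # flapping: suppress rather than storm
        return not healthy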

4. Alert on rates, not raw counts

"5 errors in the past minute" might be significant for a low-traffic internal API and completely normal for a high-traffic endpoint. "3% error rate" is contextually meaningful regardless of traffic volume.

Use rates and percentages for error and latency alerts. Reserve count-based alerts for things that should truly never happen (e.g., "more than 0 authentication bypass events").

5. Consolidate duplicate signals

If your payment API is down, you probably have:

- an availability alert on the payment endpoint itself
- a latency alert (timeouts register as slowness first)
- an error rate alert on the same endpoint
- availability alerts from services downstream of payments

That's four pages for one incident. Group related alerts and suppress duplicates. Most incident management platforms support alert grouping — use it.
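A sketch of window-based grouping (the service field and both helpers are hypothetical; most incident platforms do this for you):

import time

SUPPRESSION_WINDOW = 300   # seconds: alerts for the same key get grouped
last_page_at = {}

def maybe_page(alert):
    # Group on a shared service key so availability, latency, and error
    # alerts for one outage collapse into a single page.
    key = alert.get("service") or alert.get("monitor_name", "unknown")
    now = time.time()
    last = last_page_at.get(key)
    if last is not None and now - last < SUPPRESSION_WINDOW:
        attach_to_incident(key, alert)   # stand-in: add context, don't re-page
        return
    last_page_at[key] = now
    page_on_call(alert)                  # stand-in for your paging integration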

6. Schedule maintenance windows

Planned deployments, dependency upgrades, and infrastructure maintenance will cause transient failures. Pre-schedule maintenance windows in your monitoring system to suppress alerts during known-downtime periods.

Failing to do this means engineers learn that some alerts during deploy windows are false positives — which trains them to be skeptical of all alerts during deploys. That skepticism then causes missed incidents.
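A sketch of window-based suppression (the schedule and helpers are hypothetical; prefer your monitoring tool's built-in maintenance windows when it has them):

from datetime import datetime, timezone

# Hypothetical schedule: (start, end) pairs in UTC for planned work.
MAINTENANCE_WINDOWS = [
    (datetime(2026, 4, 2, 2, 0, tzinfo=timezone.utc),
     datetime(2026, 4, 2, 4, 0, tzinfo=timezone.utc)),
]

def in_maintenance_window(now=None):
    now = now or datetime.now(timezone.utc)
    return any(start <= now <= end for start, end in MAINTENANCE_WINDOWS)

def dispatch(alert):
    if in_maintenance_window():
        record_suppressed(alert)   # stand-in: keep a record, page nobody
    else:
        page_on_call(alert)        # stand-in for the normal alert path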


Writing Effective Alert Messages

A good alert message answers five questions immediately:

  1. What broke? (the specific service, endpoint, or field)
  2. How bad is it? (error rate, duration, impact)
  3. When did it start? (timestamp)
  4. What changed recently? (recent deployments, dependency changes)
  5. What's the first step? (link to runbook)

Bad alert:

ALERT: API error detected
Monitor: api-monitor-1
Status: Failing

Good alert:

[P1] Stripe Payment API — Schema Change Detected

Field removed: payment_method.card.three_d_secure (was String)
New fields: payment_method.card.networks.preferred (String)
Detected: 2026-04-01 14:23 UTC (3 minutes ago)
Monitor: Stripe Charge Create

Impact: Clients reading three_d_secure may get null/error
Runbook: https://internal.wiki/runbooks/stripe-schema-change
Dashboard: https://app.rumbliq.com/monitors/mon_abc123

The second alert is actionable in 10 seconds. The first requires investigation before you even know what you're dealing with.


A Practical Alert Audit Template

Use this to evaluate each existing alert:

Question                                                Yes    No
Is there a written runbook?                             Keep   Write one or delete
Did it produce a real incident in the past 90 days?     Keep   Review threshold
Is the routing correct (right team, right severity)?    Keep   Fix routing
Does the alert message explain what happened?           Keep   Improve message
Does it have flap protection / duration requirements?   Keep   Add

Run this review on your full alert inventory and you'll typically eliminate 30-50% of alerts while dramatically improving the signal quality of what remains.


Summary

Good API alerting is disciplined, not comprehensive. Every alert should be:

- actionable, with a written runbook
- routed to the right team at the right severity
- rate-based and protected against flapping
- written so the responder knows what broke and what to do within seconds

The most important alerts for API monitoring are availability (with flap detection), latency degradation (rate-based, not absolute), error rate thresholds, and — especially for third-party integrations — structural schema changes. Rumbliq handles all four with configurable thresholds and flexible alert routing to Slack, webhooks, and email.

Start with fewer, better alerts. You can always add more. It's much harder to convince engineers to start trusting alerts again after you've trained them to ignore them.

Start monitoring your APIs free → 25 monitors, 3 sequences, no credit card required.