How to Detect Webhook Delivery Failures Before Your Customers Do

Webhooks are a terrible mechanism for critical business logic — except there's nothing better.

When Stripe delivers a payment_intent.succeeded event, your application needs to fulfil an order, provision access, and send a confirmation email. When GitHub fires a push event, your CI pipeline needs to trigger. When Twilio delivers message.delivered, your user gets marked as notified. If any of these webhooks silently stop arriving, your system doesn't crash — it just stops doing the right thing.

The signature failure mode of webhook problems: everything looks normal in your monitoring dashboard. Your application is healthy. Your API endpoints return 200s. But somewhere downstream, orders aren't fulfilling, pipelines aren't triggering, and customers are confused about why nothing happened.

You're almost always the last to know.

This guide is about flipping that. Here's how to build a webhook monitoring setup that detects failures before they become customer complaints.

Why Webhook Failures Are Hard to Detect

Webhook failures are hard to detect because webhooks invert the normal request/response model.

With standard API calls, you initiate the request. If it fails, you get an error. You log it. Your monitoring catches the error rate.

With webhooks, the third-party initiates the request. If they stop initiating it — or if the payload structure changes — your application does nothing. There's no error to catch. No error rate to spike. Just silence.

The Three Ways Webhooks Fail

Delivery failure. The webhook never arrives at your endpoint. Either your endpoint is down, the third-party service has a delivery outage, or something between them (firewall, DNS, load balancer) is blocking the request. This is the most obvious failure — your inbound request rate drops to zero.

Payload schema drift. The webhook arrives. Your endpoint returns 200. But the payload structure changed: a field was renamed, a nested object was restructured, an enum gained new values. Your handler processes the webhook "successfully" but with incorrect or incomplete data. This failure mode can go unnoticed for weeks.

Processing failure. The webhook arrives, but your handler code crashes or produces incorrect side effects. You may be returning 200 (swallowing the error) without successfully completing the business logic. The third-party thinks the webhook was delivered successfully. You have no idea it failed.

Each failure type requires a different detection approach.

How to Monitor Webhook Delivery Rate

The most fundamental webhook metric: are webhooks arriving at the expected rate?

If you process Stripe payment_intent.succeeded events and you normally receive 50 per hour during business hours, a sudden drop to zero should fire an alert. Not after 2 hours — within 10 minutes.

Set Up Inbound Rate Monitoring

At the application level, instrument your webhook endpoint to emit a counter metric for every received event:

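app.post('/webhooks/stripe', async (c) => {
  // Parse the body first so the event type can be attached to the metric.
  // (Signature verification, which needs the raw body, is omitted here.)
  const event = await c.req.json();
  metrics.increment('webhook.received', { provider: 'stripe', event: event.type });

  // ... handle webhook

  return c.json({ received: true });
});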

At the infrastructure level, your load balancer or API gateway likely has access logs. Webhook endpoint request rates are visible here without any code changes.

Monitor for absence, not just presence. Most monitoring tools let you alert when a metric drops below a threshold. Set a low-watermark alert: if webhook receipt rate drops below X per hour during business hours, fire an alert.
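As a rough illustration of a low-watermark check, the sketch below counts receipts inside a sliding window and flags when the count falls below a floor. The names isBelowLowWatermark, windowMs, and minEvents are made up for this example, and the threshold values are placeholders rather than recommendations.

```typescript
// Hypothetical helper: returns true when fewer than `minEvents` webhooks
// were received in the `windowMs` milliseconds ending at `now`.
function isBelowLowWatermark(
  receivedAt: number[], // epoch-millisecond timestamps of received webhooks
  now: number,
  windowMs: number,
  minEvents: number,
): boolean {
  const windowStart = now - windowMs;
  const recent = receivedAt.filter((t) => t > windowStart && t <= now);
  return recent.length < minEvents;
}

// Example: alert if fewer than 5 events arrived in the last 10 minutes.
const TEN_MINUTES = 10 * 60 * 1000;
const checkTime = Date.UTC(2024, 0, 15, 14, 0, 0);
const arrivals = [
  checkTime - 1 * 60 * 1000,  // 1 minute ago
  checkTime - 2 * 60 * 1000,  // 2 minutes ago
  checkTime - 45 * 60 * 1000, // 45 minutes ago, outside the window
];
const shouldAlert = isBelowLowWatermark(arrivals, checkTime, TEN_MINUTES, 5);
```

In practice most metrics backends can express the same rule as a threshold alert on the webhook.received counter, so a hand-rolled check like this is mainly useful where no such backend exists.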

Verify Provider Delivery Logs

Every major webhook provider exposes delivery logs:

Stripe: Dashboard → Developers → Webhooks → Event delivery logs
GitHub: Repository → Settings → Webhooks → Recent deliveries
Twilio: Monitor console → Debugger → Webhook logs

Check these logs whenever you suspect a delivery problem. If the provider shows successful delivery but your application has no record of receiving the event, the problem is in your infrastructure (load balancer, server, firewall).
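One way to make this comparison routine rather than manual is to reconcile event IDs. The sketch below assumes you can export the IDs the provider reports as delivered and the IDs your application logged; findUnrecordedDeliveries and the sample IDs are illustrative only.

```typescript
// Hypothetical reconciliation: which events does the provider report as
// delivered that the application never recorded?
function findUnrecordedDeliveries(
  providerDeliveredIds: string[],
  locallyRecordedIds: string[],
): string[] {
  const recorded = new Set(locallyRecordedIds);
  return providerDeliveredIds.filter((id) => !recorded.has(id));
}

// Made-up sample data.
const providerSide = ["evt_001", "evt_002", "evt_003"];
const appSide = ["evt_001", "evt_003"];
const unrecorded = findUnrecordedDeliveries(providerSide, appSide);
```

A non-empty result points at something between the provider and your handler, which is the infrastructure case described above.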

How to Monitor for Webhook Payload Schema Drift

Payload schema drift is the most dangerous webhook failure because it's invisible to standard monitoring. Your endpoint returns 200. The provider marks delivery as successful. Somewhere in your business logic, data is silently wrong.

Schema Validation on Ingestion

Add explicit schema validation at your webhook endpoint, before any business logic runs:

import { z } from 'zod';

const StripePaymentIntentSchema = z.object({
  id: z.string(),
  type: z.string(),
  data: z.object({
    object: z.object({
      id: z.string(),
      amount: z.number(),
      currency: z.string(),
      status: z.enum(['requires_payment_method', 'requires_confirmation', 'requires_action', 'processing', 'requires_capture', 'canceled', 'succeeded']),
      payment_method: z.string().nullable(),
    })
  })
});

app.post('/webhooks/stripe', async (c) => {
  const body = await c.req.json();
  const result = StripePaymentIntentSchema.safeParse(body);
  
  if (!result.success) {
    // Schema drift detected — alert immediately
    logger.error('Webhook schema mismatch', { 
      errors: result.error.issues,
      payload: body 
    });
    metrics.increment('webhook.schema_mismatch', { provider: 'stripe' });
    
    // Still return 200 so provider doesn't retry, but alert your team
    return c.json({ received: true });
  }
  
  // Process with validated payload
  await processPaymentIntent(result.data);
  return c.json({ received: true });
});

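This catches schema drift at the point of delivery. When Stripe removes or renames a field the schema requires, changes a field's type, or sends a status value missing from the enum, validation fails on the first affected delivery and you know immediately. Note that z.object() ignores unknown keys by default, so a newly added field passes silently unless you opt into .strict().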
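To see what drift looks like in concrete terms, the sketch below compares the field paths and primitive types of two payloads without any library. It is a simplified illustration and not a replacement for the zod schema above; describeShape and diffShapes are hypothetical helpers, the sample payloads are invented, and arrays are treated as opaque values.

```typescript
type Shape = Record<string, string>;

// Flatten an object into "dotted.path" -> type name.
function describeShape(value: unknown, prefix = "", out: Shape = {}): Shape {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      describeShape(record[key], prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value === null ? "null" : Array.isArray(value) ? "array" : typeof value;
  }
  return out;
}

// Report fields that disappeared, changed type, or appeared.
function diffShapes(expected: unknown, actual: unknown): string[] {
  const before = describeShape(expected);
  const after = describeShape(actual);
  const changes: string[] = [];
  for (const path of Object.keys(before)) {
    if (!(path in after)) {
      changes.push(`missing: ${path}`);
    } else if (before[path] !== after[path]) {
      changes.push(`type changed: ${path} (${before[path]} -> ${after[path]})`);
    }
  }
  for (const path of Object.keys(after)) {
    if (!(path in before)) changes.push(`added: ${path}`);
  }
  return changes;
}

// Invented payloads: one field changed type, one was renamed.
const knownGood = { id: "pi_1", amount: 2000, payment_method: "pm_1" };
const incoming = { id: "pi_1", amount: "2000", payment_method_id: "pm_1" };
const drift = diffShapes(knownGood, incoming);
```

Because this compares one sample against another, a nullable field that happens to be null in one payload and a string in the other shows up as a type change. A declared schema handles that case better, which is one reason to prefer it for the real handler.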

Delivery Monitoring with Rumbliq

Schema validation catches drift reactively — after it's already happening. To catch delivery failures (the silence that standard monitoring can't see), use Rumbliq as a passive relay in front of your real endpoint.

You create a webhook endpoint in Rumbliq and point your provider at the Rumbliq ingest URL instead of your own. Every delivery flows through Rumbliq: it logs the delivery, forwards the payload (signature headers intact) to your real endpoint, records the forwarded response status and latency, and — if you tell it how often you expect deliveries — alerts you when an expected delivery goes silent past a grace period.

# Create a webhook endpoint. forward_url is your real handler.
# expected_interval_seconds tells Rumbliq how often to expect deliveries;
# omit it (or set 0) for pure pass-through logging with no silence alerts.
curl -X POST https://rumbliq.com/v1/webhook-endpoints \
  -H "Authorization: Bearer dk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Stripe payment events",
    "forward_url": "https://api.yourapp.com/webhooks/stripe",
    "source": "stripe",
    "expected_interval_seconds": 3600,
    "grace_seconds": 300
  }'

The response includes a token. Point your provider's webhook URL at the Rumbliq ingest endpoint for that token:

https://rumbliq.com/v1/webhook-in/<token>

Now Stripe (or GitHub, Twilio, etc.) delivers to Rumbliq, Rumbliq forwards to your handler, and you get a delivery log with forwarded status, latency, and payload size for every event. If no delivery arrives within expected_interval_seconds + grace_seconds, Rumbliq fires a breaking-severity alert — that's your silent-failure detector.
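The silence rule described above reduces to a single comparison. The sketch below illustrates that arithmetic only and says nothing about how Rumbliq implements it internally; isDeliveryOverdue is a hypothetical name, and its parameters mirror the expected_interval_seconds and grace_seconds fields from the request.

```typescript
// A delivery is overdue once more than interval + grace seconds have
// passed since the last one. An interval of 0 disables the check,
// mirroring the pass-through mode mentioned in the request comments.
function isDeliveryOverdue(
  lastDeliveryAtMs: number,
  nowMs: number,
  expectedIntervalSeconds: number,
  graceSeconds: number,
): boolean {
  if (expectedIntervalSeconds <= 0) return false;
  const deadlineMs = lastDeliveryAtMs + (expectedIntervalSeconds + graceSeconds) * 1000;
  return nowMs > deadlineMs;
}

// With a 3600 s interval and 300 s grace, the deadline is 65 minutes out.
const lastSeen = Date.UTC(2024, 0, 15, 12, 0, 0);
const stillFine = isDeliveryOverdue(lastSeen, lastSeen + 64 * 60 * 1000, 3600, 300);
const overdue = isDeliveryOverdue(lastSeen, lastSeen + 66 * 60 * 1000, 3600, 300);
```

Choosing the interval is the hard part: set it from the longest quiet gap you see in normal traffic, otherwise overnight lulls will page someone.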

Additionally, monitor the API endpoints behind your webhooks. When Stripe fires payment_intent.succeeded, your handler often makes follow-up API calls (GET /v1/payment_intents/{id}, GET /v1/customers/{id}). Monitoring these downstream endpoints for schema drift catches problems before your handler code fails on a live webhook.

Provider-Specific Webhook Monitoring

Stripe Webhooks

Stripe is the highest-stakes webhook source for most SaaS applications. Payment failures compound quickly.

What to monitor:

payment_intent.succeeded and payment_intent.payment_failed delivery rate
customer.subscription.updated and customer.subscription.deleted for lifecycle events
invoice.payment_failed for dunning workflows

Stripe-specific risks:

Stripe's webhook events have expanded significantly with each API version. If you're not on the latest API version, you may receive events with different schemas than current Stripe documentation describes.
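Each Stripe webhook endpoint renders events using a fixed API version (the one set on the endpoint, otherwise your account default), so during an upgrade your handler can receive payloads shaped for an older version than your code expects.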
Stripe retries failed deliveries with exponential backoff, but only for up to 3 days. Events still failing when that window closes are not retried again, and events your handler acknowledged with a 2xx while swallowing an error are never retried at all.

GitHub Webhooks

GitHub webhooks power CI pipelines, deployment systems, and code review automations. Silent failures mean delayed builds and missed review triggers.

What to monitor:

push and pull_request delivery rate during active development hours
Delivery failures in GitHub's webhook log (accessible via API: GET /repos/{owner}/{repo}/hooks/{hook_id}/deliveries)
Signature validation failures (a sign that the configured webhook secret no longer matches GitHub's, or that the request body is being altered before verification)
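A signature check for the last item in the list above could look like the following sketch. GitHub sends an X-Hub-Signature-256 header containing sha256= followed by the hex HMAC-SHA256 of the raw request body. The secret, the payload, and the function name isValidGitHubSignature are placeholders for illustration, and the code assumes a Node.js runtime.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify GitHub's X-Hub-Signature-256 header ("sha256=<hex digest>")
// against the raw, unparsed request body.
function isValidGitHubSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader) return false;
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const received = Buffer.from(signatureHeader, "utf8");
  const wanted = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === wanted.length && timingSafeEqual(received, wanted);
}

// Example with a made-up secret and payload.
const exampleSecret = "example-secret";
const examplePayload = '{"ref":"refs/heads/main"}';
const goodHeader =
  "sha256=" + createHmac("sha256", exampleSecret).update(examplePayload).digest("hex");
const accepted = isValidGitHubSignature(examplePayload, goodHeader, exampleSecret);
const rejected = isValidGitHubSignature(examplePayload + " ", goodHeader, exampleSecret);
```

Counting rejections in a separate metric makes a sudden run of failures visible. The second call shows why the raw body matters: a single changed byte, such as whitespace added by re-serializing parsed JSON, is enough to fail verification.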

Twilio Webhooks

Twilio webhooks are used for inbound SMS handling, call status events, and delivery notifications. Failures silently break two-way communication flows.

Twilio-specific risks:

Twilio requires your webhook endpoint to respond within 15 seconds or the delivery is marked as failed
Twilio webhooks use application/x-www-form-urlencoded encoding, not JSON — a common parsing bug when endpoints expect JSON
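The encoding point is easy to demonstrate. The sketch below decodes a form-encoded body with the standard URLSearchParams class instead of JSON.parse; parseFormEncodedBody is a hypothetical helper and the sample identifiers are invented.

```typescript
// Twilio posts application/x-www-form-urlencoded bodies, so JSON.parse
// would throw on them. URLSearchParams decodes the key/value pairs instead.
function parseFormEncodedBody(rawBody: string): Record<string, string> {
  const fields: Record<string, string> = {};
  new URLSearchParams(rawBody).forEach((value, key) => {
    fields[key] = value;
  });
  return fields;
}

// Sample status callback body with made-up identifiers.
const sampleBody = "MessageSid=SM123&MessageStatus=delivered&To=%2B15551234567";
const parsedFields = parseFormEncodedBody(sampleBody);
```

Most web frameworks ship a form-body parser that does the same job, so in a real handler the fix is usually to call that parser rather than the JSON one.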

Building a Webhook Monitoring Dashboard

Effective webhook monitoring needs visibility across three dimensions:

Delivery rate by provider and event type. You want to see: "Stripe payment_intent.succeeded events received in the last hour: 47". If this number is lower than your historical baseline for this time of day, something is wrong.

Schema validation failure rate. How often are incoming webhooks failing your schema validation? Track this as both a count and a rate (% of all webhooks received). An uptick is an early signal of schema drift.

Handler processing success rate. Are webhooks being received and passing schema validation but failing in business logic? Track a webhook.processed_successfully metric separately from webhook.received.

Most observability platforms (Datadog, Grafana, even simple logging) can visualize these metrics. The key is instrumenting your webhook handlers to emit them in the first place.
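As a sketch of how the second and third dimensions fall out of raw counters, the function below derives both rates from three numbers. The counter names follow the metrics used earlier in this article; computeRates and the sample figures are assumptions for illustration.

```typescript
interface WebhookCounters {
  received: number;              // webhook.received
  schemaMismatch: number;        // webhook.schema_mismatch
  processedSuccessfully: number; // webhook.processed_successfully
}

interface WebhookRates {
  schemaFailureRate: number;     // share of received events that failed validation
  processingSuccessRate: number; // share of validated events fully processed
}

function computeRates(c: WebhookCounters): WebhookRates {
  const validated = c.received - c.schemaMismatch;
  return {
    schemaFailureRate: c.received === 0 ? 0 : c.schemaMismatch / c.received,
    processingSuccessRate: validated <= 0 ? 0 : c.processedSuccessfully / validated,
  };
}

// Invented sample: 50 received, 5 failed validation, 36 fully processed.
const lastHour = computeRates({ received: 50, schemaMismatch: 5, processedSuccessfully: 36 });
```

Here 10% of events failed validation, and 80% of the remaining 45 were fully processed. Both rates report 0 when nothing was received, so they need to sit next to the delivery-rate panel rather than replace it.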

The Monitoring Stack That Catches Failures First

A complete webhook monitoring setup uses multiple layers:

Delivery relay monitoring (Rumbliq): receives every webhook, forwards it to your endpoint, logs forward status and latency, and alerts when an expected delivery goes silent
Schema validation in code: catches payload structure changes on first delivery
Delivery rate alerting: fires when inbound webhook volume drops below baseline
Provider dashboard monitoring: secondary confirmation when something seems off
Business metric monitoring: downstream signals (order fulfillment rate, notification delivery rate) that catch failures the above layers miss

None of these layers is sufficient alone. Together, they give you detection within minutes instead of discovering the problem when a customer asks why their payment didn't process.

FAQ

How do I know if my webhooks are being delivered?

Check three sources: your application logs (are events being received?), your webhook endpoint metrics (inbound request rate), and the provider's delivery dashboard (Stripe, GitHub, and Twilio all show webhook delivery history). If your application shows no received events but the provider shows successful delivery, the problem is in your infrastructure. If the provider shows delivery failures, the issue is your endpoint availability.

What should I do when webhook payloads change unexpectedly?

First, check the provider's changelog and API versioning documentation to understand if this is an intentional change. Then update your schema validation and handler code to accommodate the new payload structure. Add backward compatibility during transition periods if the old payload is still sometimes returned. Finally, add a test case to your integration tests that validates the new payload structure to prevent regression.

Should I return 200 even if I fail to process a webhook?

It depends. If processing failed due to a transient error (database unavailable, downstream service timeout), return a 5xx so the provider retries delivery. If processing failed due to a schema mismatch or validation error you want to investigate without retrying, returning 200 prevents duplicate processing while you fix the handler. The key is to always log the failure internally even when returning 200, and to alert on schema validation failures immediately.
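The rule above can be written down as a small mapping. The sketch below is one possible encoding of it; the FailureKind labels and the choice of 503 for transient errors are assumptions for this example, not provider requirements.

```typescript
type FailureKind = "none" | "transient" | "schema_mismatch";

// Map the outcome of handling a webhook to the HTTP status to return.
function statusForOutcome(kind: FailureKind): number {
  switch (kind) {
    case "transient":
      return 503; // any 5xx asks the provider to retry later
    case "schema_mismatch":
      return 200; // acknowledge, then log and alert internally
    case "none":
      return 200;
  }
}

const onDatabaseTimeout = statusForOutcome("transient");
const onUnexpectedPayload = statusForOutcome("schema_mismatch");
```

Keeping the decision in one function makes it harder for an individual handler to drift into returning 200 for everything.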

How often do webhook payload schemas change?

More often than you'd expect. Major providers like Stripe ship API updates dozens of times per year. Many webhook schema changes are considered "non-breaking" by the provider (adding new fields, new enum values) but can break tightly-coupled handler code. Schema drift monitoring on your webhook-related API endpoints, combined with schema validation in your webhook handlers, is the only reliable way to catch these changes quickly.

What's the best way to test webhook handling in development?

Use the provider's test webhook tools: Stripe's CLI can forward events to localhost (stripe listen --forward-to) and fire synthetic ones (stripe trigger), GitHub lets you redeliver past events, and Twilio has test credentials that send synthetic events. Write integration tests that fire webhook payloads at your handler endpoint directly. Use schema validation in your tests to verify your handler correctly processes both current and edge-case payloads. Update these tests whenever you update your schema expectations.

Rumbliq monitors your webhook endpoints and third-party API schemas, detecting delivery failures and payload drift before your customers do.