Webhook Reliability: Ensuring Your API Integrations Never Silently Break

Webhooks are fundamentally optimistic. A vendor sends a POST request to your endpoint, assumes it was received, and moves on. If your server was down, the payload got dropped. If your parsing code threw an error, the vendor may never know. If the webhook schema changed two weeks ago and your code is silently discarding the new field, your downstream logic has been broken the whole time.

This is what "silently break" means for webhook integrations: no exceptions thrown, no alerts fired, just quietly wrong behavior.

Building reliable webhook consumers requires layered defenses: delivery verification, idempotent processing, schema monitoring, and structured observability. This guide covers each layer.


Understanding Webhook Delivery Semantics

Before building reliability into your consumer, understand what delivery guarantee your vendor provides.

At-most-once delivery: The vendor fires once and doesn't retry automatically. If your endpoint returns 5xx, the event is lost until someone notices. GitHub webhooks work this way by default: failed deliveries are not retried automatically and must be redelivered manually.

At-least-once delivery: The vendor retries on failure until you return 2xx. You may receive the same event multiple times. Stripe (which retries with exponential backoff for up to three days), AWS SNS, and most enterprise vendors work this way.

Exactly-once delivery: Theoretically desirable, practically nonexistent in third-party webhooks. Don't assume it.

The implication: build for at-least-once. Use idempotency keys to handle duplicate delivery safely.
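The shape of that defense can be sketched in a few lines (InMemoryDeduper is illustrative only — a real consumer needs a durable store, as covered in Layer 3):

```typescript
// Minimal sketch of idempotent handling: process each event ID at most once.
// An in-memory Set is lost on restart and not shared across instances,
// so production code must persist the seen-set in a database instead.
class InMemoryDeduper {
  private seen = new Set<string>();

  // Returns true only the first time this event ID is seen.
  markProcessed(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

const deduper = new InMemoryDeduper();

function handleEvent(event: { id: string; type: string }): void {
  if (!deduper.markProcessed(event.id)) return; // duplicate delivery: skip
  // ...business logic runs at most once per event ID...
}
```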


Layer 1: Signature Verification

Every production webhook consumer must verify that payloads are genuinely from the claimed vendor. Without this, anyone who discovers your webhook URL can send arbitrary payloads.

Most vendors use HMAC-SHA256 signatures. The pattern is consistent across vendors:

import crypto from 'crypto';

function verifyStripeWebhook(
  payload: Buffer,
  signature: string,
  secret: string
): boolean {
  const elements = signature.split(',');
  const timestamp = elements.find(e => e.startsWith('t='))?.slice(2);
  const sig = elements.find(e => e.startsWith('v1='))?.slice(3);

  if (!timestamp || !sig) return false;

  // Reject events older than 5 minutes (replay attack protection)
  const age = Math.abs(Date.now() / 1000 - Number(timestamp));
  if (age > 300) return false;

  const expectedSig = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${payload.toString()}`)
    .digest('hex');

  // timingSafeEqual throws if the buffers differ in length, so guard first
  const sigBuf = Buffer.from(sig, 'hex');
  const expectedBuf = Buffer.from(expectedSig, 'hex');
  if (sigBuf.length !== expectedBuf.length) return false;

  return crypto.timingSafeEqual(sigBuf, expectedBuf);
}

One detail trips everyone up: the signature is computed over the raw request bytes, so the body must not be JSON-parsed before verification. In Express:

// Express: read raw body for signature verification
app.use('/webhooks', express.raw({ type: 'application/json' }));

app.post('/webhooks/stripe', (req, res) => {
  const sig = req.headers['stripe-signature'] as string;

  const secret = process.env.STRIPE_WEBHOOK_SECRET;
  if (!secret || !verifyStripeWebhook(req.body, sig, secret)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const event = JSON.parse(req.body.toString());
  // proceed with processing
});

Layer 2: Respond Fast, Process Async

Your webhook endpoint should return 2xx within a few seconds. Vendor timeouts vary (GitHub's is 10 seconds), and a timeout counts as a failed delivery, triggering a retry if the vendor retries at all. Don't do heavy processing synchronously.

app.post('/webhooks/stripe', async (req, res) => {
  // Verify signature synchronously
  if (!verifySignature(req)) {
    return res.status(401).end();
  }

  const event = JSON.parse(req.body.toString());

  // Enqueue before acking: if the queue is down, the 5xx below
  // makes the vendor retry instead of the event being lost
  try {
    await queue.add('process-stripe-event', { event });
  } catch (err) {
    return res.status(500).end();
  }

  // Ack only once the event is durably queued
  res.status(200).end();
});

Use a queue (BullMQ, SQS, Inngest, etc.) to decouple receipt from processing. This also gives you automatic retry on processing failures without the vendor ever knowing.
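As a sketch of what that retry configuration might look like with BullMQ (the attempt count and delays here are illustrative, not recommendations):

```typescript
// BullMQ job options sketch: let the queue retry processing failures with
// exponential backoff, independent of the vendor's delivery retries.
const webhookJobOptions = {
  attempts: 5,                                            // total tries per job
  backoff: { type: 'exponential' as const, delay: 1000 }, // 1s, 2s, 4s, ...
  removeOnComplete: true,                                 // keep the queue small
};

// Usage: await queue.add('process-stripe-event', { event }, webhookJobOptions);
```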


Layer 3: Idempotent Processing

With at-least-once delivery, you must handle duplicate events safely. The standard approach: track processed event IDs and skip duplicates.

async function processStripeEvent(event: Stripe.Event): Promise<void> {
  // Check if already processed
  const existing = await db.query(
    'SELECT id FROM processed_webhook_events WHERE event_id = $1',
    [event.id]
  );

  if (existing.rows.length > 0) {
    logger.info('Skipping duplicate webhook event', { eventId: event.id });
    return;
  }

  // Process the event
  await handleStripeEvent(event);

  // Mark as processed. Two concurrent deliveries can both pass the SELECT
  // above, so use the transactional variant below for critical operations.
  await db.query(
    `INSERT INTO processed_webhook_events (event_id, source)
     VALUES ($1, 'stripe')
     ON CONFLICT (event_id) DO NOTHING`,
    [event.id]
  );
}

For critical operations (e.g., provisioning a paid feature after a successful payment), wrap the business logic and the "mark processed" step in a single transaction:

await db.transaction(async (tx) => {
  // A unique-key violation on event_id aborts the whole transaction,
  // so a duplicate delivery can never re-run the business logic
  await tx.query(
    'INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)',
    [eventId, 'stripe']
  );
  await tx.query(
    'UPDATE subscriptions SET status = $1 WHERE customer_id = $2',
    ['active', customerId]
  );
});

Schema for tracking:

CREATE TABLE processed_webhook_events (
  event_id TEXT PRIMARY KEY,
  source TEXT NOT NULL,       -- 'stripe', 'github', etc.
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  payload JSONB               -- optional, for debugging
);

CREATE INDEX ON processed_webhook_events (processed_at);

Layer 4: Schema Monitoring

Here's the silent failure mode nobody talks about enough: your webhook consumer processes events without errors, but the schema changed and you're silently losing data.

Example: A vendor adds processing_fee as a new field to their payment event payload. Your code doesn't reference it, so no errors. But your billing reconciliation that relies on that data is now incomplete.

Or more dangerously: a field you do reference changed type — amount went from a number to a string with a currency suffix ("2000 USD"). Your code parses it, gets NaN on arithmetic, swallows the error, and your revenue calculations are wrong.
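That failure mode is easy to reproduce: JavaScript's Number coercion produces NaN rather than throwing, and NaN flows through arithmetic silently (parseAmount here is a hypothetical guard for illustration):

```typescript
// A type change from number to "2000 USD" fails silently under coercion.
const amount = Number("2000 USD");  // NaN, no exception
const fee = amount * 0.029;         // still NaN, still no exception
console.log(Number.isNaN(fee));     // true: the error surfaced nowhere

// A defensive parse makes the failure loud instead:
function parseAmount(raw: unknown): number {
  const n = typeof raw === 'number' ? raw : Number(raw);
  if (!Number.isFinite(n)) {
    throw new TypeError(`Expected numeric amount, got: ${JSON.stringify(raw)}`);
  }
  return n;
}
```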

Schema validation with Zod catches this at parse time:

import { z } from 'zod';

const StripePaymentIntentSchema = z.object({
  id: z.string(),
  object: z.literal('payment_intent'),
  amount: z.number(),
  currency: z.string(),
  status: z.enum(['requires_payment_method', 'processing', 'succeeded', 'canceled']),
  metadata: z.record(z.string()),
});

function parseStripePaymentIntent(raw: unknown) {
  const result = StripePaymentIntentSchema.safeParse(raw);
  if (!result.success) {
    // Vendor schema may have changed — alert on-call
    logger.error('Stripe payment_intent schema validation failed', {
      errors: result.error.issues,
      payload: raw,
    });
    throw new SchemaValidationError('Unexpected Stripe webhook schema');
  }
  return result.data;
}

Continuous endpoint monitoring catches schema changes before they reach your consumer. Use Rumbliq to monitor the equivalent REST endpoints for your webhooks:

Monitor: https://api.stripe.com/v1/payment_intents/{id}
Schedule: every 5 minutes
Alert: email + Slack when schema changes

When Rumbliq detects that the endpoint response structure has changed, you know to review your webhook consumer before the changed events start flowing in.


Layer 5: Observability

Webhook pipelines need structured observability to catch failures that slip through.

Log everything with correlation IDs:

app.post('/webhooks/:vendor', async (req, res) => {
  const correlationId = req.headers['x-request-id'] || crypto.randomUUID();
  const logger = baseLogger.child({
    correlationId,
    vendor: req.params.vendor,
  });

  logger.info('Webhook received');
  res.status(200).end();

  try {
    await processWebhook(req.body, logger);
    logger.info('Webhook processed successfully');
  } catch (err) {
    logger.error('Webhook processing failed', { error: err });
  }
});

Track key metrics:

- webhook.received.count (by vendor): traffic patterns, unexpected drops
- webhook.processing.duration: performance regressions
- webhook.schema_validation.failures: schema drift
- webhook.processing.errors (by error type): business logic failures
- webhook.duplicate.count: vendor retry storms
- webhook.dedup.age_ms: slow dedup lookups
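How these counters are emitted depends on your metrics stack. As a minimal in-process sketch (MetricCounters is hypothetical — in production use prom-client, StatsD, or your APM client):

```typescript
// Minimal in-process metric counters, keyed by metric name plus labels.
// Illustrative only: a real setup would export these to a metrics backend.
class MetricCounters {
  private counts = new Map<string, number>();

  private key(name: string, labels: Record<string, string>): string {
    const flat = Object.entries(labels).map(([k, v]) => `${k}=${v}`).join(',');
    return `${name}{${flat}}`;
  }

  inc(name: string, labels: Record<string, string> = {}): void {
    const k = this.key(name, labels);
    this.counts.set(k, (this.counts.get(k) ?? 0) + 1);
  }

  get(name: string, labels: Record<string, string> = {}): number {
    return this.counts.get(this.key(name, labels)) ?? 0;
  }
}

const metrics = new MetricCounters();
// At each pipeline stage:
metrics.inc('webhook.received.count', { vendor: 'stripe' });
metrics.inc('webhook.schema_validation.failures', { vendor: 'stripe' });
```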

Set alerts on:

- Any schema validation failure (webhook.schema_validation.failures > 0)
- A sustained drop in webhook.received.count for a vendor with normally steady traffic
- Processing error rate above your usual baseline
- A spike in duplicate events, which often means your endpoint is failing to ack


Layer 6: Handling Vendor Outages and Backlogs

Vendors sometimes batch-deliver missed events when your endpoint comes back online after downtime. Design your consumer to handle sudden event surges without saturating your database.

// Rate-limit webhook processing, not receipt
import { Queue, Worker } from 'bullmq';

const processingQueue = new Queue('webhooks');

// In BullMQ the rate limiter is a Worker option: at most 50 jobs
// per 1000 ms window, with 10 jobs in flight at any one time
const worker = new Worker('webhooks', processWebhookJob, {
  concurrency: 10,
  limiter: {
    max: 50,        // max 50 jobs...
    duration: 1000, // ...per 1000 ms
  },
});

Also implement an exponential backoff for your own retry logic if downstream dependencies (your database, your own APIs) are temporarily unavailable during backlog processing.
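A common way to compute those delays is exponential backoff with full jitter. This sketch uses illustrative base and cap values (backoffDelayMs and withRetry are hypothetical helpers, not a library API):

```typescript
// Exponential backoff with full jitter: delay doubles each attempt from a
// base, is capped, then a uniform random fraction of it is used so that
// concurrent retries spread out instead of arriving in waves.
function backoffDelayMs(
  attempt: number,                    // 0-based retry attempt
  baseMs = 500,
  capMs = 60_000,
  random: () => number = Math.random  // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exp);  // full jitter: uniform in [0, exp]
}

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of attempts: surface it
      await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```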


Webhook Reliability Checklist

Before going to production with a new webhook integration, verify:

- Signatures are verified on every request, with timing-safe comparison and replay protection
- The endpoint acks within the vendor's timeout and defers processing to a queue
- Duplicate events are handled idempotently, with critical writes in a single transaction
- Payloads are validated against an explicit schema, with alerts on validation failures
- Received, processed, and error metrics are emitted and alerted on
- The consumer survives a post-outage backlog without overwhelming downstream systems



Webhooks seem simple until they don't work. The teams that catch issues early are the ones who treat webhook reliability like any other distributed systems problem: with monitoring, schema validation, idempotency, and observability from day one.

Monitor your critical API endpoints with Rumbliq and get alerted when schema changes would affect your webhook consumers — before the events start flowing.