From API Outage to 99.99% Uptime: Building a Monitoring Stack with Rumbliq
In Q3 2024, a developer platform team had three API-related production incidents in six weeks. None of them were caused by their own code failing. All three were caused by silent schema changes in third-party APIs they depended on — changes that broke code that had been running correctly for months.
This is the story of how they went from that state to 99.99% uptime on their external integrations, and what the monitoring stack looks like.
The Incident Pattern
The three incidents followed the same pattern:
- A third-party API silently changes something in its response
- The team's application code, which was written against the old schema, starts behaving incorrectly
- Users notice the problem — checkout fails, notifications aren't sent, data displays incorrectly
- Engineering gets paged; incident response begins
- Root cause analysis eventually traces back to a third-party API change
- Code is updated; incident is closed
- Repeat in 2-3 weeks
Each incident took 3-6 hours from first alert to resolution. The longest took 9 hours because the schema change had occurred 4 days earlier and nobody noticed until a specific edge case triggered it.
The post-mortems all said the same thing: "We need better visibility into third-party API changes." But nobody had a concrete implementation plan.
What the Old Stack Was Missing
The team already had solid observability:
- Datadog: Infrastructure metrics, APM traces, uptime checks
- Sentry: Application error tracking with stack traces
- PagerDuty: On-call rotation and escalation policies
- StatusPage: Customer-facing incident communication
This is a legitimately good observability stack. It did exactly what it was designed to do.
What none of it did: monitor whether the shape of third-party API responses was changing.
Datadog's uptime checks verify that an endpoint returns 200. They don't check whether the JSON response contains the fields your code expects. Sentry captures errors after they happen. Neither tool watches a third-party API endpoint and tells you when its schema drifts.
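To make that gap concrete, here is a minimal sketch in Python, purely illustrative rather than anything from Datadog or Rumbliq, of the difference between an uptime-style check and a schema-style check. The endpoint URL and the expected field names are hypothetical.

```python
import requests

URL = "https://api.example-vendor.com/v1/payments/ch_123"  # hypothetical endpoint
EXPECTED_FIELDS = {"id", "amount", "currency", "payment_method_details"}  # fields our code reads

resp = requests.get(URL, timeout=10)

# Uptime-style check: passes as long as the endpoint answers successfully.
uptime_ok = resp.status_code == 200
print("uptime check:", "PASS" if uptime_ok else "FAIL")

# Schema-style check: fails if any field our code depends on has disappeared,
# even though the endpoint still returns 200.
body = resp.json()
missing = EXPECTED_FIELDS - body.keys()
if missing:
    print(f"schema check: FAIL (missing fields: {missing})")
else:
    print("schema check: PASS")
```

The first check is what a synthetic monitor typically asserts; the second is the kind of structural assertion that was missing from the stack.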
The Missing Layer: Schema Drift Monitoring
The team added Rumbliq as the fourth layer of their monitoring stack, specifically focused on third-party API schema integrity.
The conceptual model:
Layer 1 — Infrastructure (Datadog)
↓ catches: service failures, latency spikes, resource exhaustion
Layer 2 — Application (Sentry)
↓ catches: runtime exceptions, unhandled errors after they occur
Layer 3 — User Experience (StatusPage + synthetic monitoring)
↓ catches: user-visible failures after they occur
Layer 4 — Schema Drift (Rumbliq) ← NEW
↓ catches: third-party API schema changes BEFORE they cause failures
The key distinction of Layer 4: it operates before failures occur. Layers 1-3 are reactive. Layer 4 is proactive.
Building the Schema Monitoring Layer
Endpoint Inventory
The first step was a complete inventory of third-party API endpoints their application code read from. This alone was revealing — the team identified 34 distinct third-party endpoints across 8 services, many of which had never been explicitly documented.
- Payment flows: 8 Stripe endpoints
- Auth flows: 6 Auth0 endpoints
- Data sync: 7 Salesforce endpoints
- Notifications: 5 Twilio + 3 SendGrid endpoints
- Internal tooling: 5 GitHub API endpoints
Every one of these was a potential source of silent schema drift.
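Keeping that inventory machine-readable makes it auditable and lets it drive monitoring configuration later. A minimal sketch of one hypothetical format in Python follows; the endpoint paths and criticality labels are illustrative, not the team's actual manifest.

```python
# Hypothetical third-party endpoint inventory; values are illustrative.
THIRD_PARTY_ENDPOINTS = [
    {"flow": "payments",      "vendor": "Stripe",     "endpoint": "/v1/charges",                        "criticality": "high"},
    {"flow": "auth",          "vendor": "Auth0",      "endpoint": "/api/v2/users/{id}",                 "criticality": "high"},
    {"flow": "data-sync",     "vendor": "Salesforce", "endpoint": "/services/data/v59.0/sobjects/Account", "criticality": "medium"},
    {"flow": "notifications", "vendor": "Twilio",     "endpoint": "/2010-04-01/Accounts/{sid}/Messages.json", "criticality": "medium"},
    # ... remaining endpoints omitted for brevity
]

# Sanity check: no duplicate vendor/endpoint pairs sneak into the inventory.
pairs = {(e["vendor"], e["endpoint"]) for e in THIRD_PARTY_ENDPOINTS}
assert len(pairs) == len(THIRD_PARTY_ENDPOINTS)
```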
Baseline Capture
For each endpoint, Rumbliq captured a baseline schema — a recursive representation of the JSON structure: field names, types, nesting, and optionality. This baseline is the "expected state" that all future responses are compared against.
The baseline capture happens automatically on the first check. The team verified each baseline against their existing API integration code to confirm Rumbliq had captured the correct response structure.
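Rumbliq's internal representation isn't documented here, but the idea of a baseline can be sketched in a few lines of Python: walk a sample response recursively and record each field's name, type, and nesting. The sample payload below is hypothetical.

```python
def capture_schema(value):
    """Recursively derive a structural schema from a sample JSON value."""
    if isinstance(value, dict):
        return {key: capture_schema(val) for key, val in value.items()}
    if isinstance(value, list):
        # Describe a list by the schema of its first element (empty lists stay opaque).
        return [capture_schema(value[0])] if value else ["unknown"]
    return type(value).__name__  # e.g. "str", "int", "bool", "NoneType"

# Hypothetical sample response from a user-profile endpoint.
sample = {
    "user_id": "auth0|abc123",
    "email_verified": True,
    "user_metadata": {"plan": "pro", "seats": 5},
}

baseline = capture_schema(sample)
# {'user_id': 'str', 'email_verified': 'bool',
#  'user_metadata': {'plan': 'str', 'seats': 'int'}}
```

Every subsequent response can then be reduced to the same shape and compared against this baseline.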
Alert Routing
The team mapped their existing PagerDuty services and Slack channels to Rumbliq alert destinations:
- Removals and type changes → PagerDuty api-integrations service (pages the on-call engineer)
- Additions → #api-drift-low-priority Slack channel (informational, no page)
This routing prevented alert fatigue while ensuring breaking changes got immediate attention.
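Conceptually, the routing is just a map from change type to destination. Here is a sketch of what that mapping might look like if you wired it up yourself; the service and channel names match the ones above, but the function and change-type labels are hypothetical rather than Rumbliq's API.

```python
# Change types a drift detector might report, grouped by severity.
BREAKING = {"field_removed", "type_changed"}
ADDITIVE = {"field_added"}

ROUTES = {
    "breaking": {"target": "pagerduty", "service": "api-integrations"},      # pages on-call
    "additive": {"target": "slack", "channel": "#api-drift-low-priority"},   # no page
}

def route_alert(change_type: str) -> dict:
    """Pick a destination for a detected schema change."""
    severity = "breaking" if change_type in BREAKING else "additive"
    return ROUTES[severity]

print(route_alert("type_changed"))  # -> PagerDuty api-integrations
print(route_alert("field_added"))   # -> Slack #api-drift-low-priority
```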
What Happened in the First 30 Days
The team expected a quiet period: they had just dealt with three incidents, so presumably the third-party APIs were stable now.
Instead, Rumbliq detected 7 schema changes in the first 30 days:
| Day | Vendor | Change | Severity |
|---|---|---|---|
| 3 | Auth0 | New optional field last_password_reset added to user profile | Low |
| 8 | Stripe | payment_method_details object gained a new klarna sub-object | Low |
| 11 | Salesforce | Account.Industry enum expanded with 2 new values | Medium |
| 17 | Twilio | price field type changed from string to number on message responses | Critical |
| 22 | Auth0 | user_metadata nesting changed for custom attribute storage | Critical |
| 26 | SendGrid | message_id field renamed to x-message-id in response headers | Critical |
| 29 | GitHub | New visibility field added to repository webhook events | Low |
Three of those 7 changes were critical — meaning they would have caused application errors if deployed code had encountered them. Two (Twilio type change, SendGrid rename) directly matched the pattern of their previous incidents.
None of the three critical changes caused an incident, because Rumbliq caught them before they hit production code.
The Reliability Impact
The team tracked a simple metric: undetected third-party API schema changes that caused production incidents.
- Q3 2024 (before Rumbliq): 3 incidents
- Q4 2024 (first 30 days with Rumbliq): 0 incidents
- Q1 2025 (full quarter with Rumbliq): 0 incidents
That's a 100% reduction in API-drift-related incidents in the first quarter alone.
The team's overall uptime on external API integrations moved from approximately 99.94% (accounting for the three incidents) to 99.99%, a sixfold reduction in downtime attributable to external integrations.
The Operational Workflow
The team standardized a new workflow for handling schema drift alerts:
When a critical alert fires (field removal, type change):
- On-call engineer acknowledges in PagerDuty
- Reviews the Rumbliq diff to understand exactly what changed
- Searches the codebase for all usages of the changed field
- Creates a ticket with the impact assessment
- Team prioritizes based on whether any code has been deployed that reads the changed field
- If no code changes deployed yet: schedule update in next sprint
- If changed code deployed: incident response → hotfix → deploy
When a low-priority alert fires (field addition):
- Alert lands in #api-drift-low-priority
- Engineer reviews during normal working hours
- Updates integration code if needed (rarely required for pure additions)
- Notes the change in the vendor's changelog for context
This workflow is lightweight for the common case (additive changes) and clear for the dangerous case (removals/type changes). The on-call engineer doesn't have to figure out what to do — the playbook is established.
What This Stack Looks Like Today
The team's full observability picture:
```yaml
# Third-party API monitoring stack

Uptime monitoring:
  tool: Datadog Synthetics
  checks: HTTP status codes, response time SLOs
  interval: 1 minute

Schema drift monitoring:
  tool: Rumbliq
  checks: JSON response schema against baseline
  interval: 5 minutes (critical APIs), 15 minutes (lower-priority)
  alert routing:
    critical: PagerDuty api-integrations service
    informational: "Slack #api-drift-low-priority"

Error tracking:
  tool: Sentry
  role: catch application errors AFTER they occur (backstop)

Incident communication:
  tool: StatusPage
  role: customer-facing incident status
```
Each tool has a distinct role. The addition of Rumbliq didn't replace any of the existing tools — it covered a gap that none of them addressed.
Lessons Learned
Schema drift is common. Before adding monitoring, the team assumed third-party API changes were rare events. Seven changes in 30 days, across six well-known enterprise APIs, was surprising. Most were additive (safe), but the frequency of critical changes was higher than anyone expected.
Silent changes are the dangerous ones. Every critical change Rumbliq caught was silent — not in the vendor's changelog, not announced via email, not visible in status pages. These are exactly the changes that cause incidents.
The diff is what matters. Knowing that "Stripe changed" is not actionable. Knowing that payment_method_details.type changed from string to enum with these specific values is immediately actionable. The field-level diff is what converts an alert from noise into a task.
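To illustrate, a field-level diff can be computed by comparing two baselines of the kind sketched earlier. The following is a minimal illustration, not Rumbliq's actual diff format, and the field names are hypothetical.

```python
def diff_schema(old, new, path=""):
    """Yield (kind, path, detail) tuples describing field-level drift between two schemas."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old.keys() | new.keys():
            child = f"{path}.{key}" if path else key
            if key not in new:
                yield ("removed", child, old[key])
            elif key not in old:
                yield ("added", child, new[key])
            else:
                yield from diff_schema(old[key], new[key], child)
    elif old != new:
        yield ("type_changed", path, f"{old} -> {new}")

old = {"price": "str", "status": "str"}
new = {"price": "float", "status": "str", "price_unit": "str"}

for change in diff_schema(old, new):
    print(change)
# e.g. ('type_changed', 'price', 'str -> float') and ('added', 'price_unit', 'str')
# (order may vary)
```

The output names the exact field, the kind of change, and the before/after shape, which is what turns an alert into a concrete task.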
Proactive monitoring changes incident response culture. When the team knows they'll catch schema changes before they cause incidents, they spend less time firefighting and more time on planned work. That cultural shift is harder to measure than uptime numbers, but it's real.
Getting Started
If your team has had API-related incidents that traced back to third-party changes, the path to eliminating them is:
- Inventory your third-party API endpoints — all of them
- Set up schema monitoring with Rumbliq (free tier covers up to 25 endpoints)
- Establish alert routing to your existing on-call stack
- Build a lightweight workflow for handling drift alerts
The full setup, for a team with 20-30 third-party endpoints, takes a day. The maintenance overhead is near zero. And the next time a vendor silently changes their API schema, you'll know about it in minutes instead of hours — or never.
Start monitoring your APIs free at rumbliq.com — no credit card required
Related reading:
- What is API schema drift? — the foundational concept behind the monitoring layer described in this post
- How to detect breaking API changes automatically — technical implementation of schema drift detection
- API observability guide 2026 — how schema monitoring fits into the broader observability picture
- API alerting best practices — configuring severity-based routing to avoid alert fatigue
- Monitoring 50+ microservice APIs with Rumbliq — scaling schema monitoring across enterprise API portfolios
- How we caught a breaking Stripe API change before it hit production — a real-world example of the monitoring pattern described here