From API Outage to 99.99% Uptime: Building a Monitoring Stack with Rumbliq
In Q3 2024, a developer platform team had three API-related production incidents in six weeks. None of them were caused by their own code failing. All three were caused by silent schema changes in third-party APIs they depended on — changes that broke code that had been running correctly for months.
This is the story of how they went from that state to 99.99% uptime on their external integrations, and what the monitoring stack looks like.
The Incident Pattern
The three incidents followed the same pattern:
- A third-party API silently changes something in its response
- The team's application code, which was written against the old schema, starts behaving incorrectly
- Users notice the problem — checkout fails, notifications aren't sent, data displays incorrectly
- Engineering gets paged; incident response begins
- Root cause analysis eventually traces back to a third-party API change
- Code is updated; incident is closed
- Repeat in 2-3 weeks
Each incident took 3-6 hours from first alert to resolution. The longest took 9 hours because the schema change had occurred 4 days earlier and nobody noticed until a specific edge case triggered it.
The post-mortems all said the same thing: "We need better visibility into third-party API changes." But nobody had a concrete implementation plan.
What the Old Stack Was Missing
The team already had solid observability:
- Datadog: Infrastructure metrics, APM traces, uptime checks
- Sentry: Application error tracking with stack traces
- PagerDuty: On-call rotation and escalation policies
- StatusPage: Customer-facing incident communication
This is a legitimately good observability stack. It did exactly what it was designed to do.
What none of it did: monitor whether the shape of third-party API responses was changing.
Datadog's uptime checks verify that an endpoint returns 200. They don't check whether the JSON response contains the fields your code expects. Sentry captures errors after they happen. Neither tool watches a third-party API endpoint and tells you when its schema drifts.
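To make that gap concrete, here is a minimal sketch in Python, purely illustrative rather than anything from Datadog or Rumbliq, of the difference between an uptime-style check and a schema-style check. The endpoint URL and the expected field names are hypothetical.

```python
import requests

URL = "https://api.example-vendor.com/v1/payments/ch_123"  # hypothetical endpoint
EXPECTED_FIELDS = {"id", "amount", "currency", "payment_method_details"}  # fields our code reads

resp = requests.get(URL, timeout=10)

# Uptime-style check: passes as long as the endpoint answers successfully.
uptime_ok = resp.status_code == 200
print("uptime check:", "PASS" if uptime_ok else "FAIL")

# Schema-style check: fails if any field our code depends on has disappeared,
# even though the endpoint still returns 200.
body = resp.json()
missing = EXPECTED_FIELDS - body.keys()
if missing:
    print(f"schema check: FAIL (missing fields: {missing})")
else:
    print("schema check: PASS")
```

The first check is what a synthetic monitor typically asserts; the second is the kind of structural assertion that was missing from the stack.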
The Missing Layer: Schema Drift Monitoring
The team added Rumbliq as the fourth layer of their monitoring stack, specifically focused on third-party API schema integrity.
The conceptual model:
Layer 1 — Infrastructure (Datadog)
↓ catches: service failures, latency spikes, resource exhaustion
Layer 2 — Application (Sentry)
↓ catches: runtime exceptions, unhandled errors after they occur
Layer 3 — User Experience (StatusPage + synthetic monitoring)
↓ catches: user-visible failures after they occur
Layer 4 — Schema Drift (Rumbliq) ← NEW
↓ catches: third-party API schema changes BEFORE they cause failures
The key distinction of Layer 4: it operates before failures occur. Layers 1-3 are reactive. Layer 4 is proactive.
Building the Schema Monitoring Layer
Endpoint Inventory
The first step was a complete inventory of third-party API endpoints their application code read from. This alone was revealing — the team identified 34 distinct third-party endpoints across 8 services, many of which had never been explicitly documented.
- Payment flows: 8 Stripe endpoints
- Auth flows: 6 Auth0 endpoints
- Data sync: 7 Salesforce endpoints
- Notifications: 5 Twilio + 3 SendGrid endpoints
- Internal tooling: 5 GitHub API endpoints
Every one of these was a potential source of silent schema drift.
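Keeping that inventory machine-readable makes it auditable and lets it drive monitoring configuration later. A minimal sketch of one hypothetical format in Python follows; the endpoint paths and criticality labels are illustrative, not the team's actual manifest.

```python
# Hypothetical third-party endpoint inventory; values are illustrative.
THIRD_PARTY_ENDPOINTS = [
    {"flow": "payments",      "vendor": "Stripe",     "endpoint": "/v1/charges",                        "criticality": "high"},
    {"flow": "auth",          "vendor": "Auth0",      "endpoint": "/api/v2/users/{id}",                 "criticality": "high"},
    {"flow": "data-sync",     "vendor": "Salesforce", "endpoint": "/services/data/v59.0/sobjects/Account", "criticality": "medium"},
    {"flow": "notifications", "vendor": "Twilio",     "endpoint": "/2010-04-01/Accounts/{sid}/Messages.json", "criticality": "medium"},
    # ... remaining endpoints omitted for brevity
]

# Sanity check: no duplicate vendor/endpoint pairs sneak into the inventory.
pairs = {(e["vendor"], e["endpoint"]) for e in THIRD_PARTY_ENDPOINTS}
assert len(pairs) == len(THIRD_PARTY_ENDPOINTS)
```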
Baseline Capture
For each endpoint, Rumbliq captured a baseline schema — a recursive representation of the JSON structure: field names, types, nesting, and optionality. This baseline is the "expected state" that all future responses are compared against.
The baseline capture happens automatically on the first check. The team verified each baseline against their existing API integration code to confirm Rumbliq had captured the correct response structure.
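Rumbliq's internal representation isn't documented here, but the idea of a baseline can be sketched in a few lines of Python: walk a sample response recursively and record each field's name, type, and nesting. The sample payload below is hypothetical.

```python
def capture_schema(value):
    """Recursively derive a structural schema from a sample JSON value."""
    if isinstance(value, dict):
        return {key: capture_schema(val) for key, val in value.items()}
    if isinstance(value, list):
        # Describe a list by the schema of its first element (empty lists stay opaque).
        return [capture_schema(value[0])] if value else ["unknown"]
    return type(value).__name__  # e.g. "str", "int", "bool", "NoneType"

# Hypothetical sample response from a user-profile endpoint.
sample = {
    "user_id": "auth0|abc123",
    "email_verified": True,
    "user_metadata": {"plan": "pro", "seats": 5},
}

baseline = capture_schema(sample)
# {'user_id': 'str', 'email_verified': 'bool',
#  'user_metadata': {'plan': 'str', 'seats': 'int'}}
```

Every subsequent response can then be reduced to the same shape and compared against this baseline.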
Alert Routing
The team mapped their existing PagerDuty services and Slack channels to Rumbliq alert destinations:
- Removals and type changes → PagerDuty api-integrations service (pages the on-call engineer)
- Additions → #api-drift-low-priority Slack channel (informational, no page)
This routing prevented alert fatigue while ensuring breaking changes got immediate attention.
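Conceptually, the routing is just a map from change type to destination. Here is a sketch of what that mapping might look like if you wired it up yourself; the service and channel names match the ones above, but the function and change-type labels are hypothetical rather than Rumbliq's API.

```python
# Change types a drift detector might report, grouped by severity.
BREAKING = {"field_removed", "type_changed"}
ADDITIVE = {"field_added"}

ROUTES = {
    "breaking": {"target": "pagerduty", "service": "api-integrations"},      # pages on-call
    "additive": {"target": "slack", "channel": "#api-drift-low-priority"},   # no page
}

def route_alert(change_type: str) -> dict:
    """Pick a destination for a detected schema change."""
    severity = "breaking" if change_type in BREAKING else "additive"
    return ROUTES[severity]

print(route_alert("type_changed"))  # -> PagerDuty api-integrations
print(route_alert("field_added"))   # -> Slack #api-drift-low-priority
```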
What Happened in the First 30 Days
The team expected a quiet period: they had just dealt with three incidents, so presumably the third-party APIs were stable now.
Instead, Rumbliq detected 7 schema changes in the first 30 days:
| Day | Vendor | Change | Severity |
|---|---|---|---|
| 3 | Auth0 | New optional field last_password_reset added to user profile | Low |
| 8 | Stripe | payment_method_details object gained a new klarna sub-object | Low |
| 11 | Salesforce | Account.Industry enum expanded with 2 new values | Medium |
| 17 | Twilio | price field type changed from string to number on message responses | Critical |
| 22 | Auth0 | user_metadata nesting changed for custom attribute storage | Critical |
| 26 | SendGrid | message_id field renamed to x-message-id in response headers | Critical |
| 29 | GitHub | New visibility field added to repository webhook events | Low |
Three of those 7 changes were critical — meaning they would have caused application errors if deployed code had encountered them. Two (Twilio type change, SendGrid rename) directly matched the pattern of their previous incidents.
None of the three critical changes caused an incident, because Rumbliq caught them before they hit production code.
The Reliability Impact
The team tracked a simple metric: undetected third-party API schema changes that caused production incidents.
- Q3 2024 (before Rumbliq): 3 incidents
- Q4 2024 (first 30 days with Rumbliq): 0 incidents
- Q1 2025 (full quarter with Rumbliq): 0 incidents
That's a 100% reduction in API-drift-related incidents in the first quarter alone.
The team's overall uptime on external API integrations moved from approximately 99.94% (accounting for the three incidents) to 99.99%, a sixfold reduction in downtime attributable to external integrations.
The Operational Workflow
The team standardized a new workflow for handling schema drift alerts:
When a critical alert fires (field removal, type change):
- On-call engineer acknowledges in PagerDuty
- Reviews the Rumbliq diff to understand exactly what changed
- Searches the codebase for all usages of the changed field
- Creates a ticket with the impact assessment
- Team prioritizes based on whether any code has been deployed that reads the changed field
- If no code changes deployed yet: schedule update in next sprint
- If changed code deployed: incident response → hotfix → deploy
When a low-priority alert fires (field addition):
- Alert lands in #api-drift-low-priority
- Engineer reviews during normal working hours
- Updates integration code if needed (rarely required for pure additions)
- Notes the change in the vendor's changelog for context
This workflow is lightweight for the common case (additive changes) and clear for the dangerous case (removals/type changes). The on-call engineer doesn't have to figure out what to do — the playbook is established.
What This Stack Looks Like Today
The team's full observability picture:
```yaml
# Third-party API monitoring stack

Uptime monitoring:
  tool: Datadog Synthetics
  checks: HTTP status codes, response time SLOs
  interval: 1 minute

Schema drift monitoring:
  tool: Rumbliq
  checks: JSON response schema against baseline
  interval: 5 minutes (critical APIs), 15 minutes (lower-priority)
  alert routing:
    critical: PagerDuty api-integrations service
    informational: "Slack #api-drift-low-priority"

Error tracking:
  tool: Sentry
  role: catch application errors AFTER they occur (backstop)

Incident communication:
  tool: StatusPage
  role: customer-facing incident status
```
Each tool has a distinct role. The addition of Rumbliq didn't replace any of the existing tools — it covered a gap that none of them addressed.
Lessons Learned
Schema drift is common. Before adding monitoring, the team assumed third-party API changes were rare events. Seven changes in 30 days, across six well-known enterprise APIs, was surprising. Most were additive (safe), but the frequency of critical changes was higher than anyone expected.
Silent changes are the dangerous ones. Every critical change Rumbliq caught was silent — not in the vendor's changelog, not announced via email, not visible in status pages. These are exactly the changes that cause incidents.
The diff is what matters. Knowing that "Stripe changed" is not actionable. Knowing that payment_method_details.type changed from string to enum with these specific values is immediately actionable. The field-level diff is what converts an alert from noise into a task.
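To illustrate, a field-level diff can be computed by comparing two baselines of the kind sketched earlier. The following is a minimal illustration, not Rumbliq's actual diff format, and the field names are hypothetical.

```python
def diff_schema(old, new, path=""):
    """Yield (kind, path, detail) tuples describing field-level drift between two schemas."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old.keys() | new.keys():
            child = f"{path}.{key}" if path else key
            if key not in new:
                yield ("removed", child, old[key])
            elif key not in old:
                yield ("added", child, new[key])
            else:
                yield from diff_schema(old[key], new[key], child)
    elif old != new:
        yield ("type_changed", path, f"{old} -> {new}")

old = {"price": "str", "status": "str"}
new = {"price": "float", "status": "str", "price_unit": "str"}

for change in diff_schema(old, new):
    print(change)
# e.g. ('type_changed', 'price', 'str -> float') and ('added', 'price_unit', 'str')
# (order may vary)
```

The output names the exact field, the kind of change, and the before/after shape, which is what turns an alert into a concrete task.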
Proactive monitoring changes incident response culture. When the team knows they'll catch schema changes before they cause incidents, they spend less time firefighting and more time on planned work. That cultural shift is harder to measure than uptime numbers, but it's real.
Getting Started
If your team has had API-related incidents that traced back to third-party changes, the path to eliminating them is:
- Inventory your third-party API endpoints — all of them
- Set up schema monitoring with Rumbliq (free tier covers up to 25 endpoints)
- Establish alert routing to your existing on-call stack
- Build a lightweight workflow for handling drift alerts
The full setup, for a team with 20-30 third-party endpoints, takes a day. The maintenance overhead is near zero. And the next time a vendor silently changes their API schema, you'll know about it in minutes instead of hours — or never.
Start monitoring your APIs free at rumbliq.com — no credit card required
Related reading:
- What is API schema drift? — the foundational concept behind the monitoring layer described in this post
- How to detect breaking API changes automatically — technical implementation of schema drift detection
- API observability guide 2026 — how schema monitoring fits into the broader observability picture
- API alerting best practices — configuring severity-based routing to avoid alert fatigue
- Monitoring 50+ microservice APIs with Rumbliq — scaling schema monitoring across enterprise API portfolios
- How we caught a breaking Stripe API change before it hit production — a real-world example of the monitoring pattern described here