The Webhook Reliability Problem BaaS Providers Don't Talk About

Your BaaS provider just published their Q1 reliability report. 99.99% uptime. Four nines. The number looks great in the executive deck. It goes into your vendor risk assessment. Your SLA compliance metrics are green.

What the report doesn't mention: over the last 90 days, the same provider's payment webhook delivery had an 18% failure rate on a specific transaction type. Your reconciliation system shows clean numbers because the provider's dashboard only surfaces successful deliveries. The 847 webhooks that were silently dropped over four hours don't appear anywhere — not in your logs, not in your provider's incident tracker, not in your SLA report.

Your compliance reporting looks like it is built on correct transaction data. It isn't.

18% of webhook deliveries can silently fail — with no alert, no incident, and no impact on server uptime metrics

Why Webhook Reliability Is Never in the SLA

A BaaS provider's SLA covers infrastructure availability: are the servers up, is the API responding, is the dashboard accessible. This is a well-defined, well-measured metric with clear remediation paths when it's missed.

Webhook delivery is none of those things. It's a distributed systems problem: your endpoint is outside the provider's infrastructure, delivery requires network transit, and failure modes include everything the provider can't see or control. So they don't promise it.

What providers typically offer instead:

"Best effort" delivery — delivery is attempted, not guaranteed
A retry window — typically 24–72 hours, after which missed events are gone
Deduplication attempts — but deduplication logic varies wildly and is rarely documented
No delivery confirmation — the provider sends; whether it arrives is not their problem

When a webhook is silently dropped, the provider's systems show no error. Their uptime is fine. Their API responds correctly. The webhook just... didn't arrive. And because there's no alert for events that never arrived, your monitoring system shows nothing wrong.

Your BaaS provider's uptime SLA measures whether their servers respond. It says nothing about whether your endpoint received the payment notification that triggered your accounting entry, your fraud signal, or your compliance record.

How Webhooks Actually Fail

Webhook failures in embedded finance aren't a single failure mode — they're a family of them. Each is silent by default and catastrophic in aggregate.

Silent Drops

The provider sends the webhook. Your endpoint never receives it. This can happen due to network routing failures, incorrect endpoint configuration, firewall rules that silently discard incoming POSTs, or CDN-level packet drops that your infrastructure never logs.

You don't find out about these. Your provider sends. Your endpoint doesn't receive. Nobody has a record of the gap.

Retry Storm Deduplication

Most BaaS providers implement webhook retries: if the first delivery fails, retry in 1 minute, then 5 minutes, then 30 minutes, then 2 hours. This sounds robust. The problem is deduplication.

When a webhook finally arrives after multiple retries, your system may already have processed the event. If your deduplication logic is based on an event_id that isn't present in all event types — or that the provider changed in a minor version update — you process the event twice. Your accounting system records the same transaction twice. Reconciliation shows a discrepancy you can't explain.
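
One way to reduce this risk is to derive the deduplication key defensively instead of trusting a single field. The following is a minimal sketch in Python, not any provider's actual scheme: the payload fields `event_id`, `type`, and `transaction_id` are assumed names for illustration, and the in-memory set stands in for whatever durable store a real system would use.

```python
import hashlib
import json


def dedup_key(payload: dict) -> str:
    """Return a stable deduplication key for a webhook payload.

    Prefers the provider's event_id (a hypothetical field name). When it is
    absent, falls back to a hash of the fields assumed to identify the event.
    """
    event_id = payload.get("event_id")
    if event_id:
        return f"id:{event_id}"
    # Fallback: hash only the identifying fields, so unrelated payload
    # changes (extra fields, retry counters) do not alter the key.
    identity = {
        "type": payload.get("type"),
        "transaction_id": payload.get("transaction_id"),
    }
    digest = hashlib.sha256(
        json.dumps(identity, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f"hash:{digest}"


class IdempotentProcessor:
    """Processes each logical event at most once (in-memory, for illustration)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.processed: list[dict] = []

    def handle(self, payload: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        key = dedup_key(payload)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.processed.append(payload)
        return True
```

In production the "seen" check and the write would likely need to happen atomically, for example through a unique constraint in the database, so that two concurrent retries cannot both pass the check.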

Late Delivery Beyond Your SLA Window

Some webhook failures aren't immediate — they're delayed. A webhook that should arrive within 5 seconds instead arrives 3 hours later because the provider's queue was backed up and your endpoint was deprioritized during a traffic spike.

By the time the event arrives, your background job already ran. Your fraud model already cleared the transaction. Your reconciliation already closed the books. The late event has no handler, so it gets logged and dropped.

The transaction happened. Your records are wrong. Nobody knows.

Schema Changes Between Retry Attempts

Your deduplication key is the transaction_id. A webhook fires and fails. During the retry window, your BaaS provider pushes a schema update to their event payload — new field added, type changed, nested structure flattened. The retry arrives with a slightly different payload. Your deduplication check passes. Your event handler tries to process it and fails silently because a field it expects isn't there anymore.

The Compliance Gap Nobody Sees

Embedded finance platforms are subject to compliance requirements that depend on complete, accurate transaction data. AML monitoring, transaction screening, and regulatory reporting all require that every transaction event is recorded and reconciled.

When webhooks are silently dropped, the compliance system operates on incomplete data. The fraud model doesn't see the transaction. The AML system doesn't flag the activity. The regulatory report is missing the event. From the outside, the system looks healthy — no errors, no incidents, all metrics green.

Until a regulator asks for the complete transaction history for a specific account during a routine audit. Then you discover that 12% of your transaction events for the period are missing from your records. If the provider's API only surfaces current state, not the event log you needed, there's no way to reconstruct them after the fact.

The Reconciliation Problem

If your reconciliation process is webhook-driven — processing events as they arrive and reconciling against the provider's balance — you have a structural blind spot. Events that never arrived never reconciled. Your closing balance looks correct because you're only reconciling against what you received.

Proper reconciliation requires independent transaction discovery: querying the provider's ledger directly for a complete transaction list, then diffing against what your system processed. If your reconciliation only ever sees webhook-driven events, you're reconciling against an incomplete dataset and calling it accurate.

What Webhook Monitoring Actually Requires

The solution isn't more retries or better deduplication logic in your event handler. The solution is inbound webhook observability: knowing what arrived, what didn't, and why.

Event Log Reconciliation

Maintain an independent record of every event your system should have received — not just the ones that did arrive. This means periodically querying the provider's ledger for the complete event list, then diffing against your processed events. Events in the provider's log that aren't in your processed log are the webhooks that never arrived.

This is the only way to detect silent drops. Your monitoring system won't catch them — nothing arrives to trigger an alert. You need an independent source of truth.
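
The diff itself can be sketched in a few lines. This example assumes you have already collected two sets of identifiers: one fetched from the provider's ledger (how you fetch it depends on the provider and is not shown) and one from your own processed-events store. The names `reconcile`, `drop_rate`, and `ReconciliationResult` are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconciliationResult:
    missing: set[str]     # in the provider's ledger, never processed locally
    unexpected: set[str]  # processed locally, absent from the provider's ledger


def reconcile(ledger_ids: set[str], processed_ids: set[str]) -> ReconciliationResult:
    """Diff the provider's ledger against locally processed events."""
    return ReconciliationResult(
        missing=ledger_ids - processed_ids,
        unexpected=processed_ids - ledger_ids,
    )


def drop_rate(result: ReconciliationResult, ledger_ids: set[str]) -> float:
    """Fraction of ledger events that never reached local processing."""
    if not ledger_ids:
        return 0.0
    return len(result.missing) / len(ledger_ids)
```

The `missing` set is the list of candidate silent drops. The `unexpected` set is worth watching too, since events you processed that the ledger does not know about usually point to an identifier mismatch or a reconciliation window that is too narrow.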

Delivery Latency Tracking

Track the time between when an event occurs (from the provider's event timestamp) and when it arrives in your system. A webhook that arrives 3 hours late looks identical to one that arrived on time, unless you're measuring the gap.

Set thresholds on acceptable delivery latency. When P95 delivery latency exceeds your threshold, that's an alert — even if every event eventually arrives.
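
As a rough illustration of the P95 check, the sketch below computes delivery latency from pairs of timestamps and compares a nearest-rank percentile against a threshold. It assumes each event carries a provider-side occurrence timestamp and that you record your own receipt time; the function names are hypothetical, and a real deployment would more likely compute this in a metrics system than in application code.

```python
import math
from datetime import datetime, timedelta


def delivery_latencies(events: list[tuple[datetime, datetime]]) -> list[float]:
    """Convert (occurred_at, received_at) pairs into latencies in seconds."""
    return [(received - occurred).total_seconds() for occurred, received in events]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the data."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(
    events: list[tuple[datetime, datetime]],
    threshold: timedelta,
    pct: float = 95.0,
) -> bool:
    """Return True when the chosen percentile of delivery latency exceeds the threshold."""
    return percentile(delivery_latencies(events), pct) > threshold.total_seconds()
```

Clock skew between the provider and your servers will distort these numbers, so it is safer to treat the threshold as a trend signal than as an exact measurement.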

Payload Schema Validation

When a webhook arrives, validate it against your expected schema before processing. If the provider pushed a schema change, the validation failure is your signal — not a silent crash three layers deep in your event handler.

Schema validation catches the problem at the edge: you know immediately that a webhook arrived in an unexpected format, rather than discovering it hours later during reconciliation.
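
A minimal version of edge validation, using only the Python standard library, might look like the following. The field names and types in `EXPECTED_FIELDS` are assumptions made for illustration and do not describe any particular provider's payload. A dedicated schema library would be the more common choice in practice.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_FIELDS = {
    "event_id": str,
    "type": str,
    "transaction_id": str,
    "amount_minor": int,
    "occurred_at": str,
}


def validate_payload(payload: dict, expected: dict = EXPECTED_FIELDS) -> list[str]:
    """Return a list of schema problems; an empty list means the payload is valid."""
    errors = []
    for name, expected_type in expected.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
            continue
        value = payload[name]
        # Exact type match on purpose: isinstance() would accept True as an int.
        if type(value) is not expected_type:
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

A handler would call `validate_payload` before any business logic runs, and route payloads with a non-empty error list to an alert and a quarantine queue rather than discarding them.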

Retry and Deduplication Audit

Log every webhook attempt that reaches your infrastructure, not just the ones you processed successfully. If the provider retried an event five times before your system accepted it, you should see all five attempts in your logs. This gives you visibility into the provider's retry behavior and lets you identify patterns: which event types trigger retries, how often, with what latency between attempts.
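
To show what such an attempt log makes possible, here is a small sketch that groups logged attempts by event and summarizes retry behavior per event type. The `Attempt` record and its fields are hypothetical, and it can only contain attempts that actually reached your endpoint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    event_id: str
    event_type: str
    received_at: float  # Unix seconds when the request hit the endpoint
    accepted: bool      # False if the request was rejected or processing failed


def attempts_by_event(log: list[Attempt]) -> dict[str, list[Attempt]]:
    """Group attempts by event_id, each group ordered by arrival time."""
    grouped: dict[str, list[Attempt]] = defaultdict(list)
    for attempt in log:
        grouped[attempt.event_id].append(attempt)
    for attempts in grouped.values():
        attempts.sort(key=lambda a: a.received_at)
    return dict(grouped)


def retry_summary(log: list[Attempt]) -> dict[str, dict[str, int]]:
    """Per event type: event count, events that needed retries, and max attempts seen."""
    summary: dict[str, dict[str, int]] = {}
    for attempts in attempts_by_event(log).values():
        stats = summary.setdefault(
            attempts[0].event_type,
            {"events": 0, "retried": 0, "max_attempts": 0},
        )
        stats["events"] += 1
        if len(attempts) > 1:
            stats["retried"] += 1
        stats["max_attempts"] = max(stats["max_attempts"], len(attempts))
    return summary
```

A rising `retried` count for one event type is an early hint that something specific to that payload or handler is failing, well before it shows up in reconciliation.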

Can your team detect a 15% webhook drop before it hits reconciliation?

Most embedded finance teams have no visibility into webhook delivery reliability. Here's what to ask your team:

How would you know if 15% of your payment webhooks were silently dropped over the last 30 days?
What's your detection strategy for late-arriving webhooks that should have processed during a reconciliation window?
When a provider pushes a schema change, how do you find out before your event handler fails?
How do you reconstruct transaction history if a webhook was dropped and your system only stores processed events?
What's your process for validating that your deduplication logic covers all event types and provider schema versions?

Conduit provides continuous webhook monitoring for embedded finance teams — delivery confirmation, latency tracking, schema validation, and ledger-based event reconciliation. If you need production-grade webhook reliability monitoring, here's where to start.

See Conduit in action →

Webhook delivery reliability is the unmeasured gap in most embedded finance stacks. Your provider doesn't promise it because they can't control it. Your monitoring system doesn't catch it because nothing arrives to trigger an alert. And your compliance reporting shows clean numbers because the missing events are invisible.

The teams that get this right treat webhooks as a second data source — not the source of truth. They reconcile independently, measure delivery latency, validate incoming payloads, and catch the gaps before they compound into compliance problems.

Server uptime is the floor. Webhook reliability is where your actual risk lives.