The hardest category of L3 problem isn't the broken integration — it's the almost working integration. The Shopify-to-QuickBooks sync that handles 95% of orders correctly. The Mailchimp-to-Pipedrive flow that mostly captures leads. The Zapier automation that fires reliably 9 times out of 10.
The problem with "almost working": you have it. You depend on it. You assume it's reliable. Six months later, your data is silently inconsistent — hundreds of missing records, wrong values in random fields, edge cases that have been quietly accumulating since day one. The cleanup is expensive. The trust in the integration is broken. Replacing it requires migration. Living with it requires constant manual reconciliation.
This article is about why these integrations persist in their broken state, how to identify the failure pattern, and the four-step fix process.
Why "almost working" is worse than "totally broken"
A totally broken integration triggers immediate response. Sync fails on day one, you fix it day two, life goes on.
An almost-working integration triggers nothing for a long time. Sync works for the first 50 orders flawlessly. Then on order 51, an edge case fires — a customer with an apostrophe in their name, an order with a discount code that exceeds the order total, a refund that's being processed simultaneously with an inventory update. The integration silently drops the data. Or syncs it incorrectly. No error is logged. No alert is sent.
By order 500, you have 10 silent failures. By order 5,000, you have 100. By the time you notice (because someone in finance is doing a reconciliation), the historical fix is a multi-day forensic project.
The "almost" is the trap. It looks reliable enough that you stop monitoring it, and a failure rate of a few percent quietly compounds while you're not looking.
The five common failure patterns
In rough order of how often I see them:
1. Edge case data that doesn't fit the integration's schema
The integration was built for the common case. When a record has unusual data — extra-long text fields, special characters, unusual currency formats, missing required fields — the integration fails silently or corrupts the data.
Example: Shopify-to-Xero sync handles standard product names fine. A product with an emoji in the name (now common with marketing-driven naming) fails because Xero's API rejects the character. Shopify shows "synced," Xero never receives it.
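If you end up handling this in a middleware layer (more on that below), the fix is usually a small transform before the destination call. A minimal sketch in TypeScript; the cleanup rules and the fallback value are illustrative assumptions, not Xero's actual requirements:

```typescript
// Minimal sketch: strip supplementary-plane characters (which covers most
// emoji) from a product name before forwarding it to an API that rejects them.
// The fallback value is an illustrative assumption, not Xero's behaviour.
function sanitizeProductName(name: string): string {
  const cleaned = name
    .replace(/[\u{10000}-\u{10FFFF}]/gu, "") // drop emoji and other supplementary-plane chars
    .replace(/\s{2,}/g, " ")                 // collapse the double spaces that leaves behind
    .trim();
  // Never forward an empty name; better an ugly placeholder than a silent drop.
  return cleaned.length > 0 ? cleaned : "Unnamed product";
}

// Example: sanitizeProductName("Summer Sale 🔥 Hoodie") === "Summer Sale Hoodie"
```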
2. Race conditions and timing issues
Two events happen simultaneously and the integration handles them in the wrong order.
Example: a customer places an order; webhook fires to sync it; before the sync completes, the customer cancels the order; webhook fires to sync the cancellation; cancellation arrives at the destination before the original order; cancellation fails because the order doesn't exist yet. Result: order exists in destination, cancellation never propagates.
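One hedge against this, if you control a middleware layer, is to treat "dependent record not found yet" as "try again shortly" rather than as a failure. A rough sketch; the lookup, cancel, and re-queue functions are placeholders for whatever your destination client and queue actually provide:

```typescript
// Sketch: defer a cancellation that arrives before the order it refers to,
// instead of failing and losing it. The three callbacks are placeholders.
type CancelDeps = {
  findOrder: (orderId: string) => Promise<{ id: string } | null>; // look up order in destination
  cancelOrder: (orderId: string) => Promise<void>;                // cancel it in destination
  requeue: (orderId: string, delaySeconds: number) => Promise<void>; // retry later
};

async function handleCancellation(orderId: string, deps: CancelDeps): Promise<void> {
  const existing = await deps.findOrder(orderId);
  if (!existing) {
    // The original order hasn't reached the destination yet; re-queue the
    // cancellation with a delay instead of dropping it on the floor.
    await deps.requeue(orderId, 60);
    return;
  }
  await deps.cancelOrder(orderId);
}
```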
3. API rate limits or temporary failures
The destination API rejects requests during high-volume periods or maintenance windows. The integration doesn't retry, or retries with the wrong backoff strategy.
Example: Black Friday traffic causes a Stripe-to-analytics integration to fail for 90 minutes. Without retry logic, the 1,200 transactions processed during that window never reach analytics.
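The fix is boring and well understood: retry with exponential backoff. A minimal sketch; the attempt count and delay cap are arbitrary starting points, not tuned values:

```typescript
// Minimal retry-with-exponential-backoff wrapper for calls that can hit rate
// limits or transient failures.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delayMs = Math.min(1000 * 2 ** attempt, 60_000); // 1s, 2s, 4s, ... capped at 60s
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```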
4. Authentication tokens silently expiring
OAuth tokens for the integration expire and don't auto-refresh. The integration "fails", but the failure looks like "no data syncing" rather than an explicit error, so discovery is delayed until someone notices that new data has stopped arriving.
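If you're building or wrapping the integration yourself, the defence is to refresh proactively rather than waiting for a silent failure. A generic OAuth2 refresh sketch; the endpoint and response fields follow the usual refresh_token convention, not any specific vendor's API:

```typescript
// Refresh slightly before expiry instead of trusting a stored token.
type StoredToken = { accessToken: string; refreshToken: string; expiresAt: number };

async function getValidToken(
  token: StoredToken,
  tokenUrl: string,
  clientId: string,
  clientSecret: string
): Promise<StoredToken> {
  if (Date.now() < token.expiresAt - 60_000) return token; // still valid, refresh a minute early

  const res = await fetch(tokenUrl, {
    method: "POST",
    body: new URLSearchParams({
      grant_type: "refresh_token",
      refresh_token: token.refreshToken,
      client_id: clientId,
      client_secret: clientSecret,
    }),
  });
  if (!res.ok) throw new Error(`Token refresh failed: ${res.status}`); // fail loudly, never silently
  const data = (await res.json()) as { access_token: string; refresh_token?: string; expires_in: number };
  return {
    accessToken: data.access_token,
    refreshToken: data.refresh_token ?? token.refreshToken,
    expiresAt: Date.now() + data.expires_in * 1000,
  };
}
```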
5. Vendor-side changes the integration didn't anticipate
The source or destination service changes its API, removes a field, or deprecates an endpoint. The integration was built for the old behaviour; when the change rolls out, it breaks for some subset of operations.
Example: Mailchimp deprecates a particular merge field type. Existing data still works, but new records using that field type are silently dropped. The integration "works" but new data isn't flowing.
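A cheap partial defence is a shape check on incoming payloads: verify that the fields the integration depends on are still there, and alert loudly when they are not. A sketch with an assumed field list and alert callback:

```typescript
// Confirm the fields the integration depends on are still present before processing.
const REQUIRED_FIELDS = ["id", "email", "created_at"] as const;

function hasExpectedShape(payload: Record<string, unknown>, alert: (msg: string) => void): boolean {
  const missing = REQUIRED_FIELDS.filter((field) => !(field in payload));
  if (missing.length > 0) {
    alert(`Webhook payload is missing expected fields: ${missing.join(", ")}`);
    return false;
  }
  return true;
}
```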
The four-step fix
Step 1: Identify the failure pattern
Don't try to fix the integration before you understand what is failing.
Pull a sample of data from both sides:
- 100 records from the source system (e.g., Shopify orders from the last 30 days)
- The corresponding 100 records from the destination (QuickBooks invoices)
Compare row-by-row. Where do they not match?
The discrepancies tell you the pattern:
- Missing entirely → integration didn't fire (auth issue, rate limit, race condition)
- Present but with wrong values → field mapping issue (likely edge case)
- Present but stale → integration fired but didn't update on subsequent changes
- Duplicate → integration fired multiple times for the same source record
The pattern points to the cause. Without this analysis, you're guessing.
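If both exports fit in memory, the comparison itself is short. A rough sketch of the bucketing; the id and total fields are placeholders for whatever your two systems actually have in common:

```typescript
// Index both sides by a shared key and bucket the discrepancies.
type Row = { id: string; total: number };

function classifyDiscrepancies(source: Row[], destination: Row[]) {
  const destById = new Map<string, Row[]>();
  for (const rec of destination) {
    destById.set(rec.id, [...(destById.get(rec.id) ?? []), rec]);
  }

  const missing: string[] = [];    // never reached the destination
  const mismatched: string[] = []; // present, but with different values
  const duplicated: string[] = []; // synced more than once

  for (const src of source) {
    const matches = destById.get(src.id) ?? [];
    if (matches.length === 0) missing.push(src.id);
    else if (matches.length > 1) duplicated.push(src.id);
    else if (matches[0].total !== src.total) mismatched.push(src.id);
  }
  return { missing, mismatched, duplicated };
}
```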
Step 2: Decide between fix, replace, and middleware
Three options based on what you found:
Option A: Fix the existing integration. The right choice when the problem is small (a specific edge case), the integration is otherwise reliable, and the vendor is responsive to bug reports. Submit the issue with reproducible examples; vendors usually fix bugs in 2–8 weeks.
Option B: Replace the integration. The right choice when multiple failure patterns are present, the integration vendor is unresponsive, or the integration is fundamentally architected wrong (e.g., polling-based when it should be webhook-based). Migrate to a different integration vendor. Cost: 1–4 weeks of work plus migration.
Option C: Add middleware. The right choice when the source and destination integrations are mostly fine but specific edge cases need custom handling. Insert a small custom service (Cloudflare Workers, AWS Lambda, a tiny Node app) between source and destination that handles the edge cases.
Most SMBs default to Option A, then waste months trying to make a flawed integration work, then eventually move to Option B or C anyway. The right call is often C — middleware — for problems that don't justify a full replacement.
Step 3: Implement with reconciliation
Whichever option you choose, build in reconciliation. After a period of operation:
- Pull source records from a known time window
- Pull destination records from the same window
- Compare. Identify discrepancies. Fix them.
Reconciliation should run weekly initially, monthly after stability is established. It catches drift before it accumulates.
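If the middleware already runs on Cloudflare Workers, the reconciliation can ride along on a cron trigger. A sketch using the Workers scheduled handler (module syntax; type names come from @cloudflare/workers-types); the schedule and the reconciliation body are placeholders:

```typescript
// The cron expression lives in wrangler.toml, e.g. [triggers] crons = ["0 6 * * 1"]
// (Mondays, 06:00 UTC). runReconciliation stands in for the comparison from Step 1.
interface Env {}

async function runReconciliation(env: Env): Promise<void> {
  // Pull source and destination records for the window, compare, report drift.
}

export default {
  async scheduled(_controller: ScheduledController, env: Env, _ctx: ExecutionContext): Promise<void> {
    await runReconciliation(env);
  },
};
```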
Step 4: Set up monitoring
Three things to monitor going forward:
- Sync success rate: percentage of source events that successfully reach destination. Should be 99%+.
- Sync latency: how long between source event and destination update. Should be steady; sudden increases indicate problems.
- Reconciliation report drift: the count of discrepancies found by your reconciliation runs. Should trend toward zero.
Most SMBs have no integration monitoring — they only know about problems when someone manually notices missing data. With monitoring, problems surface in hours instead of months.
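Monitoring doesn't need a dashboard product to start with; it can be a function run over the middleware's own logs. A back-of-the-envelope sketch with an assumed log shape and the thresholds above:

```typescript
// Health check over the middleware's own delivery logs.
type SyncLog = { sourceEventAt: number; deliveredAt: number | null };

function checkSyncHealth(logs: SyncLog[], alert: (msg: string) => void): void {
  if (logs.length === 0) return;

  const delivered = logs.filter((l) => l.deliveredAt !== null);
  const successRate = delivered.length / logs.length;
  if (successRate < 0.99) {
    alert(`Sync success rate ${(successRate * 100).toFixed(1)}% (target: 99%+)`);
  }

  const latencies = delivered.map((l) => (l.deliveredAt as number) - l.sourceEventAt);
  const avgMs = latencies.length ? latencies.reduce((a, b) => a + b, 0) / latencies.length : 0;
  if (avgMs > 5 * 60_000) {
    alert(`Average sync latency is ${Math.round(avgMs / 1000)}s; investigate`);
  }
}
```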
The middleware pattern in detail
When middleware is the right answer, here's what it usually looks like:
Source webhooks fire from Shopify (or whatever).
Middleware service (a few hundred lines of code on Cloudflare Workers, Lambda, or a tiny app on Render) receives the webhooks. It:
- Logs the incoming event
- Transforms the data to handle edge cases (escapes special characters, applies field mappings, splits oversized records)
- Calls the destination API with the transformed data
- Logs success/failure with full context
- Retries failed operations with exponential backoff
Destination service receives clean, properly-formatted data.
Why this pattern works: the source webhook does what it does (you can't change Shopify). The destination API does what it does (you can't change QuickBooks). The middleware is the only place you control. You can fix any edge case there, log everything, and have full visibility.
Middleware is usually 100–500 lines of code. Cost to deploy on Cloudflare Workers: $0 for low volume, $5/month for moderate volume. Time to build: 1–3 days for someone experienced with the source/destination APIs.
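Here's roughly what that looks like as a Cloudflare Worker (module syntax). The env bindings, transform body, and destination endpoint are placeholders, and a production version would also verify the webhook signature before trusting the payload:

```typescript
// Sketch of the middleware pattern: log, transform, forward with retry.
interface Env {
  DESTINATION_URL: string;   // destination API endpoint (assumption)
  DESTINATION_TOKEN: string; // credential stored as a Worker secret
}

// Placeholder: the edge-case handling specific to your integration lives here.
function transform(event: unknown): unknown {
  return event;
}

async function forwardWithRetry(url: string, token: string, body: unknown, maxAttempts = 4): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
      body: JSON.stringify(body),
    });
    if (res.ok) return;
    console.error(`delivery attempt ${attempt + 1} failed with status ${res.status}`);
    await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt)); // exponential backoff
  }
  throw new Error(`Destination rejected event after ${maxAttempts} attempts`);
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const event: unknown = await request.json();
    console.log("received", JSON.stringify(event));       // log the raw incoming event first

    const transformed = transform(event);                  // escape, remap, split oversized records
    await forwardWithRetry(env.DESTINATION_URL, env.DESTINATION_TOKEN, transformed);

    console.log("delivered", JSON.stringify(transformed)); // log success with full context
    return new Response("ok", { status: 200 });
  },
};
```

One design note: if delivery ultimately fails, the worker throws and the source sees a non-2xx response. Most webhook providers, Shopify included, retry deliveries that don't get a 2xx, so a destination outage turns into delayed data rather than lost data.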
When to hire help
Integration debugging and middleware development is L3 work that benefits from someone who's done it before:
- You can't identify the failure pattern from your data analysis (Step 1).
- The integration involves three or more systems and the failure is in the middle.
- You need to build middleware and your team doesn't have API integration experience.
- The integration is high-volume (thousands of events per day) and reliability is business-critical.
The Lead Steer monthly retainer handles integration work as part of ongoing L3 support. Most integration fixes take 5–15 hours from diagnosis through middleware deployment.
A practical example
Concrete example to make this less abstract:
Situation: an SMB e-commerce store on Shopify with a QuickBooks integration via a popular third-party app. Daily reconciliation shows ~2% of orders missing from QuickBooks, about 8 per week, and it has been happening for 9 months.
Step 1 (identify pattern): Pull missing orders from Shopify, look for commonalities. Pattern emerges: all missing orders are international (non-USD currency) with multi-line items. The integration handles single-currency multi-line and multi-currency single-line, but fails on multi-currency multi-line.
Step 2 (decide): The integration vendor's support says they're "aware of the issue" and it'll be fixed "in a future release." Six months of "future release" promises later, no fix. Decision: middleware.
Step 3 (implement): Build a Cloudflare Worker that receives Shopify webhooks, detects multi-currency multi-line orders, transforms them into a format QuickBooks can handle, and forwards them to the standard integration. The standard integration handles everything else as before. Total build: 2 days, $0 ongoing cost.
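The core of that worker is a routing decision. A sketch of the detection logic; the field names loosely mirror Shopify's order webhook but are simplified, so treat them as assumptions:

```typescript
// Only non-USD, multi-line orders take the custom transform path.
type LineItem = { title: string; quantity: number };
type OrderEvent = { currency: string; line_items: LineItem[] };

function needsCustomHandling(order: OrderEvent): boolean {
  return order.currency !== "USD" && order.line_items.length > 1;
}
```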
Step 4 (monitor): The Cloudflare Worker logs every transformation. A weekly reconciliation script compares Shopify and QuickBooks order counts. Drift goes from 8/week to 0/week immediately and stays at 0.
Total cost: about 12 hours of work. Saved: 9 months of accumulating reconciliation drift, plus all future drift the integration would have caused.
What to do next
The companion articles cover other recurring L3 problems:
- The Backup You Have But Probably Can't Restore
- Server Suddenly Slow? The Diagnostic Tree
- The Eight-Question Brief That Saves $10,000 in Offshore Hiring Mistakes — the same brief discipline applies to scoping integration work
---
Part of the Level 3 Tech Support pillar guide.