# What This Guide Covers
A reliable n8n data sync workflow is not just “fetch data, insert rows”. Production syncs fail due to pagination bugs, timestamp edge cases, duplicate deliveries, and silent drift that only shows up in dashboards weeks later.
This guide shows a robust design for:
- Cursor-based pagination that never skips or duplicates pages
- Incremental loads with safe high-watermarks and overlap windows
- Idempotency keys and upserts for safe re-runs
- Deduplication storage that survives restarts
- CDC patterns in n8n, including webhook-based and poll-based
- Monitoring metrics that detect drift before stakeholders do
If you need baseline API patterns first, read the API Integration Guide. For reliability patterns like retries and alerts, see n8n Error Handling, Retries, and Alerting. For metrics and tracing concepts, see the Web App Observability Guide.
# Data Sync Reliability Requirements
Most teams discover reliability requirements after the first incident. Make them explicit up front.
## What “Reliable Sync” Means in Practice
A reliable sync should meet these conditions:
- No data loss: every eligible record is eventually synced.
- No duplicates: downstream does not double count events or entities.
- Idempotent re-runs: you can replay any time window safely.
- Deterministic pagination: new records appearing mid-sync do not reshuffle pages.
- Detectable drift: you measure divergence between source and target.
## Failure Modes You Must Design For
| Failure mode | Symptom | Root cause | Mitigation |
|---|---|---|---|
| Skipped records | Gaps in target | Offset pagination + concurrent inserts | Cursor pagination + stable sort |
| Duplicate records | Inflation in counts | Retries replay same page | Idempotency keys + upserts |
| Missed updates | Stale fields | Timestamp watermark moves too far | Overlap window + tie-breaker |
| Partial loads | Only some pages | Timeout, rate limits, memory | Batch commits + backoff + streaming pages |
| Silent drift | Slowly diverging totals | API filtering changes, schema changes | Metrics, reconciliation checks, alerts |
🎯 Key Takeaway: Reliability is mostly about state and determinism: you need durable watermarks, stable pagination, and idempotent writes so retries become safe.
# Choose the Right Sync Strategy
Different sources and APIs require different approaches. Pick the simplest strategy that guarantees correctness.
## Snapshot, Incremental, or CDC
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Full snapshot (replace) | Small datasets, daily refresh | Simple logic | Expensive, high risk of rate limits |
| Incremental load (poll) | APIs with updated_at | Efficient | Timestamp edge cases |
| CDC (webhook / log) | High-volume changes, low latency | Near-real-time | Requires change feed correctness |
A practical hybrid is common: initial snapshot followed by incremental updates, plus webhooks if the provider supports them.
# Cursor-Based Pagination That Does Not Break
Offset pagination is attractive but dangerous. With offset, if new rows are inserted at the beginning during your sync, your “page 2” shifts and you skip or duplicate records.
Cursor pagination avoids that by using a stable cursor token or last-seen field.
## What to Ask of an API
A robust paginated endpoint ideally supports:

- Sorting by a stable field, typically `updated_at` then `id`
- Returning a `next_cursor` token, or allowing `starting_after`-style cursors
- Consistent results for a given cursor
- Explicit page size limits and rate limit headers
If you control the API, implement cursor pagination. If not, you can often emulate it using `updated_at` and a tie-breaker.
## Example Pagination Loop in n8n
This pattern uses:
- A stored cursor state in a database table
- Loop: fetch page, process items, update cursor, repeat until no cursor
You can implement the loop with Split In Batches or by re-invoking a subworkflow, but the cleanest approach is “request, then IF has next cursor, set cursor, repeat”.
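The loop logic can be sketched as plain JavaScript. This is a minimal sketch of the control flow only; `fetchPage` is a hypothetical stand-in for one HTTP Request node execution that returns `{ data, next_cursor }`.

```javascript
// Sketch of a cursor pagination loop: fetch, process, persist cursor,
// repeat until the API reports no next cursor.
function drainPages(fetchPage, startCursor) {
  let cursor = startCursor; // null on the first request
  const all = [];
  let pages = 0;
  do {
    const page = fetchPage(cursor); // one HTTP Request execution
    all.push(...page.data);         // process / upsert the items here
    cursor = page.next_cursor;      // persist this to the state table
    pages += 1;
  } while (cursor); // stop when next_cursor is null or empty
  return { all, pages };
}
```

In n8n the "persist cursor" step is a database write to the state table, so a crash mid-run resumes from the last committed page instead of page one.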
Data model for cursor state
| Field | Type | Meaning |
|---|---|---|
| `sync_name` | text | Identifier for this sync, like `crm_contacts` |
| `cursor` | text | Opaque cursor token or last-seen marker |
| `updated_at` | timestamp | Last successful advancement time |
HTTP Request with cursor:

```json
{
  "method": "GET",
  "url": "https://api.example.com/v1/contacts",
  "qs": {
    "limit": 200,
    "cursor": "={{$json.cursor}}"
  }
}
```

Process response fields:

- `data`: array of records
- `next_cursor`: token or null
💡 Tip: Always log the cursor and the count processed per page. When debugging drift, you will want to correlate gaps to specific cursors and run IDs.
## Deterministic Sorting for Cursor Emulation
If the API does not return an opaque cursor token, you can build a cursor from fields:

- Primary: `updated_at`
- Tie-breaker: `id`

The cursor becomes a compound marker: the last processed `updated_at` and the last processed `id` for that timestamp.
Avoid using only `updated_at`. Many systems have second-level precision, so dozens of records can share the same timestamp. Without a tie-breaker, you can loop or skip.
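The compound comparison can be written as a small predicate. This is a sketch under the assumption that `updated_at` is an ISO-8601 string (which compares correctly as text) and that ids are numeric:

```javascript
// A record is "after" the compound cursor if its timestamp is strictly
// later, or equal with a larger id. The numeric tie-breaker matters:
// comparing numeric ids as strings ("100" < "99") would skip records.
function isAfterCursor(record, cursor) {
  if (record.updated_at > cursor.updated_at) return true;
  if (record.updated_at < cursor.updated_at) return false;
  return Number(record.id) > Number(cursor.id);
}
```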
# Incremental Loads with Safe Watermarks
Incremental loads reduce cost, but the watermark logic must handle late updates, clock skew, and partial failures.
## High-Watermark Design

Track:

- `watermark_ts`: last committed `updated_at` you fully processed
- `watermark_id`: last committed `id` at `watermark_ts` (tie-breaker)
- `overlap_minutes`: re-read a small window to catch late-arriving updates
A common overlap window is 5 to 15 minutes. If the source is eventually consistent or has delayed indexing, use a larger overlap.
Query window:

Compute a start time:

`start_ts = watermark_ts - overlap`

Then request records where `updated_at` is greater than or equal to `start_ts`.

Additionally filter in processing:

- Skip records that are strictly before the committed watermark
- For equal timestamps, only process `id` greater than the watermark id
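The window computation is a one-liner in a Code node. A minimal sketch, assuming the watermark is stored as an ISO-8601 string:

```javascript
// start_ts = watermark_ts - overlap, as an ISO string for the API query.
function computeStartTs(watermarkTs, overlapMinutes) {
  const ms = Date.parse(watermarkTs) - overlapMinutes * 60 * 1000;
  return new Date(ms).toISOString();
}
```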
## Example: Filter Logic in a Code node
This keeps the watermark safe while allowing overlap.

```javascript
// Runs in an n8n Code node; wmTs and wmId come from the committed watermark.
const wmTs = $json.watermark_ts; // ISO-8601 string, compares correctly as text
const wmId = $json.watermark_id; // string or number

// Compare ids numerically when possible: lexicographic comparison of
// numeric ids ("100" < "99") would skip records.
const idAfter = (a, b) =>
  isNaN(a) || isNaN(b) ? String(a) > String(b) : Number(a) > Number(b);

return items.filter((item) => {
  const ts = item.json.updated_at;
  if (ts > wmTs) return true;   // strictly newer than the watermark: keep
  if (ts < wmTs) return false;  // strictly older: already committed
  return idAfter(item.json.id, wmId); // same timestamp: tie-break on id
});
```

## When to Advance the Watermark
Advance the watermark only after:

1. You successfully wrote the processed batch to the destination.
2. You recorded dedup keys, if you rely on them.
3. You can guarantee the run has no partial commits remaining.
If you advance the watermark immediately after fetching, a write failure will cause data loss.
⚠️ Warning: Never store “last seen” based on the last item fetched. Store “last committed” based on the last item successfully written.
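"Last committed" can be computed directly from the batch that was confirmed written. A sketch, assuming records carry `updated_at` and numeric `id`:

```javascript
// Derive the new watermark from a successfully written batch: the max
// (updated_at, id) pair. Call this only after the destination write commits.
function advanceWatermark(current, committedBatch) {
  let wm = { ...current };
  for (const rec of committedBatch) {
    const later =
      rec.updated_at > wm.watermark_ts ||
      (rec.updated_at === wm.watermark_ts &&
        Number(rec.id) > Number(wm.watermark_id));
    if (later) wm = { watermark_ts: rec.updated_at, watermark_id: rec.id };
  }
  return wm; // unchanged if the batch was empty
}
```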
# Idempotency Keys and Deduplication
Retries and overlaps will create duplicates unless you make writes idempotent and track what you already processed.
## Idempotency Keys for Writes
If you write to an API that supports idempotency headers, use them. If you write to a database, use upsert constraints.
Examples:

- `idempotency_key = source + ":" + entity + ":" + entity_id + ":" + updated_at`
- For events: include event type and event timestamp.

Make sure the key changes when the record changes. If you only use `entity_id`, you can accidentally block legitimate updates.
## Upsert Patterns
For a SQL destination, create a unique constraint on the natural key:

- Contacts: `source_contact_id`
- Invoices: `source_invoice_id`
- Line items: `source_invoice_id + source_line_id`
Then use upsert so duplicates become updates.
Example Postgres upsert SQL:

```sql
INSERT INTO crm_contacts (source_id, email, name, updated_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (source_id)
DO UPDATE SET
  email = EXCLUDED.email,
  name = EXCLUDED.name,
  updated_at = EXCLUDED.updated_at;
```

In n8n, run this via the Postgres node or a generic database node.
## Deduplication Storage That Survives Restarts
Dedup in memory is not enough. n8n can restart, runs can overlap, and retries can happen days later.
Use one of these state stores:
| Store | Best for | Pros | Cons |
|---|---|---|---|
| Postgres table | Most production syncs | Durable, queryable, backups | Needs cleanup strategy |
| Redis set with TTL | High volume event dedup | Fast, built-in expiry | Operational dependency |
| n8n workflow static data | Small low-risk syncs | Simple | Not safe for high volume, tricky migrations |
A practical schema for dedup keys:

| Field | Type | Notes |
|---|---|---|
| `sync_name` | text | `crm_contacts` |
| `dedup_key` | text | hashed idempotency key |
| `first_seen_at` | timestamp | when processed |
| `expires_at` | timestamp | TTL cleanup |
The dedup TTL depends on your overlap window and retry policy. If you overlap 15 minutes and can retry for up to 24 hours, keep the TTL at 24 to 72 hours at minimum.
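The check-then-record semantics look like this. Note this uses an in-memory `Map` purely as a stand-in so the logic is visible; in production the store is the Postgres table or Redis set above, and the check and insert should happen in one statement to avoid races between overlapping runs.

```javascript
// Returns true the first time a key is seen within its TTL, false for
// duplicates. Expired keys are treated as new and re-recorded.
function shouldProcess(store, key, nowMs, ttlMs) {
  const expiresAt = store.get(key);
  if (expiresAt !== undefined && expiresAt > nowMs) return false; // duplicate
  store.set(key, nowMs + ttlMs); // record first_seen + expires_at
  return true;
}
```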
ℹ️ Note: If your destination upsert is truly correct, dedup becomes a cost optimization. If your destination cannot upsert cleanly, dedup becomes a correctness requirement.
# CDC in n8n: Practical Patterns
CDC (change data capture) means you process changes, not snapshots. In n8n you can approximate CDC in three common ways.
## Pattern 1: Provider Webhooks
If the source supports webhooks, this is usually the most reliable low-latency option.
Workflow outline:

1. Webhook trigger receives event.
2. Validate signature and parse payload.
3. Dedup by event id.
4. Fetch full entity if payload is partial.
5. Upsert into destination.
6. Record checkpoint and metrics.
Key requirement: the provider must include a stable event id, delivery id, or sequence.
## Pattern 2: Change Log Endpoint
Some APIs expose endpoints like `GET /changes?since=...`.
This is CDC-like and often more reliable than guessing with `updated_at` across many endpoints.
Design:

- Store `since_token` or `since_ts`.
- Page through changes with cursor pagination.
- For each change, apply an idempotent upsert or delete.
## Pattern 3: Poll and Diff (Last Resort)
If there is no `updated_at` and no change feed, you can periodically fetch a list and diff.
This is expensive and can be brittle, but can work for small datasets.
Use it only when:
- Total entities are low enough to fetch fully.
- The API has stable IDs.
- You can tolerate more load and latency.
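The diff itself is a set comparison. A minimal sketch operating on ID lists; the ID fields and the decision to re-upsert everything fetched are assumptions you would tune per dataset:

```javascript
// Compare a full source snapshot against the IDs currently in the target.
// IDs present only in the target are deletion (tombstone) candidates.
function diffSnapshots(sourceIds, targetIds) {
  const src = new Set(sourceIds);
  const tgt = new Set(targetIds);
  return {
    toUpsert: [...src],                               // re-upsert fetched records
    toDelete: [...tgt].filter((id) => !src.has(id)),  // missing from source
  };
}
```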
# Example Workflow Designs
These examples are intentionally generic so you can adapt them to HubSpot, Stripe, Shopify, internal CRMs, or custom APIs.
## Example A: Incremental Sync with Cursor Pagination and Upsert
Goal: Sync contacts from an API to Postgres with no misses or duplicates.
Workflow steps:

1. Cron Trigger every 5 minutes.
2. Get state from the Postgres `sync_state` table by `sync_name`.
3. Set the query window: `start_ts = watermark_ts - overlap`.
4. HTTP Request fetches a page with `updated_at >= start_ts` and `cursor`.
5. Filter using the watermark tie-breaker logic.
6. Upsert into Postgres using the `source_id` unique constraint.
7. Update the dedup keys table if needed.
8. Advance the cursor or watermark only after a successful upsert.
9. Loop until `next_cursor` is empty.
10. Emit metrics: processed, inserted, updated, duplicates, duration.
Suggested state table:

| Field | Example |
|---|---|
| `sync_name` | `crm_contacts` |
| `watermark_ts` | `2026-04-28T12:10:00Z` |
| `watermark_id` | `983244` |
| `cursor` | `eyJwYWdlIjoyfQ...` |
## Example B: Webhook CDC with Dedup and Backfill
Goal: React to changes instantly, but still guarantee eventual consistency.
Workflow steps:

1. Webhook Trigger receives `contact.updated` events.
2. Verify the signature using a shared secret.
3. Dedup check in the Postgres `dedup_keys` table on `event_id`.
4. Fetch the entity from the API to get the full canonical state.
5. Upsert into the destination.
6. Store the dedup key with a TTL.
7. On failure, send to a retry workflow or queue.
Add a daily backfill incremental sync using Example A to catch webhook delivery gaps. This is common because even good providers occasionally drop or delay webhooks.
💡 Tip: Combine CDC with periodic reconciliation. Webhooks optimize latency, incremental polls guarantee completeness.
# Handling Deletes and Hard-to-Detect Changes
Many syncs look correct until someone deletes data.
## Delete Strategies
| Source capability | Best approach | Implementation idea in n8n |
|---|---|---|
| Emits delete events | CDC delete handling | Webhook or changes feed triggers soft delete in target |
| Has `deleted_at` | Incremental with tombstones | Treat as update, set `is_deleted` |
| No delete signal | Reconciliation | Periodic full ID list compare and mark missing |
For many business systems, soft deletes are safer than hard deletes. Keep an `is_deleted` flag and a `deleted_at` timestamp in your destination.
# Monitoring Metrics to Catch Drift
If you do not measure drift, you will discover it from a finance report or a user complaint. Monitoring should be part of the workflow design, not an afterthought.
For observability fundamentals and metric naming, refer to Web App Observability Guide.
## Metrics That Matter for Data Sync
Track these per run and per sync:
| Metric | Type | Why it matters | Alert suggestion |
|---|---|---|---|
| `records_fetched` | counter | Detect source API changes | Sudden drop to near zero |
| `records_processed` | counter | Validate filter logic | Drop or spike compared to baseline |
| `records_inserted` | counter | Detect growth | Unexpected spikes |
| `records_updated` | counter | Detect churn | Sustained zero may indicate failure |
| `duplicates_skipped` | counter | Detect overlap and retries | Spike indicates instability |
| `run_duration_seconds` | histogram | Performance regression | Greater than 2x baseline |
| `api_429_count` | counter | Rate limiting | Any sustained 429s |
| `watermark_lag_seconds` | gauge | How far behind you are | Greater than SLA, like 900 seconds |
| `error_count` | counter | Reliability | Any non-zero with paging failures |
If you are using n8n self-hosted, push metrics to Prometheus via a lightweight endpoint, or log structured JSON and derive metrics in your log pipeline.
## Drift Detection Checks
Add a reconciliation workflow daily or weekly:
- Compare counts by day: source updates per day vs target updates per day
- Compare sample entity hashes: pick 100 random IDs and compare canonical fields
- Compare min and max `updated_at` and the watermark lag
Example reconciliation query idea:

- Source: number of contacts updated in the last 24 hours
- Target: number of rows with `updated_at` in the last 24 hours
If source is 10,000 and target is 8,000, you have drift and need investigation.
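That comparison reduces to a tolerance check. A sketch; the 2% default tolerance is an assumption to tune per dataset (late updates within the overlap window make small gaps normal):

```javascript
// Flag drift when the target count falls outside a relative tolerance
// of the source count. An empty source with a non-empty target is drift.
function hasDrift(sourceCount, targetCount, tolerance = 0.02) {
  if (sourceCount === 0) return targetCount !== 0;
  return Math.abs(sourceCount - targetCount) / sourceCount > tolerance;
}
```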
⚠️ Warning: Do not rely on “workflow succeeded” as a correctness signal. A workflow can succeed while silently skipping records due to filter or pagination bugs.
# Common Pitfalls and How to Avoid Them
1. Using offset pagination on mutable datasets — switch to cursor pagination or a stable sort by `updated_at` plus `id`.
2. Advancing the watermark before commit — only commit the watermark after destination write success.
3. Using timestamps without overlap — add an overlap window and tie-breaker to avoid missing late updates.
4. No idempotency on writes — use upserts or idempotency keys so retries are safe.
5. No durable state — store the watermark and dedup keys in Postgres or Redis, not only in memory.
6. No drift monitoring — add reconciliation and metric-based alerts early.
If you need robust retry and alert patterns in n8n, implement them using the approach in n8n Error Handling, Retries, and Alerting.
# Key Takeaways
- Prefer cursor-based pagination or stable sorting by `updated_at` plus `id` to avoid skipped or duplicated pages.
- Implement incremental loads with overlap and store the last committed watermark, not the last fetched.
- Make every write idempotent using upserts and idempotency keys so retries and replays are safe.
- Persist deduplication keys in Postgres or Redis to survive restarts and prevent duplicate processing across runs.
- Add monitoring and reconciliation using metrics like watermark lag, duplicates skipped, and daily count comparisons to catch drift early.
# Conclusion
A production-grade n8n data sync workflow is a state machine: deterministic pagination, cautious watermarks, idempotent writes, and durable dedup storage turn inevitable retries into safe replays.
If you want Samioda to design and implement a reliable sync for your CRM, billing system, or internal platform, contact us with your source and destination details and your required sync SLA. We will propose a workflow architecture, state storage, and monitoring plan that prevents drift and scales with volume.