
Reliable Data Sync in n8n: Pagination, Incremental Loads, Deduplication, and CDC

Adrijan Omićević · 14 min read

# What This Guide Covers

A reliable n8n data sync workflow is not just “fetch data, insert rows”. Production syncs fail due to pagination bugs, timestamp edge cases, duplicate deliveries, and silent drift that only shows up in dashboards weeks later.

This guide shows a robust design for:

  • Cursor-based pagination that never skips or duplicates pages
  • Incremental loads with safe high-watermarks and overlap windows
  • Idempotency keys and upserts for safe re-runs
  • Deduplication storage that survives restarts
  • CDC patterns in n8n, including webhook-based and poll-based
  • Monitoring metrics that detect drift before stakeholders do

If you need baseline API patterns first, read API Integration Guide. For reliability patterns like retries and alerts, see n8n Error Handling, Retries, and Alerting. For metrics and tracing concepts, use Web App Observability Guide.

# Data Sync Reliability Requirements

Most teams discover reliability requirements after the first incident. Make them explicit up front.

## What “Reliable Sync” Means in Practice

A reliable sync should meet these conditions:

  • No data loss: every eligible record is eventually synced.
  • No duplicates: downstream does not double count events or entities.
  • Idempotent re-runs: you can replay any time window safely.
  • Deterministic pagination: new records appearing mid-sync do not reshuffle pages.
  • Detectable drift: you measure divergence between source and target.

## Failure Modes You Must Design For

| Failure mode | Symptom | Root cause | Mitigation |
| --- | --- | --- | --- |
| Skipped records | Gaps in target | Offset pagination + concurrent inserts | Cursor pagination + stable sort |
| Duplicate records | Inflation in counts | Retries replay same page | Idempotency keys + upserts |
| Missed updates | Stale fields | Timestamp watermark moves too far | Overlap window + tie-breaker |
| Partial loads | Only some pages | Timeout, rate limits, memory | Batch commits + backoff + streaming pages |
| Silent drift | Slowly diverging totals | API filtering changes, schema changes | Metrics, reconciliation checks, alerts |

🎯 Key Takeaway: Reliability is mostly about state and determinism: you need durable watermarks, stable pagination, and idempotent writes so retries become safe.

# Choose the Right Sync Strategy

Different sources and APIs require different approaches. Pick the simplest strategy that guarantees correctness.

## Snapshot, Incremental, or CDC

| Strategy | When to use | Pros | Cons |
| --- | --- | --- | --- |
| Full snapshot (replace) | Small datasets, daily refresh | Simple logic | Expensive, high risk of rate limits |
| Incremental load (poll) | APIs with updated_at | Efficient | Timestamp edge cases |
| CDC (webhook / log) | High-volume changes, low latency | Near-real-time | Requires change feed correctness |

A practical hybrid is common: initial snapshot followed by incremental updates, plus webhooks if the provider supports them.

# Cursor-Based Pagination That Does Not Break

Offset pagination is attractive but dangerous. With offset, if new rows are inserted at the beginning during your sync, your “page 2” shifts and you skip or duplicate records.

Cursor pagination avoids that by using a stable cursor token or last-seen field.

## What to Ask of an API

A robust paginated endpoint ideally supports:

  • Sorting by a stable field, typically updated_at then id
  • Returning a next_cursor token, or allowing starting_after style cursors
  • Consistent results for a given cursor
  • Explicit page size limits and rate limit headers

If you control the API, implement cursor pagination. If not, you can often emulate it using updated_at and a tie-breaker.

## Example Pagination Loop in n8n

This pattern uses:

  • A stored cursor state in a database table
  • Loop: fetch page, process items, update cursor, repeat until no cursor

You can implement the loop with Split In Batches or by re-invoking a subworkflow, but the cleanest approach is “request, then IF has next cursor, set cursor, repeat”.
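In plain JavaScript, the loop reduces to a few lines. This is a minimal sketch, assuming a hypothetical `fetchPage(cursor)` wrapper around your HTTP Request node that returns `{ data, next_cursor }`, and a `processBatch` that performs the downstream upserts:

```javascript
// Minimal cursor-pagination loop. `fetchPage` and `processBatch` are
// stand-ins for your HTTP Request node and destination writes.
async function syncAllPages(fetchPage, processBatch, savedCursor = null) {
  let cursor = savedCursor;
  let total = 0;
  do {
    const page = await fetchPage(cursor);  // fetch one page at the cursor
    await processBatch(page.data);         // upsert this batch downstream
    total += page.data.length;
    cursor = page.next_cursor;             // in production, persist this before looping
  } while (cursor);                        // a null/empty cursor ends the loop
  return total;
}
```

In a real workflow the cursor should be persisted to the state table after each successfully committed page, so an interrupted run resumes where it stopped instead of restarting.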

Data model for cursor state

FieldTypeMeaning
sync_nametextIdentifier for this sync, like crm_contacts
cursortextOpaque cursor token or last-seen marker
updated_attimestampLast successful advancement time

HTTP Request with cursor

```json
{
  "method": "GET",
  "url": "https://api.example.com/v1/contacts",
  "qs": {
    "limit": 200,
    "cursor": "={{$json.cursor}}"
  }
}
```

Process response fields:

  • data: array of records
  • next_cursor: token or null

💡 Tip: Always log the cursor and the count processed per page. When debugging drift, you will want to correlate gaps to specific cursors and run IDs.

## Deterministic Sorting for Cursor Emulation

If the API does not return an opaque cursor token, you can build a cursor from fields:

  • Primary: updated_at
  • Tie-breaker: id

Cursor becomes a compound marker: last processed updated_at and last processed id for that timestamp.

Avoid using only updated_at. Many systems have second-level precision, so dozens of records can share the same timestamp. Without a tie-breaker, you can loop or skip.
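A compound cursor check can be expressed as a small pure function. This sketch assumes ISO 8601 timestamps (which compare correctly as plain strings) and numeric ids; the field names are illustrative:

```javascript
// Compare a record against a compound cursor (updated_at, id).
// Returns true when the record sorts strictly after the cursor.
function isAfterCursor(record, cursor) {
  if (record.updated_at > cursor.updated_at) return true;   // later timestamp
  if (record.updated_at < cursor.updated_at) return false;  // earlier timestamp
  // Same timestamp: fall back to the id tie-breaker, compared numerically
  return Number(record.id) > Number(cursor.id);
}
```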

# Incremental Loads with Safe Watermarks

Incremental loads reduce cost, but the watermark logic must handle late updates, clock skew, and partial failures.

## High-Watermark Design

Track:

  • watermark_ts: last committed updated_at you fully processed
  • watermark_id: last committed id at watermark_ts (tie-breaker)
  • overlap_minutes: re-read a small window to catch late-arriving updates

A common overlap window is 5 to 15 minutes. If the source is eventually consistent or has delayed indexing, use a larger overlap.

Query window

Compute a start time:

  • start_ts = watermark_ts - overlap

Then request records where:

  • updated_at is greater than or equal to start_ts

Additionally filter in processing:

  • Skip records that are strictly before the committed watermark
  • For equal timestamps, only process id greater than watermark id
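The window computation itself is one line of date arithmetic. A sketch, assuming the watermark is stored as an ISO 8601 string:

```javascript
// Compute the incremental query window: start_ts = watermark_ts - overlap.
// The overlap re-reads a small window to catch late-arriving updates.
function queryWindowStart(watermarkTs, overlapMinutes) {
  const start = new Date(Date.parse(watermarkTs) - overlapMinutes * 60 * 1000);
  return start.toISOString();
}
```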

## Example: Filter Logic in a Code node

This keeps the watermark safe while allowing overlap.

```javascript
// Runs in an n8n Code node (Function-node style, where `items` is in scope)
const wmTs = $json.watermark_ts; // ISO 8601 string, last committed watermark
const wmId = $json.watermark_id; // id committed at that timestamp

return items.filter((item) => {
  const ts = item.json.updated_at;
  const id = item.json.id;

  if (ts > wmTs) return true;  // ISO 8601 strings compare correctly as strings
  if (ts < wmTs) return false; // strictly before the committed watermark: skip
  // Equal timestamps: apply the id tie-breaker. Compare numerically when ids
  // are numeric; a plain string comparison would sort "10" before "9".
  return Number(id) > Number(wmId);
});
```

## When to Advance the Watermark

Advance the watermark only after:

  1. You successfully wrote the processed batch to the destination.
  2. You recorded dedup keys, if you rely on them.
  3. You can guarantee the run has no partial commits remaining.

If you advance the watermark immediately after fetching, a write failure will cause data loss.

⚠️ Warning: Never store “last seen” based on the last item fetched. Store “last committed” based on the last item successfully written.

# Idempotency Keys and Deduplication

Retries and overlaps will create duplicates unless you make writes idempotent and track what you already processed.

## Idempotency Keys for Writes

If you write to an API that supports idempotency headers, use them. If you write to a database, use upsert constraints.

Examples:

  • idempotency_key = source + ":" + entity + ":" + entity_id + ":" + updated_at
  • For events: include event type and event timestamp.

Make sure the key changes when the record changes. If you only use entity_id, you can accidentally block legitimate updates.

## Upsert Patterns

For a SQL destination, create a unique constraint on the natural key:

  • Contacts: source_contact_id
  • Invoices: source_invoice_id
  • Line items: source_invoice_id + source_line_id

Then use upsert so duplicates become updates.

Example Postgres upsert SQL:

```sql
INSERT INTO crm_contacts (source_id, email, name, updated_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (source_id)
DO UPDATE SET
  email = EXCLUDED.email,
  name = EXCLUDED.name,
  updated_at = EXCLUDED.updated_at;
```

In n8n, run this via Postgres node or a generic database node.

## Deduplication Storage That Survives Restarts

Dedup in memory is not enough. n8n can restart, runs can overlap, and retries can happen days later.

Use one of these state stores:

| Store | Best for | Pros | Cons |
| --- | --- | --- | --- |
| Postgres table | Most production syncs | Durable, queryable, backups | Needs cleanup strategy |
| Redis set with TTL | High volume event dedup | Fast, built-in expiry | Operational dependency |
| n8n workflow static data | Small low-risk syncs | Simple | Not safe for high volume, tricky migrations |

A practical schema for dedup keys:

| Field | Type | Notes |
| --- | --- | --- |
| sync_name | text | crm_contacts |
| dedup_key | text | hashed idempotency key |
| first_seen_at | timestamp | when processed |
| expires_at | timestamp | TTL cleanup |

Dedup TTL depends on your overlap window and retry policy. If you overlap 15 minutes and can retry up to 24 hours, keep TTL at least 24 to 72 hours.
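A sketch of building a dedup row matching the schema above, with a TTL-derived `expires_at` (the `dedupRow` helper is illustrative):

```javascript
// Build a dedup row with a TTL-based expires_at. ttlHours should cover
// your overlap window plus the longest possible retry delay.
function dedupRow(syncName, dedupKey, ttlHours, now = new Date()) {
  return {
    sync_name: syncName,
    dedup_key: dedupKey,
    first_seen_at: now.toISOString(),
    expires_at: new Date(now.getTime() + ttlHours * 3600 * 1000).toISOString(),
  };
}
```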

ℹ️ Note: If your destination upsert is truly correct, dedup becomes a cost optimization. If your destination cannot upsert cleanly, dedup becomes a correctness requirement.

# CDC in n8n: Practical Patterns

CDC means you process changes, not snapshots. In n8n you can approximate CDC in three common ways.

## Pattern 1: Provider Webhooks

If the source supports webhooks, this is usually the most reliable low-latency option.

Workflow outline:

  1. Webhook trigger receives event.
  2. Validate signature and parse payload.
  3. Dedup by event id.
  4. Fetch full entity if payload is partial.
  5. Upsert into destination.
  6. Record checkpoint and metrics.

Key requirement: the provider must include a stable event id, delivery id, or sequence.

## Pattern 2: Change Log Endpoint

Some APIs expose endpoints like GET /changes?since=....

This is CDC-like and often more reliable than guessing with updated_at across many endpoints.

Design:

  • Store since_token or since_ts.
  • Page through changes with cursor pagination.
  • For each change, apply idempotent upsert or delete.

## Pattern 3: Poll and Diff (Last Resort)

If there is no updated_at and no change feed, you can periodically fetch a list and diff.

This is expensive and can be brittle, but can work for small datasets.

Use it only when:

  • Total entities are low enough to fetch fully.
  • The API has stable IDs.
  • You can tolerate more load and latency.
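The diff itself can be a small pure function over two id-keyed maps of content hashes. A sketch; `diffSnapshots` and its shape are illustrative:

```javascript
// Diff two snapshots keyed by id. Values should be stable hashes of the
// fields you care about. Returns ids to upsert and ids to soft-delete.
function diffSnapshots(sourceById, targetById) {
  const upserts = [];
  const deletes = [];
  for (const [id, hash] of Object.entries(sourceById)) {
    if (targetById[id] !== hash) upserts.push(id); // new or changed at source
  }
  for (const id of Object.keys(targetById)) {
    if (!(id in sourceById)) deletes.push(id);     // gone from source
  }
  return { upserts, deletes };
}
```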

# Example Workflow Designs

These examples are intentionally generic so you can adapt them to HubSpot, Stripe, Shopify, internal CRMs, or custom APIs.

## Example A: Incremental Sync with Cursor Pagination and Upsert

Goal: Sync contacts from an API to Postgres with no misses or duplicates.

Workflow steps:

  1. Cron Trigger every 5 minutes.
  2. Get state from Postgres sync_state table by sync_name.
  3. Set query window: start_ts = watermark_ts - overlap.
  4. HTTP Request fetch page with updated_at >= start_ts and cursor.
  5. Filter using watermark tie-breaker logic.
  6. Upsert into Postgres using source_id unique constraint.
  7. Update dedup keys table if needed.
  8. Advance cursor or watermark only after successful upsert.
  9. Loop until next_cursor is empty.
  10. Emit metrics: processed, inserted, updated, duplicates, duration.

Suggested state table:

| Field | Example |
| --- | --- |
| sync_name | crm_contacts |
| watermark_ts | 2026-04-28T12:10:00Z |
| watermark_id | 983244 |
| cursor | eyJwYWdlIjoyfQ... |

## Example B: Webhook CDC with Dedup and Backfill

Goal: React to changes instantly, but still guarantee eventual consistency.

Workflow steps:

  1. Webhook Trigger receives contact.updated events.
  2. Verify signature using a shared secret.
  3. Dedup check in Postgres dedup_keys on event_id.
  4. Fetch entity from API to get full canonical state.
  5. Upsert into destination.
  6. Store dedup key with TTL.
  7. If a step fails, send the event to a retry workflow or queue.

Add a daily backfill incremental sync using Example A to catch webhook delivery gaps. This is common because even good providers occasionally drop or delay webhooks.

💡 Tip: Combine CDC with periodic reconciliation. Webhooks optimize latency, incremental polls guarantee completeness.

# Handling Deletes and Hard-to-Detect Changes

Many syncs look correct until someone deletes data.

## Delete Strategies

| Source capability | Best approach | Implementation idea in n8n |
| --- | --- | --- |
| Emits delete events | CDC delete handling | Webhook or changes feed triggers soft delete in target |
| Has deleted_at | Incremental with tombstones | Treat as update, set is_deleted |
| No delete signal | Reconciliation | Periodic full ID list compare and mark missing |

For many business systems, soft deletes are safer than hard deletes. Keep an is_deleted flag and a deleted_at timestamp in your destination.

# Monitoring Metrics to Catch Drift

If you do not measure drift, you will discover it from a finance report or a user complaint. Monitoring should be part of the workflow design, not an afterthought.

For observability fundamentals and metric naming, refer to Web App Observability Guide.

## Metrics That Matter for Data Sync

Track these per run and per sync:

| Metric | Type | Why it matters | Alert suggestion |
| --- | --- | --- | --- |
| records_fetched | counter | Detect source API changes | Sudden drop to near zero |
| records_processed | counter | Validate filter logic | Drop or spike compared to baseline |
| records_inserted | counter | Detect growth | Unexpected spikes |
| records_updated | counter | Detect churn | Sustained zero may indicate failure |
| duplicates_skipped | counter | Detect overlap and retries | Spike indicates instability |
| run_duration_seconds | histogram | Performance regression | Greater than 2x baseline |
| api_429_count | counter | Rate limiting | Any sustained 429s |
| watermark_lag_seconds | gauge | How far behind you are | Greater than SLA, like 900 seconds |
| error_count | counter | Reliability | Any non-zero with paging failures |

If you are using n8n self-hosted, push metrics to Prometheus via a lightweight endpoint, or log structured JSON and derive metrics in your log pipeline.

## Drift Detection Checks

Add a reconciliation workflow daily or weekly:

  • Compare counts by day: source updates per day vs target updates per day
  • Compare sample entity hashes: pick 100 random IDs and compare canonical fields
  • Compare min and max updated_at and the watermark lag

Example reconciliation query idea:

  • Source: number of contacts updated in last 24 hours
  • Target: number of rows with updated_at in last 24 hours

If source is 10,000 and target is 8,000, you have drift and need investigation.
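That check is easy to automate. A sketch of a tolerance-based drift test (the `hasDrift` helper and the 1 percent default are illustrative; tune the tolerance to your data):

```javascript
// Flag drift when the target count diverges from the source count by more
// than a relative tolerance (a fraction, e.g. 0.01 for 1 percent).
function hasDrift(sourceCount, targetCount, tolerance = 0.01) {
  if (sourceCount === 0) return targetCount !== 0; // avoid division by zero
  return Math.abs(sourceCount - targetCount) / sourceCount > tolerance;
}
```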

⚠️ Warning: Do not rely on “workflow succeeded” as a correctness signal. A workflow can succeed while silently skipping records due to filter or pagination bugs.

# Common Pitfalls and How to Avoid Them

  1. Using offset pagination on mutable datasets — switch to cursor pagination or stable sort by updated_at plus id.
  2. Advancing watermark before commit — only commit watermark after destination write success.
  3. Using timestamps without overlap — add an overlap window and tie-breaker to avoid missing late updates.
  4. No idempotency on writes — use upserts or idempotency keys so retries are safe.
  5. No durable state — store watermark and dedup in Postgres or Redis, not only in memory.
  6. No drift monitoring — add reconciliation and metric-based alerts early.

If you need robust retry and alert patterns in n8n, implement them using the approach in n8n Error Handling, Retries, and Alerting.

# Key Takeaways

  • Prefer cursor-based pagination or stable sorting by updated_at plus id to avoid skipped or duplicated pages.
  • Implement incremental loads with overlap and store a last committed watermark, not last fetched.
  • Make every write idempotent using upserts and idempotency keys so retries and replays are safe.
  • Persist deduplication keys in Postgres or Redis to survive restarts and prevent duplicate processing across runs.
  • Add monitoring and reconciliation using metrics like watermark lag, duplicates skipped, and daily count comparisons to catch drift early.

# Conclusion

A production-grade n8n data sync workflow is a state machine: deterministic pagination, cautious watermarks, idempotent writes, and durable dedup storage turn inevitable retries into safe replays.

If you want Samioda to design and implement a reliable sync for your CRM, billing system, or internal platform, contact us with your source and destination details and your required sync SLA. We will propose a workflow architecture, state storage, and monitoring plan that prevents drift and scales with volume.
