# What You’ll Build
This guide describes a production-ready pipeline for inbound PDFs and images using n8n document processing automation: capture documents, run OCR, classify the document type, extract fields, validate results, route to CRM and accounting tools, and maintain a full audit trail.
The goal is not a demo workflow. It’s a pipeline you can run daily with predictable accuracy, human review guardrails, and error handling that won’t silently corrupt financial data.
# Architecture Overview
A production pipeline should separate compute-heavy steps, business rules, and downstream integrations. n8n orchestrates the steps, while specialized tools handle OCR and extraction.
## High-level flow
1. Inbound documents arrive from email, upload form, SFTP, or cloud storage.
2. Pre-process files and normalize input to images and text.
3. OCR and layout parsing.
4. Classification to pick the right extraction strategy.
5. Field extraction to structured JSON.
6. Validation and confidence scoring.
7. Human review for exceptions and low-confidence results.
8. Routing to systems of record: accounting, CRM, ticketing, storage.
9. Audit trail and monitoring.
## Components and responsibilities
| Component | Recommended options | Why it matters |
|---|---|---|
| Orchestrator | n8n self-hosted or n8n cloud | Centralized workflow, retries, approvals, logging |
| OCR | Google Document AI, AWS Textract, Azure Form Recognizer, Tesseract | OCR quality drives downstream accuracy |
| Storage | S3-compatible, Google Drive, SharePoint | Immutable source file storage for audit |
| Database | Postgres (recommended) | Persistent state, dedup, audit events, queues |
| Human review | Slack, Teams, Email approvals | Fast exception handling with traceability |
| Downstream | Xero, QuickBooks, NetSuite, HubSpot, Salesforce | Automated posting and routing |
ℹ️ Note: In production, avoid relying only on n8n’s internal execution data as your audit log. Keep a separate persistent audit trail in Postgres or your data warehouse so you can prove what happened even if workflow history is purged.
# Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| n8n | 1.40+ | Use queue mode for reliability under load |
| Node.js | 18+ | For self-hosted deployments |
| Postgres | 14+ | For state, dedup, audit, and review queue |
| Object storage | Any | S3, GCS, Azure Blob, or SharePoint |
| OCR provider | Any | Pick based on language, cost, and SLA |
| Access tokens | CRM + accounting | Use scoped credentials and least privilege |
If you also want LLM-based extraction or retrieval of vendor metadata, read n8n AI agents and RAG workflows to design the AI parts safely.
# Step 1: Inbound Capture and Normalization
Inbound is where most production issues start: duplicate emails, corrupted PDFs, password-protected invoices, multi-attachment threads, and low-quality images.
## Supported inbound channels
| Channel | n8n nodes | Typical use |
|---|---|---|
| Email inbox | IMAP Email, Gmail, Microsoft Outlook | Vendors emailing invoices and receipts |
| Upload portal | Webhook, Form tool | Internal staff uploading scanned documents |
| Cloud folder | Google Drive, OneDrive, S3 | “Drop folder” model |
| SFTP | SFTP node | Legacy supplier integrations |
## Normalize early
Normalize everything into a consistent internal representation:
- `document_id` as your own UUID
- original file stored immutably
- per-page images for OCR
- metadata like sender, subject, received timestamp, source channel
A simple pattern is: store the original, then create derived artifacts in a folder structure like:
- `raw/`: original PDF or image
- `rendered/`: PNG pages
- `ocr/`: JSON response
- `extracted/`: normalized JSON
- `audit/`: validation and routing events
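Example: create a document record and keep the file attached for storage

```javascript
// n8n Function node: give each incoming file its own document record.
// On self-hosted n8n, the built-in crypto module must be allowed
// (NODE_FUNCTION_ALLOW_BUILTIN=crypto).
const crypto = require('crypto');

return items.map(item => ({
  json: {
    // Generate the UUID per item so multi-attachment emails get distinct IDs.
    document_id: crypto.randomUUID(),
    source: item.json.source || 'email',
    received_at: new Date().toISOString(),
    filename: item.json.filename,
  },
  // Pass the binary through unchanged so the next node can store the original.
  binary: item.binary,
}));
```

⚠️ Warning: Do not treat the email subject or vendor invoice number as a unique ID. Duplicates happen frequently. Always deduplicate using a content hash plus key metadata.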
## Deduplication guardrail
Compute a SHA-256 of the original file bytes and store it in Postgres with a unique constraint. If the same file arrives again, short-circuit with an “already processed” event.
| Dedup key | Pros | Cons |
|---|---|---|
| Content hash | Strong and simple | Fails if vendor re-sends a slightly modified PDF |
| Hash plus sender plus date | Fewer false positives | Requires tuning |
| Invoice number plus vendor | Business meaningful | Unreliable if extraction fails |
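The content-hash guardrail can be sketched as follows. This is a minimal sketch, assuming the original file bytes are available as a Node.js `Buffer`. The helper names `contentHash` and `registerDocument` are hypothetical, and the in-memory `Set` only stands in for the Postgres unique constraint described above.

```javascript
const crypto = require('crypto');

// Hypothetical helper: SHA-256 of the original file bytes, hex-encoded.
function contentHash(fileBytes) {
  return crypto.createHash('sha256').update(fileBytes).digest('hex');
}

// Hypothetical helper: the Set stands in for a unique constraint in Postgres.
// Returns the hash and whether the same bytes were already registered.
function registerDocument(seenHashes, fileBytes) {
  const hash = contentHash(fileBytes);
  const duplicate = seenHashes.has(hash);
  if (!duplicate) seenHashes.add(hash);
  return { hash, duplicate };
}

// Usage: the second arrival of identical bytes is flagged as a duplicate.
const seen = new Set();
const first = registerDocument(seen, Buffer.from('invoice bytes'));
const second = registerDocument(seen, Buffer.from('invoice bytes'));
console.log(first.duplicate, second.duplicate);
```

In a real workflow the duplicate branch would write the “already processed” audit event and stop, rather than continue to OCR.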
# Step 2: Pre-processing for OCR Quality
OCR performance depends heavily on input quality. Common improvements include deskew, DPI normalization, and page splitting.
## Practical rules that improve accuracy
- Convert PDFs to 300 DPI images for OCR when the PDF is a scan.
- For digital PDFs, extract text directly first and use OCR only as fallback.
- If images are photographed, deskew and increase contrast.
## OCR routing strategy: digital-first
Use this decision:
- If PDF has embedded text above a threshold, skip OCR and use extracted text.
- Else run OCR.
A reasonable threshold is embedded_text_length greater than 200 characters per page.
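The digital-first decision could look like the sketch below. It assumes an earlier step has already produced one string of embedded text per PDF page, and it interprets the threshold as an average across pages, which is an assumption rather than something the rule above prescribes. The function name `chooseTextSource` is hypothetical.

```javascript
// Threshold from the text above: more than 200 embedded characters per page.
const MIN_CHARS_PER_PAGE = 200;

// Assumed input: one string of embedded text per page (empty string if none).
// Returns 'embedded_text' when the PDF looks digital, otherwise 'ocr'.
function chooseTextSource(pageTexts, minCharsPerPage = MIN_CHARS_PER_PAGE) {
  if (pageTexts.length === 0) return 'ocr';
  const totalChars = pageTexts.reduce((sum, text) => sum + text.trim().length, 0);
  const averagePerPage = totalChars / pageTexts.length;
  return averagePerPage > minCharsPerPage ? 'embedded_text' : 'ocr';
}
```

For mixed PDFs (some digital pages, some scanned), a per-page decision may be a better fit than the average used here.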
# Step 3: OCR and Layout Parsing
Pick an OCR provider with stable latency, good invoice support, and transparent confidence scores. For EU invoices in multiple languages, cloud OCR services usually outperform local OCR on layout detection.
## Provider comparison
| Provider | Strengths | Trade-offs |
|---|---|---|
| Google Document AI | Strong invoice parsers, good layout | Cost and GCP lock-in |
| AWS Textract | Solid forms and tables | Table extraction can be noisy on scans |
| Azure Form Recognizer | Good general extraction | Model tuning takes effort |
| Tesseract | Cheap, local, controllable | Lower accuracy on messy scans |
## Recommended OCR output schema
Persist OCR output as JSON with:
- per-page text
- bounding boxes
- per-token confidence if available
- detected language
- provider request ID
This is essential for audit and later debugging when a vendor disputes an amount.
💡 Tip: Store the OCR provider request ID in your audit log. It makes it much faster to open a support ticket with the provider and prove a specific bad OCR run.
# Step 4: Classification (Invoice, Receipt, PO, Contract)
Classification determines which extraction strategy runs and which validation rules apply.
## Practical classification methods
| Method | When to use | Implementation idea |
|---|---|---|
| Rule-based keywords | Stable templates | Search OCR text for “invoice”, “vat”, “total” |
| ML or LLM classifier | Many layouts | LLM returns one label plus confidence |
| Vendor-based mapping | Known vendors | Map sender domain to vendor and document type |
A robust approach is a hybrid:
1. If the sender is a known vendor, use vendor mapping.
2. Else use the rule-based quick classifier.
3. Else use the LLM classifier, but require confidence plus validation later.
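The first two steps of the hybrid approach can be sketched as a single function. The vendor mapping, the keyword rules, and the name `classify` are all hypothetical examples; the LLM step is left as a `needs_llm` result so the workflow can branch to it.

```javascript
// Hypothetical vendor mapping: sender domain -> document type.
const VENDOR_DOC_TYPES = new Map([
  ['billing.example-supplier.com', 'invoice'],
]);

// Hypothetical keyword rules, checked in order (most specific first).
const KEYWORD_RULES = [
  { label: 'purchase_order', keywords: ['purchase order'] },
  { label: 'invoice', keywords: ['invoice', 'vat'] },
  { label: 'receipt', keywords: ['receipt'] },
];

function classify(senderEmail, ocrText) {
  // Step 1: known vendor mapping wins.
  const domain = String(senderEmail).split('@').pop().toLowerCase();
  if (VENDOR_DOC_TYPES.has(domain)) {
    return { doc_type: VENDOR_DOC_TYPES.get(domain), method: 'vendor_mapping' };
  }
  // Step 2: rule-based keyword match on the OCR text.
  const text = String(ocrText).toLowerCase();
  for (const rule of KEYWORD_RULES) {
    if (rule.keywords.some(keyword => text.includes(keyword))) {
      return { doc_type: rule.label, method: 'keywords' };
    }
  }
  // Step 3: nothing matched, so defer to the LLM classifier (not shown).
  return { doc_type: null, method: 'needs_llm' };
}
```

Returning the `method` alongside the label makes it easy to log how each routing choice was made.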
## Minimal classifier prompt strategy
If you use an LLM, you want short input, deterministic output, and a confidence score. Only send the first page text plus a few key lines like totals and headers, not the entire OCR.
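One way to keep the classifier input short is sketched below. It assumes per-page OCR text is available as an array of strings. The pattern for “key lines”, the size limits, and the name `buildClassifierInput` are illustrative choices, not fixed recommendations.

```javascript
// Hypothetical pattern for lines worth sending besides the first page.
const KEY_LINE_PATTERN = /\b(total|subtotal|vat|invoice|receipt|purchase order)\b/i;

// Builds a compact classifier input: truncated first page plus a few key lines
// taken from the remaining pages.
function buildClassifierInput(pageTexts, maxFirstPageChars = 1500, maxKeyLines = 5) {
  const firstPage = (pageTexts[0] || '').slice(0, maxFirstPageChars);
  const keyLines = pageTexts
    .slice(1)
    .flatMap(text => text.split('\n'))
    .map(line => line.trim())
    .filter(line => KEY_LINE_PATTERN.test(line))
    .slice(0, maxKeyLines);
  return { first_page: firstPage, key_lines: keyLines };
}
```

The returned object can be serialized into the prompt; the model call itself and its output parsing are outside this sketch.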
# Step 5: Field Extraction to Structured JSON
Extraction should output a stable schema regardless of provider. Downstream accounting tools want predictable fields.
## Recommended normalized schema
| Field | Type | Notes |
|---|---|---|
| `doc_type` | string | invoice, receipt, purchase_order |
| `vendor_name` | string | Prefer vendor master match |
| `vendor_vat_id` | string | Validate by country rules |
| `invoice_number` | string | Required for invoices |
| `invoice_date` | date | ISO format |
| `due_date` | date | Optional |
| `currency` | string | ISO 4217 |
| `net_amount` | number | Validate math |
| `vat_amount` | number | Validate math |
| `gross_amount` | number | Validate math |
| `iban` | string | Optional |
| `line_items` | array | Quantity, unit price, tax |
| `confidence` | object | Per-field confidence |
## Extraction strategies
1. Template extraction for known vendors.
2. Parser extraction using OCR provider invoice models.
3. LLM extraction for semi-structured invoices, but only with strong validation.
A production pattern is to try them in that order and stop at the first that passes validation with an acceptable confidence score.
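The “try in order, stop at the first acceptable result” pattern can be sketched like this. Extractors are synchronous stubs here to keep the example short; real extractors would usually be asynchronous calls. The names `extractWithFallback`, `run`, and `validate`, and the 0.8 confidence floor, are assumptions for illustration.

```javascript
// Runs extractors in priority order and returns the first result that passes
// validation with enough confidence. Every attempt is recorded for the audit log.
function extractWithFallback(document, extractors, validate, minConfidence = 0.8) {
  const attempts = [];
  for (const { name, run } of extractors) {
    let fields = null;
    try {
      fields = run(document);
    } catch (err) {
      attempts.push({ name, ok: false, reason: `error: ${err.message}` });
      continue;
    }
    if (!fields) {
      attempts.push({ name, ok: false, reason: 'no result' });
      continue;
    }
    const { valid, confidence } = validate(fields);
    if (valid && confidence >= minConfidence) {
      attempts.push({ name, ok: true });
      return { fields, strategy: name, attempts };
    }
    attempts.push({ name, ok: false, reason: 'failed validation or low confidence' });
  }
  // Nothing passed: the caller should route the document to human review.
  return { fields: null, strategy: null, attempts };
}
```

Keeping the `attempts` list means the audit trail can show which strategies were tried and why each one was rejected.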
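Example: normalize numbers and dates

```javascript
// n8n Function node: normalize amounts and dates on every item.
function toNumber(v) {
  if (v === null || v === undefined || v === '') return null;
  let s = String(v).replace(/\s/g, '');
  if (s.lastIndexOf(',') > s.lastIndexOf('.')) {
    s = s.replace(/\./g, '').replace(',', '.'); // "1.234,56": comma is the decimal separator
  } else {
    s = s.replace(/,/g, ''); // "1,234.56": commas are thousands separators
  }
  const n = Number(s);
  return Number.isNaN(n) ? null : n;
}
function toISODate(v) {
  if (!v) return null;
  const s = String(v).trim();
  if (/^\d{4}-\d{2}-\d{2}/.test(s)) return s.slice(0, 10); // already ISO: no timezone shift
  const d = new Date(s); // non-ISO strings are typically parsed as local time
  if (isNaN(d.getTime())) return null;
  const pad = n => String(n).padStart(2, '0');
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}
for (const item of items) {
  for (const field of ['net_amount', 'vat_amount', 'gross_amount']) {
    item.json[field] = toNumber(item.json[field]);
  }
  item.json.invoice_date = toISODate(item.json.invoice_date);
}
return items;
```

# Step 6: Validation and Accuracy Guardrails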
Validation is what makes the pipeline safe. Your goal is not “extract something”, it’s “extract something correct enough to post to accounting”.
## Core validation rules for invoices
| Rule | Example | Action if fails |
|---|---|---|
| Required fields present | invoice number, date, vendor, gross | Send to review |
| Amount math checks | net + vat = gross within tolerance | Review or re-extract |
| Currency recognized | EUR, USD, GBP | Review |
| VAT ID format | HR12345678901 | Review |
| IBAN checksum | HR1210010051863000160 | Review |
| Duplicate invoice | same vendor + invoice number | Block and alert |
Use a tolerance for rounding. A common rule is abs((net + vat) - gross) less than 0.02 in currency units for two-decimal currencies.
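Two of the rules above, the amount math check and the IBAN checksum, can be sketched as small pure functions. The names `amountsMatch` and `isValidIban` are hypothetical. The IBAN check implements only the generic mod-97 test and a rough length check; it does not verify country-specific lengths.

```javascript
// Amount check: |(net + vat) - gross| must be below the tolerance.
function amountsMatch(net, vat, gross, tolerance = 0.02) {
  if (![net, vat, gross].every(Number.isFinite)) return false;
  return Math.abs((net + vat) - gross) < tolerance;
}

// IBAN mod-97 check: move the first four characters to the end, map letters
// to numbers (A=10 ... Z=35), and require the remainder modulo 97 to be 1.
function isValidIban(input) {
  const iban = String(input).replace(/\s/g, '').toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  const digits = rearranged.replace(/[A-Z]/g, ch => String(ch.charCodeAt(0) - 55));
  return BigInt(digits) % 97n === 1n;
}
```

`BigInt` is used because the rearranged IBAN is far larger than the largest integer a JavaScript number can represent exactly.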
## Confidence scoring that works in practice
Compute a document-level confidence score from field confidences and validation outcomes.
One practical scoring model:
- Start at 100
- Subtract 30 for missing required field
- Subtract 20 for math mismatch
- Subtract 10 for low OCR quality
- Subtract 10 per critical field confidence below 0.8
Then decide:
- Score 85 to 100: auto-route
- Score 60 to 84: route with human review required
- Score below 60: re-run OCR or extraction with fallback, then review
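The scoring model and the three decision bands above translate into a short function. This sketch assumes the 30-point penalty applies once if any required field is missing (the list above does not say whether it is per field), and the input shape and the name `scoreDocument` are hypothetical.

```javascript
// Input shape is an assumption: results collected by the validation step.
function scoreDocument({
  missingRequired = [],      // names of missing required fields
  mathMismatch = false,      // net + vat does not match gross
  lowOcrQuality = false,     // OCR step flagged poor quality
  criticalConfidences = {},  // e.g. { gross_amount: 0.72 }
} = {}) {
  let score = 100;
  if (missingRequired.length > 0) score -= 30; // assumption: applied once
  if (mathMismatch) score -= 20;
  if (lowOcrQuality) score -= 10;
  for (const confidence of Object.values(criticalConfidences)) {
    if (confidence < 0.8) score -= 10;
  }
  score = Math.max(0, score);

  let decision = 'fallback_then_review'; // below 60
  if (score >= 85) decision = 'auto_route';
  else if (score >= 60) decision = 'human_review';
  return { score, decision };
}
```

Note that the score is only one gate: the review triggers listed in Step 7 can still force a human review regardless of the score.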
🎯 Key Takeaway: Treat validation failures as a signal to change the processing path, not as an error to ignore. Most “automation disasters” happen when invalid data is still pushed to accounting.
## Error handling and retries
OCR APIs and SaaS integrations fail. Build retries with backoff and alerts.
Use a consistent pattern: retry transient errors, stop on hard validation failures, and raise alerts when manual intervention is needed. For implementation patterns, use n8n error handling, retries, and alerting.
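The “retry transient errors, stop on hard failures” pattern can be sketched as below for custom code paths. It assumes errors carry a numeric HTTP-style `status` property and treats 408, 429, and 5xx as transient; both are assumptions, as are the names `isTransient`, `backoffDelayMs`, and `withRetry`. Where a node's built-in Retry On Fail setting is enough, prefer that over custom code.

```javascript
// Assumption: 408, 429 and 5xx responses are worth retrying.
function isTransient(status) {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff without jitter: base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retries `call` on transient errors. `sleep` is injectable so the wrapper
// can be exercised without real waiting.
async function withRetry(call, options = {}) {
  const maxAttempts = options.maxAttempts || 4;
  const sleep = options.sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const lastAttempt = attempt >= maxAttempts - 1;
      // Hard failures and exhausted retries bubble up so the caller can alert.
      if (!isTransient(err.status) || lastAttempt) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Adding random jitter to the delay is a common refinement when many workers retry against the same provider at once.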
# Step 7: Human Review Workflow (Fast, Auditable, and Safe)
Human-in-the-loop is not a compromise. It is how you reach high accuracy while still saving time.
## When to require review
- Any missing required field
- Any math mismatch
- Any low-confidence critical field
- Any new vendor not in master data
- Any duplicate suspicion
- Any “high-risk” invoices above a threshold, for example above 5000 EUR
## How to implement approvals in n8n
Use an approval step that sends the extracted fields plus a link to the original PDF and OCR overlay.
A practical review message includes:
- vendor name
- invoice number
- date
- gross amount
- VAT amount
- suggested account code and cost center
- an “Approve” path and a “Request changes” path
For implementation details across Slack, Teams, and email, follow n8n approval workflows.
## Review queue data model
Persist review tasks in Postgres, not just in Slack threads.
| Column | Type | Purpose |
|---|---|---|
| `task_id` | UUID | Review task identifier |
| `document_id` | UUID | Link to document |
| `status` | string | pending, approved, rejected |
| `assigned_to` | string | Email or user ID |
| `payload` | JSON | Extracted fields and validation |
| `decision` | JSON | Approved edits, comments |
| `created_at` | timestamp | SLA tracking |
| `decided_at` | timestamp | Audit |
This enables SLAs, dashboards, and a complete audit trail.
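A review task matching the columns above might be created and closed as sketched here. The helper names `createReviewTask` and `decideReviewTask` are hypothetical, and persistence to Postgres is left out; the functions only build the row values.

```javascript
const crypto = require('crypto');

// Builds a new pending task with the columns from the table above.
function createReviewTask(documentId, payload, assignedTo, now = new Date()) {
  return {
    task_id: crypto.randomUUID(),
    document_id: documentId,
    status: 'pending',
    assigned_to: assignedTo,
    payload,
    decision: null,
    created_at: now.toISOString(),
    decided_at: null,
  };
}

// Records a decision exactly once; a task that is already decided is rejected
// so the audit trail never shows two conflicting outcomes.
function decideReviewTask(task, status, decision, now = new Date()) {
  if (task.status !== 'pending') {
    throw new Error(`Task ${task.task_id} is already ${task.status}`);
  }
  if (!['approved', 'rejected'].includes(status)) {
    throw new Error(`Invalid status: ${status}`);
  }
  return { ...task, status, decision, decided_at: now.toISOString() };
}
```

The gap between `created_at` and `decided_at` gives the review turnaround time used for SLA tracking.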
# Step 8: Routing to CRM and Accounting Tools
Routing is where you realize the business value: invoices go to accounting, receipts to expenses, purchase orders to procurement, and customer documents to CRM.
## Routing matrix
| Document type | Route | Example integration |
|---|---|---|
| Supplier invoice | Accounting AP | Xero bills, QuickBooks bills, NetSuite vendor bill |
| Receipt | Expense tool | Expensify, Rydoo, or accounting expense |
| Purchase order | Procurement | ERP module or ticketing |
| Customer order form | CRM | HubSpot deal note, Salesforce attachment |
| Contract | DMS | SharePoint, Google Drive with metadata |
## Posting strategy for accounting
Post in three steps:
1. Create a draft bill with extracted header fields.
2. Add line items, tax codes, and attachments.
3. Mark as “ready for approval” or keep as draft depending on policy.
If the accounting system supports it, include your internal document_id in a custom field to ensure traceability and deduplication.
⚠️ Warning: Never post directly to “approved” or “paid” states from automation unless you also have strong controls, role-based access, and separate approvals. Keep automation in draft-first mode for most teams.
# Step 9: Audit Trails and Observability
Audit trails are non-negotiable for finance and compliance. You should be able to answer: what was received, what was extracted, who approved it, what changed, and what was sent to downstream systems.
## What to log for each document
| Audit event | Stored fields | Why |
|---|---|---|
| Inbound received | source, sender, timestamps, file hash | Proves origin and dedup |
| OCR completed | provider, request ID, confidence, duration | Debugging and SLA |
| Classified | label, confidence, method | Explains routing choice |
| Extracted | extracted JSON, per-field confidence | Reproducibility |
| Validated | rule results, score, failures | Justifies review requirement |
| Reviewed | reviewer, edits, decision, timestamps | Compliance |
| Routed | destination IDs, payload, status | Traceability |
| Errors | stack trace, retry count, final status | Reliability |
## Metrics that matter
Track metrics weekly and improve the system based on evidence:
- straight-through processing rate, percentage processed without review
- review rate by vendor and by document type
- field-level accuracy for critical fields
- average processing time end-to-end
- cost per document, including OCR and manual time
A baseline target many finance teams aim for is 60 to 80 percent straight-through for invoices after vendor mapping and template tuning, with the remaining routed to review. Your actual number depends on document variability and scan quality.
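The first two metrics can be computed from per-document records as sketched below. The record shape (`vendor`, `reviewed`) and the name `weeklyMetrics` are assumptions; in practice the inputs would come from the audit and review tables.

```javascript
// Assumed input: one record per processed document, e.g.
// { vendor: 'Acme', reviewed: true }.
function weeklyMetrics(documents) {
  const total = documents.length;
  const reviewed = documents.filter(doc => doc.reviewed).length;

  // Count totals and reviews per vendor.
  const perVendor = new Map();
  for (const doc of documents) {
    const stats = perVendor.get(doc.vendor) || { total: 0, reviewed: 0 };
    stats.total += 1;
    if (doc.reviewed) stats.reviewed += 1;
    perVendor.set(doc.vendor, stats);
  }
  const reviewRateByVendor = {};
  for (const [vendor, stats] of perVendor) {
    reviewRateByVendor[vendor] = stats.reviewed / stats.total;
  }

  return {
    total,
    // Share of documents processed without any human review.
    straight_through_rate: total === 0 ? null : (total - reviewed) / total,
    review_rate_by_vendor: reviewRateByVendor,
  };
}
```

Vendors with a persistently high review rate are the natural candidates for template tuning.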
# Production Hardening Checklist
## Security and privacy
- Encrypt files at rest in object storage.
- Restrict n8n credentials to least privilege.
- Mask sensitive fields in logs where possible.
- Use a separate service account per integration.
## Workflow design for reliability
- Use queue mode and separate workers for OCR-heavy flows.
- Add timeouts for external API calls.
- Implement idempotency keys on accounting posts.
- Always store intermediate artifacts before calling downstream systems.
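For the idempotency item in the list above, one option is to derive a deterministic key from the internal `document_id` and the target operation, so a retried post always carries the same key. This is a sketch under that assumption; the name `idempotencyKey` is hypothetical, and how a given accounting API accepts such a key (header, custom field, or reference number) varies by provider.

```javascript
const crypto = require('crypto');

// Same document + destination + operation always yields the same key, so a
// retry after a timeout can be recognized as a repeat instead of a new bill.
function idempotencyKey(documentId, destination, operation) {
  return crypto
    .createHash('sha256')
    .update([documentId, destination, operation].join('|'))
    .digest('hex');
}
```

Storing the key together with the returned external ID also lets the workflow check for an existing post before sending a new one.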
## Fallback strategies
| Failure type | First response | Fallback |
|---|---|---|
| OCR timeout | Retry with backoff | Switch provider or reduce DPI |
| Low OCR confidence | Re-render pages | Force OCR with stronger settings |
| Extraction fails | Try alternate extractor | Human review with OCR text |
| Downstream API error | Retry if transient | Create a ticket and pause routing |
# Example n8n Workflow Outline (Nodes and Flow)
This section gives you a practical “node map” you can implement quickly.
## Main workflow nodes
| Stage | Node examples | Output |
|---|---|---|
| Inbound | IMAP Email Trigger or Webhook | Binary file + metadata |
| Persist raw | S3 or Drive upload + Postgres insert | document_id, file URL, hash |
| Text detection | PDF extract or OCR decision Function | route to OCR or parse text |
| OCR | HTTP Request to provider | OCR JSON |
| Classification | Function + optional LLM call | doc_type, confidence |
| Extraction | Function + optional OCR model parsing | normalized JSON |
| Validation | Function + Postgres checks | validation report + score |
| Human review | Slack/Teams/Email approval | decision + edits |
| Routing | Accounting/CRM nodes | external IDs |
| Audit | Postgres insert events | audit record |
## Minimal audit event insert

```sql
-- Postgres node query example
INSERT INTO document_audit_events
  (document_id, event_type, event_payload, created_at)
VALUES
  ($1, $2, $3::jsonb, NOW());
```

Use parameters for `$1`, `$2`, `$3` from n8n to avoid string concatenation issues.
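The three parameter values for an insert like the one above could be assembled in a Function node along these lines. The event type names are an assumption derived from the audit event table in Step 9, and `buildAuditParams` is a hypothetical helper.

```javascript
// Assumed event type names, one per row of the audit event table in Step 9.
const EVENT_TYPES = new Set([
  'inbound_received', 'ocr_completed', 'classified', 'extracted',
  'validated', 'reviewed', 'routed', 'error',
]);

// Returns the values for $1, $2, $3 in order. The payload is serialized to a
// JSON string so it can be cast with ::jsonb in the query.
function buildAuditParams(documentId, eventType, payload) {
  if (!EVENT_TYPES.has(eventType)) {
    throw new Error(`Unknown audit event type: ${eventType}`);
  }
  return [documentId, eventType, JSON.stringify(payload)];
}
```

Rejecting unknown event types early keeps the audit table queryable by a fixed, known set of values.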
# Key Takeaways
- Design n8n document processing automation as a pipeline with persistent state: raw file, OCR output, extracted JSON, validation, routing, and audit events.
- Use a digital-first OCR strategy: extract embedded PDF text when available, run OCR only when needed, and store OCR provider request IDs for support.
- Combine classification plus validation: classification picks an extraction path, but validation decides whether the result is safe to route automatically.
- Implement human review as a first-class feature with a Postgres-backed review queue, tracked decisions, and edits stored as audit events.
- Route downstream in draft-first mode for accounting, and enforce idempotency keys to prevent duplicate bills and corrupted ledgers.
- Treat error handling as part of the product: build retries, alerts, and fallbacks using patterns from n8n error handling.
# Conclusion
A production-ready document pipeline is not “OCR plus an API call”. It’s OCR, classification, extraction, deterministic validation, safe routing, and an audit trail that stands up to finance and compliance scrutiny.
If you want Samioda to implement this end-to-end, including vendor template tuning, approval workflows, and accounting integrations, contact us via our automation services and we’ll help you ship a reliable n8n document processing automation system in weeks, not months.