Background and Architecture

CouchDB stores JSON documents with a multiversion concurrency control (MVCC) model. Each update creates a new revision; conflicts are allowed and resolved at the application layer. In clustered CouchDB, each database is split into Q shards, each replicated N times across nodes, with W and R defining the write and read quorums. Data and view indexes are maintained per shard, with background compaction reclaiming space. The HTTP API exposes operational surfaces for replication, indexing, and scheduler jobs. Understanding how these pieces fit together is essential to interpreting production symptoms correctly.

Key components and flows

  • Shard groups & quorum: Each database is split into Q logical shards, replicated N times across nodes. Writes require W acknowledgments and reads require R; with the stock N=3, both quorums default to a majority (R=W=2). A concrete example of creating a database with explicit q and n follows this list.
  • Replication: Incremental, checkpointed streams reading the _changes feed and writing revisions with conflict preservation.
  • Indexing: Design documents define map/reduce or Mango indexes. Indexers are per-shard and can lag behind writes under load.
  • Compaction: Database and view compaction rewrite data to remove tombstones and old revisions; without it, disk bloat and read amplification mount.
  • Schedulers: The background job scheduler coordinates replications, index builds, and compactions; starvation or misconfiguration here often looks like data staleness.
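
To make the shard/quorum bullet concrete, here is a minimal sketch that creates a database with explicit sharding parameters and then inspects the resulting shard map (the database name and values are illustrative):

# Create a database with 8 shards and 3 replicas, then list its shard ranges
curl -sX PUT 'http://admin:pass@router:5984/orders?q=8&n=3'
curl -s http://admin:pass@router:5984/orders/_shards | jq '.shards | keys'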

Symptoms and What They Usually Mean

  • Replica drift or missing documents across nodes: Replication backlog, scheduler starvation, shard unavailability, or quorum misreads.
  • Exploding disk usage: Compaction disabled or starved, high conflict rates, frequent large-document updates, or attachment churn.
  • Slow queries or stale views: View index lag on specific shards, design doc hot code paths, reduce explosion, or Mango index mismatch.
  • Frequent 409 Conflict: Legitimate MVCC concurrency, or application retry loops writing stale revs; clock skew in sync clients can produce the same signature.
  • High write latency spikes: fsync storms, small-IO write amplification on certain filesystems, or Erlang scheduler contention.
  • Node flapping in membership: Epmd/distribution issues, hostname/DNS drift, firewall timeouts, or incompatible cookie/cluster keys.

First-Response Triage Checklist

Capture these artifacts immediately to preserve state for later correlation:

  • Cluster and node health endpoints: _membership, _node/{node-name}/_stats, _active_tasks, _scheduler/jobs.
  • Per-database shard info: _dbs_info, _shards, _revs_limit.
  • File system saturation: iowait, disk usage per DB and per index directory.
  • Erlang VM runtime metrics: reductions/sec, run queue lengths, memory by allocator.

Quick commands worth memorizing

# Cluster membership
curl -s http://admin:pass@router:5984/_membership | jq .

# What is CouchDB busy with?
curl -s http://admin:pass@router:5984/_active_tasks | jq .
curl -s http://admin:pass@router:5984/_scheduler/jobs | jq '.jobs | map({id: .id, db: .database, node: .node})'

# Node runtime stats
curl -s http://admin:pass@router:5984/_node/_local/_stats | jq .

# Database health and shards
curl -sX POST http://admin:pass@router:5984/_dbs_info -H 'Content-Type: application/json' -d '{"keys":["orders","users"]}' | jq .
curl -s http://admin:pass@router:5984/orders/_shards | jq .

# Changes feed sample (most recent changes first)
curl -s 'http://admin:pass@router:5984/orders/_changes?descending=true&limit=10' | jq .

Deep Diagnostics

1) Replication lag and drift

Symptoms include source and target DBs disagreeing on doc counts or revisions while the scheduler still reports the replication job as running. Drill down:

  • Check the replication scheduler for stalled jobs (status = "crashed" or frequent retries).
  • Inspect the _changes feed on the source for throughput and sequence progression.
  • Examine per-shard connectivity: a single unavailable shard blocks complete convergence.
# List replicator DB entries
curl -s 'http://admin:pass@router:5984/_replicator/_all_docs?include_docs=true' | jq '[.rows[] | .doc | {id: ._id, state: ._replication_state, err: (._replication_state_reason // null)}]'

# Observe live changes with heartbeats
curl -N 'http://admin:pass@router:5984/orders/_changes?feed=continuous&heartbeat=30000'

If changes trickle yet the target does not advance, look for write quorum issues on the target (e.g., N=3, W=2 with a node down). The replicator may retry indefinitely if the target cannot meet W, producing apparent "lag" without clear errors.
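
To tell genuine backlog apart from a quietly failing job, the per-document scheduler endpoint is usually more revealing than the job list; a quick sketch:

# Per-replication-document state, including error counts
curl -s http://admin:pass@router:5984/_scheduler/docs | jq '.docs | map({doc_id, state, error_count})'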

2) Conflict storms

Frequent 409 responses or ballooning conflicted leaf count indicate multiple writers racing without revision awareness, mobile sync clients with divergent histories, or poor merge logic. Conflicts are not errors in CouchDB—they are data facts that your code must reconcile. Nevertheless, unmanaged conflict load stresses storage and slows replications.

# Count conflicted docs with a throwaway design doc view (_temp_view was removed in CouchDB 2.x)
curl -sX PUT http://admin:pass@router:5984/orders/_design/conflicts -H 'Content-Type: application/json' \
  -d '{"views":{"all":{"map":"function(doc){ if(doc._conflicts){ emit(doc._id, doc._conflicts.length) } }","reduce":"_count"}}}'
curl -s 'http://admin:pass@router:5984/orders/_design/conflicts/_view/all?reduce=true'

# Inspect a single doc's rev tree
curl -s 'http://admin:pass@router:5984/orders/abc123?conflicts=true&revs=true' | jq .

High conflict ratios often trace back to PUTs without conditional headers, or clients re-posting stale documents. Enforce If-Match with the current _rev and implement deterministic merges server-side to prune conflict trees.
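
After merging, the losing revisions still need to be deleted or they keep replicating. A minimal pruning sketch for a hypothetical document abc123 (merge its content into the winning revision first):

# Delete all conflicting revisions of abc123 in one bulk request
LOSERS=$(curl -s 'http://admin:pass@router:5984/orders/abc123?conflicts=true' \
  | jq -c '[(._conflicts // [])[] | {_id: "abc123", _rev: ., _deleted: true}]')
curl -sX POST http://admin:pass@router:5984/orders/_bulk_docs \
  -H 'Content-Type: application/json' -d "{\"docs\": $LOSERS}"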

3) Index lag & stale views

Map/reduce indexers run per-shard and can fall behind. A single slow shard makes requests appear randomly stale. Examine _active_tasks for view builds and view compaction, and avoid include_docs=true wherever the index alone can answer the query.

# Touch the view to trigger/verify its index build (limit=0 returns no rows)
curl -s 'http://admin:pass@router:5984/orders/_design/analytics/_view/by_customer?limit=0' -H 'Accept: application/json'
curl -s http://admin:pass@router:5984/_active_tasks | jq 'map(select(.type=="indexer"))'

# Mango index listing
curl -s http://admin:pass@router:5984/orders/_index | jq .

Large reduce functions or map functions emitting unbounded keys amplify rebuild costs. Favor _count reduces and bounded keyspaces; consider Mango with targeted fields and selectors when appropriate.
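
Where a full map/reduce view is overkill, a targeted Mango index plus a narrow selector keeps the keyspace bounded; a sketch assuming a customer_id field:

# Create a JSON index on customer_id, then query it with a narrow selector
curl -sX POST http://admin:pass@router:5984/orders/_index -H 'Content-Type: application/json' \
  -d '{"index":{"fields":["customer_id"]},"name":"by-customer","type":"json"}'
curl -sX POST http://admin:pass@router:5984/orders/_find -H 'Content-Type: application/json' \
  -d '{"selector":{"customer_id":"C1042"},"fields":["_id","status"],"limit":25}'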

4) Compaction debt and disk bloat

Append-only storage requires regular compaction. Without it, tombstones, old revisions, and obsolete view btrees persist. Detect compaction debt by comparing DB sizes.active vs sizes.file and monitoring _active_tasks for compactor progress.

# DB sizes
curl -s http://admin:pass@router:5984/orders | jq '{db_name,sizes}'

# Start manual compaction if the automatic policy is inadequate (both endpoints require a JSON Content-Type)
curl -sX POST http://admin:pass@router:5984/orders/_compact -H 'Content-Type: application/json'
curl -sX POST http://admin:pass@router:5984/orders/_view_cleanup -H 'Content-Type: application/json'

If compaction never finishes, you likely have insufficient IO bandwidth or the compactor is starved by higher-priority tasks. Throttle high-volume writers, increase disk throughput, or reschedule compactions during off-peak windows. For extremely large DBs, shard splitting and data archiving reduce future compaction windows.
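
To turn the size fields above into a rough fragmentation signal (thresholds are workload-dependent; treat this only as a first-pass indicator):

# Fragmentation estimate: share of the file not occupied by live data
curl -s http://admin:pass@router:5984/orders \
  | jq '{db: .db_name, fragmentation: (1 - (.sizes.active / .sizes.file))}'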

5) Cluster instability and node flapping

Nodes that repeatedly leave and rejoin membership cause timeouts, shard unavailability, and scheduler churn. Verify consistent Erlang distribution cookies, stable DNS/hostnames, and firewall rules for intra-cluster ports.

# Membership & node liveness
curl -s http://admin:pass@router:5984/_membership | jq .
curl -s http://admin:pass@nodeA:5984/_up

# Check the Erlang cookie on each node (path varies by packaging)
grep -i cookie /opt/couchdb/etc/vm.args

Clock skew also matters: checkpoint comparisons, compaction timestamps, and operational logs become misleading when NTP is misconfigured. Enforce NTP on all nodes and network gear.
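
A crude way to spot skew, assuming SSH access and hostnames nodeA through nodeC (substitute your inventory, or query chrony/ntp directly):

# Print each node's idea of UTC epoch seconds; differences beyond a second or two warrant attention
for h in nodeA nodeB nodeC; do printf '%s: ' "$h"; ssh "$h" date -u +%s; done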

6) Filesystem and durability tuning

Write latency cliffs often trace to fsync semantics and small-IO patterns. On Linux, XFS or ext4 with sane mount options perform well. If you relax commit behavior (delayed_commits on CouchDB 2.x and earlier), make sure the disks have battery-backed cache or NVMe power-loss protection.

# Durability and background-maintenance settings (local.ini)
[couchdb]
delayed_commits = false

[compactions]
_default = [{db_fragmentation, "30%"}, {view_fragmentation, "30%"}]

[replicator]
; the replication scheduler is configured under [replicator]; 500 concurrent jobs is the default
max_jobs = 500
max_churn = 20

Only consider delayed_commits=true after a risk assessment; it reduces fsync frequency but risks loss on power failure, and the option was removed in CouchDB 3.0. For high-ingest workloads, isolate the data directory on low-latency storage and separate OS/swap from CouchDB data paths.
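
Before moving directories, confirm where a node currently keeps data and view indexes via the node-local config API (a sketch; adjust node addresses to your cluster):

# Read the current data and view index directories from the config API
curl -s http://admin:pass@nodeA:5984/_node/_local/_config/couchdb/database_dir
curl -s http://admin:pass@nodeA:5984/_node/_local/_config/couchdb/view_index_dir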

Root Causes and How to Validate Them

Application-level revision misuse

Clients writing without If-Match or fetching, modifying, and PUTting after long delays create conflicts by design. Validate by auditing server logs for PUTs lacking the current _rev and measuring conflict ratios per endpoint.
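
One way to quantify the problem is a per-database conflict ratio, reusing the conflicts view defined earlier (a rough sketch):

# Conflicted docs versus total docs for one database
CONFLICTED=$(curl -s 'http://admin:pass@router:5984/orders/_design/conflicts/_view/all?reduce=true' | jq '.rows[0].value // 0')
TOTAL=$(curl -s http://admin:pass@router:5984/orders | jq .doc_count)
echo "orders: $CONFLICTED conflicted of $TOTAL docs"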

Overly broad map/reduce and reduce explosion

Dynamic keys with high cardinality plus reduce=true can produce heavy intermediate trees. Confirm by instrumenting view build times and comparing shard-level index file sizes. Consider splitting views or moving aggregation to analytics pipelines where feasible.
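
To compare shard-level index sizes on disk, something like the following works; the data-directory layout shown is an assumption that varies by version and packaging:

# Per-shard view index sizes for the orders database (path layout is an assumption)
du -sh /opt/couchdb/data/.shards/*/orders.*_design/mrview/*.view 2>/dev/null | sort -h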

Scheduler starvation

When too many jobs compete, long-running replications or indexers monopolize slots. Validate using _scheduler/jobs and increase max_jobs, or partition workloads across separate clusters.
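
A quick sketch for spotting replication jobs that have crashed recently according to the scheduler's job history:

# Replication jobs with a crash in their recent scheduler history
curl -s http://admin:pass@router:5984/_scheduler/jobs \
  | jq '.jobs | map(select([.history[].type] | index("crashed"))) | map({id, database})'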

Shard imbalance

A subset of nodes bearing most hot shards leads to skewed load. Use _dbs_info and cluster metadata to map shard placements. Rebalance by adding nodes and creating new DBs with higher Q (followed by controlled data migration) or by reseeding the cluster and re-creating the database with desired parameters.
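
A quick way to see how one database's shard replicas are spread across nodes, sketched over the _shards response:

# Count shard replicas per node for the orders database
curl -s http://admin:pass@router:5984/orders/_shards \
  | jq '[.shards[][]] | group_by(.) | map({node: .[0], replicas: length})'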

Step-by-Step Fixes

1) Make writes revision-aware

Require clients to send If-Match with the latest _rev or use optimistic concurrency with conditional updates.

# Safe update with If-Match
REV=$(curl -s http://admin:pass@router:5984/orders/abc123 | jq -r ._rev)
curl -sX PUT http://admin:pass@router:5984/orders/abc123 -H "If-Match: $REV" \
  -H 'Content-Type: application/json' -d '{"_rev":"'$REV'","status":"shipped"}'

For mobile sync, implement deterministic conflict resolution (e.g., CRDT-style merges or last-write-wins with server timestamps) and periodically purge historical conflicts to contain revision trees.

2) Tame compaction debt

Automate compactions using fragmentation thresholds. Validate that compactions complete within maintenance windows; if not, reduce data size per DB or increase IO.

# Example compaction settings (local.ini)
[compactions]
orders = [{db_fragmentation, "25%"}, {view_fragmentation, "25%"}, {from, "02:00"}, {to, "05:00"}]

# Monitor compaction progress
watch -n 5 'curl -s http://admin:pass@router:5984/_active_tasks | jq "map(select(.type==\"database_compaction\"))"'

3) Accelerate and stabilize indexing

Refactor views with bounded keys and use partial indexes when Mango suffices. Reserve update=false (the successor to stale=ok) for read paths that tolerate slight staleness.

// Example design doc (map plus a built-in reduce)
{
  "_id":"_design/analytics",
  "views":{
    "by_customer":{
      "map":"function(doc){ if(doc.type===\u0027order\u0027 && doc.customer_id){ emit(doc.customer_id, 1) } }",
      "reduce":"_sum"
    }
  }
}

# Query with bounded key range
curl -s 'http://admin:pass@router:5984/orders/_design/analytics/_view/by_customer?startkey="C1000"&endkey="C1999"&reduce=true'

4) Cure scheduler starvation

Increase job slots cautiously and consider dedicating nodes or clusters for heavy index builds.

# local.ini
[replicator]
max_jobs = 1000
max_churn = 50

Alternatively, stagger large replications and compactions to avoid contention. Use the replicator DB to schedule fewer, larger replications instead of many tiny ones.

5) Repair cluster membership and network hygiene

Pin consistent hostnames, verify Erlang cookies, and enforce firewall rules. Use TCP keepalives to prevent idle connection drops across L4 devices.

# Example sysctl (Linux)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5

6) Filesystem/IO optimization

Prefer XFS or ext4 with noatime, adequate journal configuration, and queues sized for your SSD/NVMes. Separate data and view index directories if the storage backend allows independent performance tuning.

# Mount example
UUID=... /var/lib/couchdb xfs defaults,noatime,nodiratime 0 0

7) Observability primitives

Emit request IDs and latency histograms at the reverse proxy and correlate with _active_tasks. Export key CouchDB stats to your TSDB (writes per sec, read/write latency, compactions running, view build times, conflicts/sec, disk usage).

# Prometheus scrape via exporter (example metric names)
couchdb_httpd_requests_per_sec
couchdb_database_reads
couchdb_database_writes
couchdb_open_os_files
couchdb_compactions_running
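
If no exporter is in place yet, the same counters can be pulled straight from the stats endpoint; the field paths below follow the stock _stats layout but are worth verifying against your version:

# Pull a few headline counters directly from a node's stats
curl -s http://admin:pass@router:5984/_node/_local/_stats \
  | jq '{requests: .couchdb.httpd.requests.value, db_reads: .couchdb.database_reads.value, db_writes: .couchdb.database_writes.value, open_os_files: .couchdb.open_os_files.value}'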

Performance & Cost Optimization

Right-size quorum and replication factors

Choose N, W, R based on failure domains and read/write SLAs. For high-write pipelines, W=1 may be acceptable with asynchronous replication to reach N=3 durability; for stricter guarantees, raise W but ensure IO and latency budgets accommodate it.

Document and attachment strategy

Large binary attachments inflate compaction and replication traffic. Offload them to object storage when possible and store only metadata in CouchDB, or use atts_since when fetching documents so clients do not re-download attachment bodies they already hold. Chunk attachments via clients that support streaming.
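
A sketch of the metadata-only pattern, with an illustrative object-storage key:

# Store a pointer to the blob rather than the blob itself (field names and s3_key are illustrative)
curl -sX PUT http://admin:pass@router:5984/orders/o1-invoice -H 'Content-Type: application/json' \
  -d '{"type":"attachment_ref","content_type":"application/pdf","length":482133,"s3_key":"invoices/2024/o1.pdf"}'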

Design for bounded revision trees

Set _revs_limit per database to keep trees from growing unbounded, then compact.

curl -sX PUT http://admin:pass@router:5984/orders/_revs_limit -d '30'

Batching and bulk APIs

Use _bulk_docs with new_edits=false judiciously for replication/migration; for normal writes, _bulk_docs reduces HTTP overhead but still respects MVCC. Measure server CPU and disk IO when adjusting batch sizes.

# Bulk write example
curl -sX POST http://admin:pass@router:5984/orders/_bulk_docs -H 'Content-Type: application/json' -d '{"docs":[{"_id":"o1","type":"order"},{"_id":"o2","type":"order"}]}'

Common Pitfalls (and Safer Patterns)

  • Relying on eventual index update for consistency: Views and Mango indexes are eventually consistent per-shard. For must-be-fresh reads, rely on the default update=true behavior (costly) or design flows that read by document ID after writes.
  • Unbounded ad-hoc (temporary) views: They are expensive, not cached across shards, and removed entirely in CouchDB 2.x. Promote them to design docs or move analytics elsewhere.
  • Randomized reduce logic: Non-deterministic reduce creates unstable results and continuous rebuilds.
  • Ignoring attachment deduplication patterns: Re-uploading identical blobs in different docs creates unnecessary storage pressure.
  • One giant database: Operational blast radius and compaction windows become unmanageable. Prefer logical partitioning.

Security and Compliance Considerations

Enable per-node TLS and strict admin/user separation. Audit _users and _security objects, avoid admin party, and scrub logs for PII. Apply role-based validation functions in design docs to enforce write rules, and use per-database API keys/roles for least privilege.

// Example validate_doc_update (simplified; validate functions may only throw forbidden/unauthorized,
// and stale _rev writes are already rejected with 409 before validation runs)
function(newDoc, oldDoc, userCtx) {
  if (!userCtx.roles || userCtx.roles.indexOf('orders-writer') === -1)
    throw({forbidden: 'insufficient role'});
  if (!newDoc._deleted && newDoc.type !== 'order')
    throw({forbidden: 'unexpected document type'});
}
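
Pairing the validate function with a per-database _security object keeps access scoped to roles (the role names are illustrative):

# Restrict the orders database to specific roles
curl -sX PUT http://admin:pass@router:5984/orders/_security -H 'Content-Type: application/json' \
  -d '{"admins":{"names":[],"roles":["orders-admin"]},"members":{"names":[],"roles":["orders-reader","orders-writer"]}}'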

Disaster Recovery and Data Repair

Hot failover

Maintain extra replicas and test node eviction/rejoin procedures. Keep _replicator definitions declarative and source-controlled to rebuild pipelines quickly.
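
A minimal declarative job kept in source control might look like the following (hostnames and credentials are placeholders):

# One _replicator document per pipeline, created idempotently by deployment tooling
curl -sX PUT http://admin:pass@router:5984/_replicator/orders-to-dr -H 'Content-Type: application/json' \
  -d '{"source":"http://admin:pass@router:5984/orders","target":"http://admin:pass@dr:5984/orders","continuous":true}'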

Rebuilding a corrupted shard

If a shard file or its btree is corrupt, isolate the node, snapshot the other replicas, and reseed the bad replica via filtered replication from a healthy copy. Always validate with _all_docs?limit=0 counts and by sampling doc-by-doc diffs.

# Quick cardinality check
curl -s 'http://admin:pass@nodeA:5984/orders/_all_docs?limit=0' | jq .total_rows
curl -s 'http://admin:pass@nodeB:5984/orders/_all_docs?limit=0' | jq .total_rows
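
The reseed itself can be a one-shot _replicate call pulled from the healthy copy; the selector below is illustrative and can be dropped to copy everything:

# Pull from the healthy node into the repaired one
curl -sX POST http://admin:pass@router:5984/_replicate -H 'Content-Type: application/json' \
  -d '{"source":"http://admin:pass@nodeB:5984/orders","target":"http://admin:pass@nodeA:5984/orders","selector":{"type":"order"}}'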

Change Management and Testing

Stage config changes and design doc deployments. Index rebuilds are expensive; deploy new design docs alongside old ones, warm them by replaying a slice of production traffic, then switch queries. For cluster topology changes, run synthetic load and replication drills before and after.
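
A common rollout pattern, sketched with a hypothetical analytics-v2 design document:

# Deploy the new design doc alongside the old one, then warm its index before switching readers
curl -sX PUT http://admin:pass@router:5984/orders/_design/analytics-v2 -H 'Content-Type: application/json' -d @analytics-v2.json
curl -s 'http://admin:pass@router:5984/orders/_design/analytics-v2/_view/by_customer?limit=0'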

Conclusion

Most severe CouchDB incidents are not "database bugs" so much as the side effects of MVCC semantics, shard/quorum mechanics, and background maintenance competing for IO. Senior teams win by engineering for determinism: revision-aware writes, bounded indexes, automated compaction, predictable scheduler capacity, and observability that correlates HTTP paths with shard-level work. Treat conflicts as first-class data, keep compaction ahead of debt, and right-size quorums to your failure model. With these practices, CouchDB remains a durable, scalable backbone for JSON-centric systems—online and offline.

FAQs

1. Why do I see documents "missing" on one node but present on others?

Reads may hit a replica whose shard is behind or temporarily unavailable, especially with R=1. Either increase read quorum, repair the lagging replica, or direct reads by document ID to a healthy shard until convergence.

2. How can I reduce frequent 409 Conflict responses?

Make clients revision-aware with If-Match, shorten read→modify→write cycles, and implement server-side merge logic. Periodically prune conflicts after merge and run compaction to shrink rev trees.

3. My view index never catches up—what now?

Profile the view definition for key cardinality and reduce complexity, increase indexer job slots, and ensure compaction is not starving IO. Consider Mango or pre-aggregated counters for the hot queries.

4. Is delayed_commits=true safe for production?

Only with robust power loss protection (BBU/NVMe PLP) and after a risk review, and only on CouchDB 2.x or earlier (the option was removed in 3.0). It lowers fsync frequency but can lose the last few writes on sudden power failure.

5. When should I split a giant database into multiple DBs?

When compaction windows exceed maintenance budgets, hot shards dominate a subset of nodes, or operational blast radius becomes unacceptable. Partition by tenant, geography, or lifecycle to keep shards and indexes bounded.