Apache Derby Troubleshooting: Defusing Stalls, Deadlocks, and Log Sync Bottlenecks in Enterprise Deployments
Apache Derby is a lightweight, embeddable relational database for the JVM that also runs as a standalone Network Server. Its portability makes it popular in enterprise middleware, OSGi containers, and edge applications where zero-admin databases are needed. Yet, teams often encounter elusive production freezes, lock storms, and log sync stalls that are hard to reproduce. This article unpacks a rarely discussed but high-impact scenario: cluster-wide request stalls and deadlocks triggered by Derby's page latch contention, transaction log fsync delays, and mis-tuned cache/lock settings under mixed embedded and Network Server usage. We will map symptoms to root causes, walk through diagnostics, and assemble pragmatic, long-term fixes tuned for senior architects and tech leads.
Background and Architectural Context
Derby implements a classic ARIES-inspired Write-Ahead Logging (WAL) storage engine with B-Tree indexes, page latching, and transaction locking. It is designed for small footprints and predictable durability rather than raw throughput. In many enterprises, Derby runs in two patterns:
- Embedded mode: applications load the Derby engine in-process. The JDBC URL looks like jdbc:derby:/path/to/db;create=true. Threads in the same JVM contend for the same engine resources.
- Network Server: a separate process exposes DRDA over TCP. Clients connect via jdbc:derby://host:1527/db. The server serializes access through the same storage engine while introducing a DRDA layer.
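To make the two topologies concrete, here is a minimal JDBC sketch in Java; it assumes derby.jar (embedded) and derbyclient.jar (network client) are on the classpath, and the paths, host, and port are the placeholder values from the URLs above.

// Minimal sketch: the two Derby access patterns side by side.
// Paths, host, and port are placeholders; derby.jar (embedded) and
// derbyclient.jar (network client) are assumed to be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyModes {
    public static void main(String[] args) throws SQLException {
        // Embedded: the engine runs inside this JVM and owns the database files.
        try (Connection embedded =
                 DriverManager.getConnection("jdbc:derby:/path/to/db;create=true")) {
            System.out.println("Embedded connection: " + embedded.getMetaData().getURL());
        }

        // Network client: a separate Network Server process owns the files
        // and exposes them over DRDA on port 1527.
        try (Connection client =
                 DriverManager.getConnection("jdbc:derby://host:1527/db")) {
            System.out.println("Client connection: " + client.getMetaData().getURL());
        }
    }
}

The same process should never hold both kinds of connection against one database directory; that topology is examined later in this article.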
Because Derby seeks to remain lightweight, default settings for page cache size, lock timeouts, and log sync are conservative. On modern SSDs or container volumes, these defaults can become a bottleneck. When multiple components (for example, a microservice in embedded mode for local tasks and a batch job through Network Server) hit the same database, contention can explode in non-obvious ways.
Why complex failures emerge in enterprise systems
- Mixed access patterns: long-running read transactions from reports intersect with short OLTP updates, increasing lock waits and deadlock cycles.
- Storage jitter: fsync latency spikes on networked or overlay filesystems (NFS, SMB, certain container storage drivers) cause log writes to stall the commit path.
- JVM pauses and thread starvation: GC or misconfigured thread pools increase latch hold times, amplifying apparent database "freezes".
- Under-provisioned caches: a small derby.storage.pageCacheSize translates into elevated physical reads and more frequent latch acquisition, magnifying contention.
Problem Statement
The production symptom is a periodic or sudden loss of throughput. Clients observe rising response times, timeouts, and exceptions such as SQLState=40001 (deadlock), SQLState=40XL1 (lock timeout), SQLState=XJ040 (database boot/connection failure), or anomalies like "No current connection" after a long stall. Monitoring shows CPU idle yet threads pile up. The Derby log records bursts of lock wait messages or long checkpoint intervals. On Network Server, DRDA worker threads appear blocked in storage calls. When embedded and Network Server access hit the same database files, contention spikes are especially severe.
Anatomy of the Failure
1) Page latch versus transaction locks
Derby uses latches (short critical sections) for page-level access and locks for transactional isolation. Under cache pressure, more page fetches occur, raising latch turnover. If a thread holding a latch stalls (GC or I/O), other threads back up. This often looks like a database-wide freeze although the lock table shows few true deadlocks.
2) WAL fsync and checkpoint cadence
Commits synchronously flush log buffers to disk. Whether Derby uses write sync or explicit file sync for the transaction log (influenced by derby.storage.fileSyncTransactionLog and the JVM's capabilities), the commit path depends on the storage layer honoring that flush promptly. On filesystems with bursty latency, synchronous log writes serialize many transactions. Checkpoints must also flush dirty pages; if the page cache is small or the dirty ratio is high, checkpointing takes longer, extending stalls.
3) Lock escalation and hot B-Tree pages
Skewed access on a popular key range creates hotspots in B-Tree pages. Combined with small node sizes and frequent splits under write load, latch contention spikes. Lock escalation attempts may worsen tail latency.
4) Embedded and Network Server on the same store
Sharing a database directory between an embedded engine and a Network Server process is not supported concurrently. Even with seemingly separate phases, leftover file locks or unexpected engine states can create corruption risks and prolonged recovery. Intermittent access patterns make this look like "random" stalls.
Diagnostics: From Symptom to Root Cause
Use a layered diagnostic runbook that correlates Derby's engine internals with OS and JVM signals.
Capture Derby error and lock diagnostics
# Enable detailed error logging and lock diagnostics (JVM properties, set before the engine starts)
-Dderby.stream.error.file=/var/log/derby/derby.log
-Dderby.locks.monitor=true          # log lock timeouts and deadlocks to derby.log
-Dderby.locks.deadlockTrace=true    # dump the lock table when a deadlock occurs

-- At runtime, inspect locks and transactions via the diagnostic tables
SELECT * FROM SYSCS_DIAG.LOCK_TABLE;
SELECT * FROM SYSCS_DIAG.TRANSACTION_TABLE;
-- rows in LOCK_TABLE with STATE = 'WAIT' are blocked lock requests
During a stall, the lock tables may show long waits on specific tables or indexes. Lack of deadlocks but high wait times hints at latching or I/O stalls rather than pure transaction conflicts.
Thread dumps to pinpoint latch holders
# Send a signal or use jcmd/jstack
jcmd <PID> Thread.print > /tmp/tdump.txt

# Look for Derby engine classes holding latches
org.apache.derby.impl.store.raw.data.BasePage
org.apache.derby.impl.store.raw.xact.Xact
org.apache.derby.impl.services.locks.*
org.apache.derby.impl.store.raw.log.*
If many threads show WAITING or BLOCKED around page access while one or two are RUNNABLE in I/O or fsync, you are likely facing log or page flush stalls.
Observe filesystem latency
# Linux block-device and filesystem metrics (requires privileges)
iostat -x 1
pidstat -d 1 -p <PID>
strace -f -tt -p <PID> -e trace=fdatasync,fsync,pwrite64

# Container volume checks
cat /proc/mounts
df -T /path/to/derbydb
Large gaps in fsync completion times correlate with commit stalls. On NFS or certain overlay drivers, p99 latencies may exceed 100 ms, which is disastrous for OLTP.
Check page cache sizing and dirty pressure
-- Display boot properties (if logged) and checkpoint intervals
-- Review derby.log for:
--   derby.storage.pageCacheSize
--   Checkpoint started/ended timestamps

-- Validate index health (SYSCS_CHECK_TABLE is a function, so invoke it with VALUES)
VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('APP', 'YOUR_TABLE');
Long checkpoints and frequent page evictions indicate insufficient cache. SYSCS_CHECK_TABLE reveals structural issues after crashes or split storms.
Network Server specific probes
# DRDA time-slice and keep-alive tuning (JVM properties)
-Dderby.drda.timeSlice=2000
-Dderby.drda.keepAlive=true

# At runtime, use network monitoring to confirm server backlog
ss -ltnp | grep 1527
netstat -an | grep 1527
If the server thread pool saturates while Derby engine threads are blocked on I/O, you'll see backlog spikes and client timeouts without proportional CPU use.
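If you suspect the DRDA layer rather than the storage engine, Derby's NetworkServerControl API can be used from a small probe. The sketch below pings the server and prints its runtime/session information; the host and port are placeholders and derbynet.jar is assumed to be on the classpath.

// Sketch: ping the Network Server and dump its runtime/session info.
// Host and port are placeholders; derbynet.jar must be on the classpath.
import java.net.InetAddress;
import org.apache.derby.drda.NetworkServerControl;

public class DrdaProbe {
    public static void main(String[] args) throws Exception {
        NetworkServerControl control =
            new NetworkServerControl(InetAddress.getByName("host"), 1527);
        control.ping();                               // throws if the DRDA layer is unreachable
        System.out.println(control.getRuntimeInfo()); // active sessions, threads, memory
    }
}

A healthy ping combined with stalled queries points at the storage engine or the filesystem rather than the network layer.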
Common Pitfalls That Amplify the Issue
- Relying on defaults: leaving derby.storage.pageCacheSize tiny on modern hardware creates unnecessary latching churn.
- Networked filesystems: running Derby's log and data on NFS/SMB without fsync guarantees leads to intermittent stalls and a risk of corruption on dropouts.
- Mixed embedded + Network Server: even "temporary" embedded access to a database actively served by Network Server is unsafe.
- Long read transactions from reports that hold shared locks for extended periods, blocking writers, stretching checkpoints, and forcing more page I/O during active OLTP.
- Hot key design: monotonically increasing primary keys cause right-hand B-Tree hot pages and latch contention.
Step-by-Step Remediation Guide
1) Stabilize logging and reduce fsync stalls
# Prefer local SSD for the log directory
# Set an explicit system home and ensure it is not on a flaky network mount
-Dderby.system.home=/data/derbyhome

# Consider buffered durability for read-mostly or cache-friendly workloads
# (understand the durability risk before enabling)
-Dderby.system.durability=test

# Keep default durability but align the OS and filesystem
mount -o noatime,nodiratime /dev/nvme0n1 /data
sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10
For strict durability, keep full fsync semantics but place logs on low-latency local disks. Only use derby.system.durability=test in non-critical scenarios or with application-level idempotency and replay.
2) Increase page cache and tune checkpoint cadence
# Increase cache: multiply working-set pages by concurrency
-Dderby.storage.pageCacheSize=20000
-Dderby.storage.pageSize=8192

# Encourage steadier checkpoints (property names may vary by version)
-Dderby.storage.logSwitchInterval=104857600
Raising the page cache reduces eviction churn and latch contention. Larger page sizes favor sequential scans but may increase IO per miss; test with your workload. Adjust log switch intervals so checkpoints are frequent but not overwhelming.
3) Eliminate mixed access and enforce single access mode
# Strategy: one process owns the database files at a time.
# If you need multi-client access, always go through the Network Server
# and never attach the same database with the embedded engine concurrently.
Split architectural responsibilities: embedded for isolated, private databases; Network Server for shared access. Avoid any accidental embedded usage by central services by auditing JDBC URLs.
4) Reduce lock contention via indexing and statement rewrites
-- Add a covering index to reduce the row-lock footprint
CREATE INDEX IDX_ORDERS_STATUS_DATE ON ORDERS(STATUS, ORDER_DATE, ORDER_ID);

-- Convert SELECT ... FOR UPDATE scans into point lookups
-- by ensuring predicates hit selective indexes

-- Break long transactions into smaller, idempotent steps
-- to shorten lock hold times
Covering and selective indexes keep scans short, reducing both lock durations and latch activity on hot pages. Reducing transaction scope is the fastest way to defuse deadlock chains.
5) Adjust lock and deadlock timeouts carefully
-- Set per-database properties via SQL (persisted with the database)
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.locks.deadlockTimeout', '10');
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.locks.waitTimeout', '30');
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.locks.escalationThreshold', '20000');

-- Verify
VALUES SYSCS_UTIL.SYSCS_GET_DATABASE_PROPERTY('derby.locks.deadlockTimeout');
Shorter deadlock detection reduces the time threads spend waiting in cycles, but too-aggressive timeouts increase retries and user-visible failures. Start conservative, measure, iterate.
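If you do tighten these timeouts, callers should retry transparently rather than surface every 40001 or 40XL1 to users. A minimal sketch of such a retry wrapper, with illustrative backoff values, follows.

// Sketch: retry a unit of work when Derby reports deadlock (40001)
// or lock timeout (40XL1). Backoff values are illustrative only.
import java.sql.Connection;
import java.sql.SQLException;

public final class RetryingTx {
    public static void run(Connection conn, SqlWork work, int maxAttempts) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                conn.setAutoCommit(false);
                work.execute(conn);
                conn.commit();
                return;
            } catch (SQLException e) {
                conn.rollback();
                String state = e.getSQLState();
                boolean retryable = "40001".equals(state) || "40XL1".equals(state);
                if (!retryable || attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(50L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    @FunctionalInterface
    public interface SqlWork {
        void execute(Connection conn) throws SQLException;
    }
}

Pair the wrapper with idempotent statements so a retried unit of work cannot apply twice.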
6) Mitigate hot B-Tree pages
-- Use randomized or hashed keys for write-heavy tables
-- instead of monotonically increasing IDs
-- (for example, store order IDs as ULIDs or UUIDs;
--  this is an application-side change, not a Derby setting)

-- Periodically rebuild fragmented indexes during maintenance
CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('APP', 'ORDERS', 1);
Hashing keys spreads inserts across index leaves. Compression helps after churn or unbalanced growth. Schedule during low-traffic windows.
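On the application side, the change is simply to generate the key before the insert; below is a small sketch with an illustrative ORDERS schema.

// Sketch: insert with a randomized key to spread writes across B-Tree leaves.
// Table and column names are illustrative.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class RandomizedKeyInsert {
    public static void insertOrder(Connection conn, String status) throws SQLException {
        String sql = "INSERT INTO ORDERS (ORDER_ID, STATUS, ORDER_DATE) VALUES (?, ?, CURRENT_TIMESTAMP)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, UUID.randomUUID().toString()); // random key, no right-edge hotspot
            ps.setString(2, status);
            ps.executeUpdate();
        }
    }
}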
7) JVM and thread-level hygiene
# GC: prefer G1 or ZGC (newer JVMs) for predictable pauses
-XX:+UseG1GC -XX:MaxGCPauseMillis=200

# Size DRDA thread pools to avoid starvation
-Dderby.drda.maxThreads=256
-Dderby.drda.timeSlice=2000
Shorter GC pauses reduce latch hold times. Right-sizing DRDA thread counts prevents request queuing when a subset of threads is blocked on IO.
8) Build an operational playbook
# On stall: gather evidence in parallel
jcmd <PID> VM.info
jcmd <PID> GC.class_histogram
jcmd <PID> Thread.print
echo w > /proc/sysrq-trigger   # dump blocked (D-state) tasks to the kernel log (Linux, root)

# Derby perspective (run via ij or your SQL client)
SELECT * FROM SYSCS_DIAG.LOCK_TABLE;
VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('APP', 'T');

# Filesystem / storage
iostat -x 1
pidstat -d 1 -p <PID>
The goal is to correlate engine state, JVM threads, and storage latency in time. Keep scripts ready; most issues are transient.
Safety, Recovery, and Data Integrity
Crash recovery and roll-forward
# Always stop the engine cleanly before copying the database
# To force recovery on next boot, ensure no leftover lock file
rm -f /path/to/db/db.lck   # only if the JVM is truly down

# Boot and inspect derby.log for recovery progress,
# then run structural checks per table
VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('APP', 'EACH_TABLE');
Let Derby's WAL recover committed transactions. Avoid manual file edits. If the engine repeatedly replays logs slowly, it is a sign of large checkpoints or a starved page cache; fix those before returning to production traffic.
Backups and consistent snapshots
-- Online backup to a safe location
CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE('/backups/derby');

-- Or freeze/unfreeze for external snapshots
CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE();
-- take the filesystem snapshot here
CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE();
Use the built-in backup or freeze to capture consistent images. On virtualized or container platforms, pair FREEZE with volume snapshots to minimize downtime.
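The same backup procedure can also be driven from a scheduled job in application code; a minimal JDBC sketch (the target directory is a placeholder) looks like this.

// Sketch: invoke Derby's online backup from application code.
// The backup directory is a placeholder and must be writable by the engine process.
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public class OnlineBackup {
    public static void backup(Connection conn, String targetDir) throws SQLException {
        try (CallableStatement cs =
                 conn.prepareCall("CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE(?)")) {
            cs.setString(1, targetDir);
            cs.execute(); // blocks until the backup copy completes
        }
    }
}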
Observability: Metrics and Tracing That Matter
While Derby lacks a built-in Prometheus exporter, you can derive useful signals by tailing derby.log and exposing custom metrics:
- Commit latency: wrap JDBC commits and emit histograms.
- Lock waits: periodically sample SYSCS_DIAG.LOCK_TABLE and export counts by table/index (see the sketch after this list).
- Checkpoint duration: parse log messages for start/end timestamps.
- I/O latency: OS-level, per-volume p99 fsync/pwrite latency.
- Thread states: JVM safepoint time, blocked/runnable counts.
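A small in-process sampler can cover the first two application-side signals. The sketch below times commits and counts waiting rows in SYSCS_DIAG.LOCK_TABLE, whose STATE column distinguishes granted from waiting requests; System.out stands in for whatever metrics library you use.

// Sketch: sample lock waits from SYSCS_DIAG.LOCK_TABLE and time commits.
// Metric export is left out; System.out stands in for your metrics pipeline.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DerbySignals {
    // Count lock requests currently in the WAIT state.
    public static int waitingLocks(Connection conn) throws SQLException {
        String sql = "SELECT COUNT(*) FROM SYSCS_DIAG.LOCK_TABLE WHERE STATE = 'WAIT'";
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getInt(1);
        }
    }

    // Wrap a commit and report its latency in milliseconds.
    public static void timedCommit(Connection conn) throws SQLException {
        long start = System.nanoTime();
        conn.commit();
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("derby.commit.latency_ms=" + millis);
    }
}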
Long-Term Architectural Solutions
Standardize on a single access topology
Pick Network Server for multi-client sharing; restrict embedded use to private, per-process databases. Enforce through dependency management and connection URL linting in CI so no component accidentally boots an embedded engine against shared data.
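A hedged sketch of such a lint check is below: it scans configuration files for embedded-style Derby URLs that point at a shared path. The shared-path prefixes and file extensions are assumptions to adapt to your repository layout.

// Sketch: CI check that flags embedded Derby URLs pointing at shared database paths.
// The shared-path prefixes and the set of files scanned are assumptions; adapt to your repo.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class DerbyUrlLint {
    // Embedded URLs look like jdbc:derby:/some/path, network URLs like jdbc:derby://host:port/db
    private static final Pattern EMBEDDED_SHARED =
        Pattern.compile("jdbc:derby:(?!//)(/shared/|/data/shared/)");

    public static void main(String[] args) throws IOException {
        try (Stream<Path> files = Files.walk(Path.of(args.length > 0 ? args[0] : "."))) {
            List<Path> offenders = files
                .filter(p -> p.toString().endsWith(".properties") || p.toString().endsWith(".yaml"))
                .filter(DerbyUrlLint::containsEmbeddedSharedUrl)
                .toList();
            if (!offenders.isEmpty()) {
                offenders.forEach(p -> System.err.println("Embedded URL to shared DB: " + p));
                System.exit(1); // fail the CI job
            }
        }
    }

    private static boolean containsEmbeddedSharedUrl(Path p) {
        try {
            return EMBEDDED_SHARED.matcher(Files.readString(p)).find();
        } catch (IOException e) {
            return false;
        }
    }
}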
Storage policy: local SSD only for logs
Adopt a platform standard where Derby's log directory resides on dedicated, low-latency local SSD. If the data directory must be on a slower or networked medium, keep WAL local and periodically back it up. Architect for power-loss safety with reliable hardware flush semantics.
Workload shaping
Segregate read-only reporting to a replicated copy or periodic snapshot. If true streaming is required, consider application-level CQRS where OLTP writes hit Derby and projections feed a reporting store. The objective is to keep long readers from holding locks for extended periods and elongating checkpoints.
Schema patterns that avoid hotspots
- Use randomized keys (UUID/ULID) for write-heavy tables.
- Introduce composite indexes that front-load selective predicates used by SELECT ... FOR UPDATE.
- Add narrow, covering indexes for hot read endpoints to minimize row visits.
Operational SLOs and capacity planning
Define SLOs for p95 commit latency and lock wait percentage. Tie them to explicit capacity reserves: minimum page cache size, maximum concurrent writers, and storage p99 fsync budget. Create a "performance budget" document per service that depends on Derby.
Pitfalls to Avoid During Migration or Upgrades
- Blind upgrades: Derby versions can change default page sizes or behavior under the covers. Read release notes and retune page cache and log switches after upgrade.
- Container defaults: some base images mount the app on overlayfs; move Derby's directories to a hostPath or block-backed volume with guaranteed fsync.
- Clock skew: log time drift complicates forensics. Ensure NTP is configured to correlate Derby logs with system metrics.
Performance Tuning Recipes
Recipe A: OLTP microservice on SSD
-Dderby.storage.pageSize=8192
-Dderby.storage.pageCacheSize=30000
-Dderby.storage.fileSyncTransactionLog=true
-Dderby.locks.deadlockTimeout=8
-Dderby.locks.waitTimeout=20
-XX:+UseG1GC -XX:MaxGCPauseMillis=150
Target sub-10 ms commit latency with fast fsync. Monitor checkpoint duration; keep it under a few seconds.
Recipe B: Read-heavy analytics with relaxed durability
-Dderby.system.durability=test
-Dderby.storage.pageCacheSize=60000
-Dderby.locks.waitTimeout=10
Only use if you can replay writes or tolerate loss after a crash. Gains include lower log pressure and smoother checkpoints.
Recipe C: Network Server under bursty load
-Dderby.drda.maxThreads=512
-Dderby.drda.timeSlice=1500
-Dderby.drda.keepAlive=true
-Dderby.storage.pageCacheSize=40000
Ensure the OS has enough file descriptors and the server box has CPU headroom. Pair with connection pool backpressure in clients.
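One way to implement that backpressure, assuming HikariCP as the client-side pool (a choice of this article's sketch, not something Derby prescribes), is to cap the pool well below the server's thread budget and fail fast when it is exhausted.

// Sketch: client-side backpressure with a bounded connection pool.
// HikariCP is an assumption, not a Derby requirement; host, port, and sizes are illustrative.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ClientPool {
    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:derby://host:1527/db");
        cfg.setMaximumPoolSize(32);        // well below derby.drda.maxThreads
        cfg.setConnectionTimeout(2_000);   // fail fast instead of queueing indefinitely
        return new HikariDataSource(cfg);
    }
}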
Testing and Reproduction Strategy
Reproducing production stalls requires orchestrated contention and I/O jitter:
- Jepsen-style chaos: inject fsync latency using tc netem on networked storage or ionice throttling on local disks.
- Mixed workload: run a write-heavy benchmark plus a long read transaction that scans a large table without committing. Observe lock waits rising.
- Index split pressure: seed data with increasing keys, then hammer inserts to trigger repeated right-hand page splits.
Example harness
# Pseudo-code sketch
Writer:
  while true:
    INSERT INTO ORDERS(...) VALUES(...);
    if rnd() < 0.1: COMMIT;

Reader:
  BEGIN;
  SELECT * FROM ORDERS WHERE STATUS='OPEN' ORDER BY ORDER_DATE;  -- long scan
  sleep(60);
  COMMIT;  -- holds locks/latches longer

I/O jitter:
  ionice -c3 -p <pid>; stress-ng --hdd 1 --timeout 60s
Observe how commit latency and lock waits behave with and without an increased pageCacheSize and under different durability settings.
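The pseudo-code translates into a small JDBC harness along these lines; the ORDERS schema, URL, and timings are illustrative, and the table must exist before running.

// Sketch: minimal contention harness, one write-heavy thread plus one long reader.
// The ORDERS schema, URL, and timings are illustrative; create the table before running.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.UUID;

public class ContentionHarness {
    static final String URL = "jdbc:derby://host:1527/db";

    public static void main(String[] args) {
        new Thread(ContentionHarness::writer).start();
        new Thread(ContentionHarness::longReader).start();
    }

    static void writer() {
        try (Connection c = DriverManager.getConnection(URL)) {
            c.setAutoCommit(false);
            PreparedStatement ps = c.prepareStatement(
                "INSERT INTO ORDERS (ORDER_ID, STATUS, ORDER_DATE) VALUES (?, 'OPEN', CURRENT_TIMESTAMP)");
            for (int i = 0; ; i++) {
                ps.setString(1, UUID.randomUUID().toString());
                ps.executeUpdate();
                if (i % 10 == 0) {
                    c.commit(); // commit roughly every tenth insert, as in the pseudo-code
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static void longReader() {
        try (Connection c = DriverManager.getConnection(URL)) {
            c.setAutoCommit(false);
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT * FROM ORDERS WHERE STATUS = 'OPEN' ORDER BY ORDER_DATE")) {
                while (rs.next()) { /* drain the long scan */ }
                Thread.sleep(60_000); // hold the open transaction, as in the pseudo-code
            }
            c.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}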
Governance and Runbook Artifacts
For long-lived systems, encode Derby operational knowledge into versioned artifacts:
- Property baseline per application, checked into repo.
- Runbook with "stall triage" steps (thread dump, lock queries, IO stats commands).
- Post-incident template mapping observed symptoms to actions (cache increase, index change, storage migration).
Conclusion
Apache Derby can deliver dependable, low-ops persistence when its engine characteristics are respected. The thorny production failures discussed here arise from the intersection of page latching, WAL fsync behavior, and workload or topology decisions. By placing logs on low-latency storage, right-sizing page cache, avoiding mixed embedded/Network Server access, and curbing lock contention through schema and query design, teams can convert intermittent stalls into predictable performance. Wrap these measures with observability and a practiced triage playbook, and Derby becomes a quiet, efficient workhorse instead of a mysterious bottleneck.
FAQs
1. Why do I see deadlock errors even when my workload is mostly reads?
Long-running read transactions can still hold shared locks and prolong latch hold times, causing writers to wait and increasing the chance of deadlock cycles involving meta operations or index maintenance. Break scans into smaller units and ensure selective indexes reduce scan length.
2. Is it safe to run embedded and Network Server against the same database files?
No. Concurrent access is unsupported and risks corruption or prolonged recovery. Use only Network Server for shared access and prevent embedded drivers from pointing at shared directories via configuration and code reviews.
3. How big should derby.storage.pageCacheSize be?
Size it to keep the hot working set in memory under peak. As a starting point, allocate enough pages to hold your top N hot tables' indexes and the hottest data pages, then iterate using checkpoint duration and eviction rates as feedback.
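As a rough illustration of the memory cost (figures are illustrative): with derby.storage.pageSize=8192, a derby.storage.pageCacheSize of 20,000 pages pins about 20,000 × 8 KiB ≈ 160 MB of heap for data and index pages, so size the JVM heap with that overhead in mind.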
4. Can I disable fsync to improve performance?
Setting derby.system.durability=test reduces commit latency but risks data loss on a crash. Only consider it with strong idempotency and replay guarantees; otherwise, keep fsync and move logs to fast local storage.
5. What's the single fastest mitigation during a live incident?
If storage is healthy, increasing derby.storage.pageCacheSize and restarting during a maintenance window often yields immediate relief by reducing latch turnover. In parallel, identify and index the hottest predicates to shrink lock durations.