Background: How Fossil's Architecture Shapes Failure Modes
Fossil bundles a DVCS, a web UI, and project-management features over a single SQLite repository. Its model emphasizes immutable artifacts, manifests, and a timeline rendered directly from the repo. Sync uses an efficient, delta-aware protocol over HTTP(S) to exchange missing artifacts. This integrated architecture yields powerful operational benefits: zero external database dependencies, portable backups, and a consistent, auditable history. At the same time, certain enterprise conditions—reverse proxies, aggressive TLS middleboxes, slow or lossy WANs, and file-system snapshots—interact with Fossil's design in surprising ways.
Key architectural facts relevant to troubleshooting:
- Single-file repository: A SQLite DB (*.fossil) contains everything. Concurrency and locking semantics therefore inherit SQLite's rules.
- Autosync: By default, client operations (e.g., commit, update) may trigger a push/pull. Firewalls or proxies that mutate HTTP requests can break autosync.
- Hash policy: Modern Fossil defaults to SHA3-256; legacy repos may use SHA1. Mixed environments introduce verification warnings or sync refusals until policy aligns.
- Integrated services: Tickets, wiki, forum, and unversioned content share the same repository. Misconfigured permissions or backups may inadvertently expose data.
- Self-hosted HTTP: Fossil can run as a standalone server, via CGI/SCGI, or behind a reverse proxy. Each deployment path has distinct timeout, buffering, and auth behaviors that affect sync; the sketch below shows the three invocation modes.
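For orientation, the three deployment paths look roughly like this; the binary location, ports, and repository paths are illustrative, not prescribed:
# Standalone HTTP server (illustrative paths and ports)
fossil server /var/lib/fossil/repo.fossil --port 8080
# SCGI backend behind a reverse proxy
fossil server /var/lib/fossil/repo.fossil --scgi --port 9000
# CGI: a two-line script executed by the web server
#!/usr/bin/fossil
repository: /var/lib/fossil/repo.fossil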
Architecture and Deployment Patterns: Where Enterprises Get Bitten
Load Balancers and Reverse Proxies
Many enterprises place Fossil behind NGINX/Apache/ATS or an L7 appliance that terminates TLS and forwards to an internal Fossil process or CGI. Problems arise when:
- Chunked encoding is rebuffered or disabled, delaying request bodies and causing timeouts during large syncs.
- Idle timeouts are too short for rebuild or heavy push operations.
- Sticky sessions are disabled, sending incremental sync requests to different backends and invalidating stateful assumptions.
Network Security Layers
Inline DLP/IDS systems sometimes rewrite or block POSTs with binary payloads, triggering sync errors that masquerade as authentication failures. TLS inspection can also break client certificate auth or downgrade ciphers that Fossil expects.
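One quick check for TLS interception is to compare the certificate issuer seen on a direct connection with the one seen through the corporate path; the hostnames below are illustrative:
# Issuer seen when connecting directly to the backend
openssl s_client -connect fossil-backend.internal:443 -servername fossil-backend.internal </dev/null 2>/dev/null | openssl x509 -noout -issuer
# Issuer seen through the inspected path
openssl s_client -connect fossil.example.com:443 -servername fossil.example.com </dev/null 2>/dev/null | openssl x509 -noout -issuer
# A corporate CA as issuer on the second connection usually indicates interception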
Storage, Snapshots, and Backups
Because a Fossil repo is an active SQLite DB, inconsistent snapshots or copy-on-write clones taken mid-transaction can produce integrity errors. VM or container snapshots without quiescing the file may require a rebuild on restore.
Diagnostics: From Symptom to Root Cause
Symptom 1: "database is locked" or long stalls under write load
When multiple writers push to the same server repo, short-duration locks are normal; persistent lock errors indicate contention or operational hazards (slow I/O, antivirus scans, or backup agents holding read handles). Look for these patterns:
- Lock only during certain cron windows (backup agent collision).
- Stalls correlated with large pushes from remote sites (WAN jitter).
- Locks triggered by web UI browsing of heavy timeline pages while a push is occurring (same SQLite file used for reads and writes).
fossil info
fossil timeline -n 20 -type ci
fossil dbstat
fossil setting autosync
# Server-side (CGI or standalone): enable verbose logs
# and inspect reverse proxy access/error logs for timeouts
Symptom 2: Sync errors through proxies (e.g., 502/504, "protocol error")
Autosync uses HTTP POST with custom payloads. Proxies with small buffers, short timeouts, or disabled chunked encoding induce failures. Confirm whether direct-to-server sync succeeds while proxied sync fails.
# Attempt a direct sync (bypassing the proxy)
fossil sync --verbose --httptrace https://fossil-backend.internal/repo
# Compare with the proxy URL
fossil sync --verbose --httptrace https://fossil.example.com/repo
Symptom 3: Hash mismatches or "unknown artifact" after upgrades or cross-site clones
Mixed hash policies (SHA1 vs SHA3-256) or a partially shunned artifact set can trigger verification failures. Diagnose by checking the hash policy and running an integrity check.
fossil hash-policy
fossil test-integrity
sqlite3 repo.fossil "SELECT uuid FROM shun;"   # list shunned artifacts
fossil whatis ARTIFACT_ID
Symptom 4: Private data unexpectedly present in clones or backups
Fossil supports private branches, private tickets, and unversioned content. A misstep during mirroring, export, or git-bridge operations can unintentionally publicize private artifacts. Audit with privacy-aware commands and scrub if needed.
# Audit before publishing; scrub is destructive, so work on a backup copy
fossil uv list
sqlite3 repo.fossil "SELECT uuid FROM shun;"   # review shunned artifacts
# Review private branches and tickets via the web UI with setup privileges
fossil scrub --private --verily                # removes private material if needed
Symptom 5: Web UI timeouts and "I/O error" during timeline or diff
Large diffs or massive timelines over slow disks can exceed proxy or server timeouts. Identify whether the backend or proxy is timing out first, and profile disk latency.
# Server logs (standalone)
fossil server repo.fossil --port 8080 --scgi --th-trace
# Reverse proxy logs: check proxy_read_timeout / scgi_read_timeout (or the equivalent)
# OS-level: iostat, vmstat to observe disk stalls
Common Pitfalls and Why They Occur
- Running Fossil on network file systems: NFS/SMB semantics and locking can corrupt or stall SQLite writes. Prefer local SSD or a robust block device with fsync guarantees.
- Inconsistent snapshotting: VM or storage snapshots taken without SQLite's backup API can capture a half-committed state. Restores then force expensive recovery or rebuilds.
- Proxy defaults: L7 devices ship with conservative body-size, buffering, and timeout settings; they are unsuitable for large binary sync payloads until tuned.
- Mixed client versions: Old clients speaking older sync dialects may not understand modern server responses, especially when hash policies differ.
- Multi-writer contention: Burst pushes from CI and developers to the same repo on spinning disks produce "database is locked" spikes.
Step-by-Step Fixes
1) Stabilize Storage and Concurrency
Ensure the repository resides on low-latency, durable storage. Avoid network file systems for the live repo; if mandated, configure strict POSIX locking and test aggressively.
# Move the repo to local SSD and vacuum to improve locality
systemctl stop fossil-server
cp /mnt/nas/repo.fossil /var/lib/fossil/repo.fossil
sqlite3 /var/lib/fossil/repo.fossil "VACUUM;"
systemctl start fossil-server
For heavy write contention, consider a read-replica pattern for the web UI by serving a hot backup copy updated on a schedule, keeping the writer repo isolated for sync traffic.
# Create a consistent backup using the sqlite3 backup API
sqlite3 /var/lib/fossil/repo.fossil ".backup /var/lib/fossil/repo-ro.fossil"
# Point the web UI at repo-ro.fossil (read-only), while writers push to repo.fossil
2) Repair and Reindex After Anomalies
After storage or snapshot incidents, run integrity checks and, if needed, rebuild derived content tables from canonical artifacts.
fossil test-integrity -R repo.fossil
# Rebuild derived tables (manifests, delta chains, etc.)
fossil rebuild --stats --cluster repo.fossil
# If rebuild warns about shunned artifacts, review the shun list and re-run
sqlite3 repo.fossil "SELECT uuid FROM shun;"
fossil rebuild --noverify repo.fossil
fossil rebuild is CPU- and I/O-intensive; schedule it during low-traffic windows and temporarily raise server/proxy timeouts.
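One way to enforce that window is a cron-driven maintenance job; the schedule, paths, and service name below are assumptions, not a prescribed setup:
# /etc/cron.d/fossil-maint (illustrative): Sundays at 02:00
0 2 * * 0 root /usr/local/bin/fossil-maint.sh
# fossil-maint.sh (sketch): quiesce writers, rebuild, verify, restart
#!/bin/sh
set -e
systemctl stop fossil-server
fossil rebuild --stats /var/lib/fossil/repo.fossil
fossil test-integrity -R /var/lib/fossil/repo.fossil
systemctl start fossil-server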
3) Align Hash Policy Across Fleet
Pick a single hash algorithm (prefer SHA3-256) and enforce it on server and clients. Mixed policy increases verification friction and confuses audits.
# On the server repo
fossil hash-policy sha3
# On clients
fossil hash-policy sha3
# Verify no legacy SHA1-only artifacts remain unreferenced
fossil test-integrity
4) Tune Reverse Proxies for Fossil Sync
Configure buffering and timeouts to match expected artifact sizes and latency. Ensure sticky sessions for multi-backend topologies.
# NGINX example (conceptual)
proxy_request_buffering off;
proxy_buffering off;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
client_max_body_size 1g;
proxy_set_header Connection "";
# Sticky sessions (via hash or cookie) to keep sync on one backend
On Apache, review ProxyTimeout, LimitRequestBody, and any modules that alter chunked requests.
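A conceptual Apache counterpart to the NGINX snippet above might look like the following; the values are starting points to tune, not a vetted configuration:
# Apache reverse proxy (conceptual)
ProxyPass        /repo http://127.0.0.1:8080/repo
ProxyPassReverse /repo http://127.0.0.1:8080/repo
ProxyTimeout     600
LimitRequestBody 1073741824
# Avoid modules that re-buffer or re-chunk POST bodies on this path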
5) Harden TLS and Auth Without Breaking Clients
Enterprises often enforce mutual TLS or SSO. Confirm Fossil's authentication realm matches reverse proxy headers and that TLS terminators pass client DN or headers consistently.
# Fossil CGI environment mapping (illustrative)
# Ensure REMOTE_USER or auth headers survive the proxy hop
RequestHeader set X-Remote-User %{REMOTE_USER}e env=REMOTE_USER
# In Fossil settings, choose login-group or header-based auth as appropriate
6) Control Autosync in Hostile Networks
Disable autosync by default for developers behind restrictive proxies, and provide a "sync-on-demand" script with verbose tracing for support.
fossil settings autosync off
# Team wrapper: sync on demand with tracing
fossil settings ssl-identity dev-cert.pem
fossil sync --verbose --httptrace https://vcs.company/repo
7) Prevent Private Data Leakage
Audit private branches, tickets, forum posts, and unversioned content before publishing mirrors or running exports. Use scrub and shun lists to remove sensitive artifacts.
# Remove private and unreferenced items aggressively
fossil scrub --private --verily --force
# Shun a leaked artifact by ID
fossil shun ABCDE12345...
fossil rebuild
8) Make Backups Atomic and Restores Predictable
Use SQLite's online backup to capture a consistent file while Fossil is running. Store offsite copies encrypted, and document restore drills that include a verify.
sqlite3 live.fossil ".backup repo-$(date +%F).fossil"
# On restore
cp repo-YYYY-MM-DD.fossil restored.fossil
fossil test-integrity -R restored.fossil
fossil rebuild restored.fossil
9) Segment Heavy Reads From Writes
Serve web UI browsing from a read-only copy while CI and developers push to the writer. Refresh the read-only copy on a schedule or via hooks post-push.
# Post-push hook (conceptual)
sqlite3 /srv/repo/live.fossil ".backup /srv/repo/ro.fossil"
# Web UI points at /srv/repo/ro.fossil
10) Throttle and Batch CI Pushes
Large CI artifacts and frequent commits can starve human pushes. Batch low-priority CI commits and publish build outputs as unversioned content rather than checking them into history.
# Example: publish build artifacts as unversioned content
fossil uv add dist/app-1.2.3.zip --as releases/app-1.2.3.zip
fossil uv sync
# Keep VCS history clean; avoid binary blobs in normal check-ins
Advanced Diagnostics Playbook
Trace Sync Protocol
Use verbose and HTTP trace flags to capture the sequence of Fossil protocol exchanges. Correlate with proxy logs to identify buffering or auth header loss.
fossil sync --verbose --httptrace 2>&1 | tee sync.trace
# Inspect for long gaps or aborted reads
Inspect and Tune SQLite Settings
Fossil configures SQLite aggressively for durability. For high-throughput servers on reliable storage, test write-ahead logging and page cache settings to reduce lock durations.
fossil sqlite3 repo.fossil
PRAGMA journal_mode;
PRAGMA synchronous;
PRAGMA page_size;
PRAGMA cache_size;
-- Adjust cautiously and benchmark before/after
Detect File-System Pathologies
Use OS tools to detect I/O stalls. SSDs with an exhausted write cache or RAID arrays rebuilding can manifest as "database is locked" at the application layer.
iostat -x 1
vmstat 1
dmesg | tail -200
Audit Permissions and Roles
Misconfigured anonymous or reader permissions can expose unversioned content or private tickets through the web UI even if the repo appears gated. Review capabilities and test with a non-privileged user.
fossil user list
fossil user capabilities USERNAME
# Validate caps: o,r,w,a,s,n and custom roles
# Test with a clean browser session
Design Decisions With Long-Term Impact
Choosing the Sync Topology
Fossil supports hub-and-spoke, full mesh, and tiered mirrors. Hubs concentrate writes and simplify backup but can bottleneck on a single repo file. Mesh reduces central bottlenecks but complicates conflict resolution and privacy. For regulated environments, a primary hub + read mirrors pattern with periodic, audited backup is often optimal.
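A minimal sketch of the hub-plus-read-mirrors pattern, assuming a cron-driven pull on each mirror; the hub URL and paths are illustrative:
# Crontab entry on each read mirror: pull from the hub every five minutes
*/5 * * * * fossil pull https://hub.internal/repo -R /srv/fossil/mirror.fossil
# Browsing traffic is served from the mirror; pushes go only to the hub
fossil server /srv/fossil/mirror.fossil --port 8081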
Repository Sizing and Sharding
Fossil handles large repos well, but gigantic binary histories inflate clone times and rebuild windows. Consider splitting repos by product or component, and publish public artifacts via unversioned slots instead of normal check-ins.
Policy for Private Data
Make privacy a first-class policy: tag private branches clearly, restrict private ticket usage to dedicated roles, and integrate scrub into pre-release pipelines.
Version Discipline
Standardize clients and servers per quarter or release train. Outliers should be blocked at login with a message guiding upgrades, reducing protocol drift.
Operational Best Practices
- Single-writer window for maintenance: Quiesce pushes before rebuild or schema upgrades.
- Immutable backups: Store daily SQLite backups in WORM storage; test restores quarterly.
- Proxy blueprints: Keep vetted NGINX/Apache configs with tuned timeouts and buffering; share across teams.
- Observability: Export Fossil server logs, proxy metrics, and OS I/O stats into a central system; set SLOs for push latency and web UI median render.
- Capacity headroom: Maintain 30–50% IOPS headroom on the storage hosting live repos.
- Documentation and runbooks: Codify rollback and disaster recovery; include "how to run rebuild safely" with expected durations.
Pitfall Deep Dives
Private Branch Leaks via Unvetted Mirroring
Symptom: A public mirror unexpectedly displays check-ins that should be private. Cause: A mirror job used credentials with private access and pushed to a public repo. Fix: Revoke credentials, shun leaked artifacts, and scrub private content on the mirror; rotate secrets and split publishing roles.
fossil shun LEAKED_ARTIFACT_ID
fossil scrub --private --verily --force
fossil rebuild
Stuck Autosync After Proxy Replacement
Symptom: Developers report "protocol error" on commit. Cause: New proxy now buffers request bodies and closes idle connections at 60s. Fix: Increase proxy_read_timeout, disable request buffering, enable sticky sessions, and set autosync off temporarily.
fossil settings autosync off
# After proxy change and validation, re-enable
fossil settings autosync on
Integrity Errors After Restoring From VM Snapshot
Symptom: "database disk image malformed" on first write. Cause: Snapshot captured mid-transaction. Fix: Run verify, then rebuild. Adopt SQLite backup API for future snapshots.
fossil test-integrity
fossil rebuild --stats
CI Overwhelms Server With Many Small Pushes
Symptom: Frequent lock errors during work hours. Cause: Multiple CI pipelines pushing per-commit artifacts and tags. Fix: Batch CI pushes, switch artifacts to unversioned content, and rate-limit write jobs.
fossil uv add build.tar.gz --as ci/builds/123/build.tar.gz
fossil uv sync
Security Considerations
Fossil's role-based access must be paired with sane network boundaries. Require HTTPS with modern ciphers, optional mutual TLS for admins, and short-lived tokens for automation. Avoid exposing the writer repo to the internet; put it behind a VPN or zero-trust gateway. For compliance, treat scrub and shun as change-controlled actions with peer review.
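As one illustration, mutual TLS for the administrative hostname can be enforced at the NGINX terminator before any request reaches Fossil; the hostnames and certificate paths are assumptions:
# NGINX: require client certificates on the admin hostname (conceptual)
server {
    listen 443 ssl;
    server_name fossil-admin.example.com;
    ssl_certificate        /etc/nginx/tls/server.crt;
    ssl_certificate_key    /etc/nginx/tls/server.key;
    ssl_client_certificate /etc/nginx/tls/corp-ca.pem;
    ssl_verify_client on;
    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}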
Performance Tuning Checklist
- Local SSD with high random write IOPS for the writer repo.
- Proxy buffering disabled for sync; generous timeouts for slow WAN clients.
- Hash policy unified; clients within one or two minor versions.
- Web UI reads served from a read-only backup to isolate query load.
- CI pushes batched; large binaries shipped as unversioned assets.
- Routine "test-integrity" checks and periodic "rebuild" runs in maintenance windows.
Concrete Runbooks
Runbook: Emergency Restore
# 1) Identify the last good backup
sqlite3 backup-2024-12-01.fossil "PRAGMA integrity_check;"
# 2) Restore
cp backup-2024-12-01.fossil live.fossil
# 3) Verify and rebuild
fossil test-integrity -R live.fossil
fossil rebuild live.fossil
# 4) Re-enable the service, then run a test push/pull
fossil sync --verbose
Runbook: Proxy Cutover Validation
# Pre-check: baseline against the direct backend
fossil sync --httptrace https://backend.internal/repo
# Switch DNS to the proxy, then re-test
fossil sync --httptrace https://vcs.company/repo
# Compare traces and latency; adjust proxy timeouts if gaps exceed 5s
Runbook: Hash Policy Migration
# Prep: ensure all clients are upgraded to a SHA3-capable Fossil
fossil hash-policy sha3
# Verify artifacts, then rebuild
fossil test-integrity
fossil rebuild --stats
Conclusion
Fossil's tight integration and single-file repository yield operational simplicity, but enterprise realities—reverse proxies, security appliances, heavy CI, and snapshot-based backups—introduce failure modes that demand intentional architecture. The surest path to stability is to isolate writes from reads, standardize hash policy and client versions, use SQLite's backup API for consistent copies, tune proxies for long-lived POSTs, and institutionalize privacy hygiene with scrub/shun workflows. Treat the writer repo as a critical database, observe it like any production service, and rehearse rebuild/restore procedures. With these practices, Fossil scales cleanly while retaining its hallmark transparency and auditability.
FAQs
1. How do I safely host Fossil behind a load balancer?
Enable sticky sessions so a sync stays on the same backend, increase read/send timeouts to accommodate large pushes, and disable request buffering for POST bodies. Health-check the backend Fossil process, and keep a single writer repo on fast local storage.
2. When should I run "fossil rebuild" vs "fossil test-integrity"?
"test-integrity" checks artifact hashes and references with minimal disruption, suitable for routine audits. Run "rebuild" after file-system anomalies, shun/scrub operations, or schema upgrades; schedule it during maintenance windows due to heavy I/O.
3. Can I use network file systems for the live repository?
It's risky. NFS/SMB locking and latency can cause "database is locked" or worse. If you must, use a local writer with periodic backups to NFS, and serve the web UI from a read-only copy on the network storage.
4. How do I prevent private data leaks during mirroring?
Restrict mirror credentials to read-only public data, audit private artifacts regularly, and integrate "scrub" before publishing mirrors. Consider a dedicated export repo that never receives private branches or tickets.
5. Why do older clients fail to sync after we changed the hash policy?
They likely do not understand SHA3-256 references. Standardize client versions and enforce an upgrade policy; meanwhile, set clear server messages guiding users to update before enabling strict SHA3-only operation.