Background: How Fossil's Architecture Shapes Failure Modes
Fossil bundles a DVCS, a web UI, and project-management features over a single SQLite repository. Its model emphasizes immutable artifacts, manifests, and a timeline rendered directly from the repo. Sync uses an efficient, delta-aware protocol over HTTP(S) to exchange missing artifacts. This integrated architecture yields powerful operational benefits: zero external database dependencies, portable backups, and a consistent, auditable history. At the same time, certain enterprise conditions—reverse proxies, aggressive TLS middleboxes, slow or lossy WANs, and file-system snapshots—interact with Fossil's design in surprising ways.
Key architectural facts relevant to troubleshooting:
- Single-file repository: A SQLite DB (*.fossil) contains everything. Concurrency and locking semantics therefore inherit SQLite's rules.
- Autosync: By default, client operations (e.g., commit, update) may trigger a push/pull. Firewalls or proxies that mutate HTTP requests can break autosync.
- Hash policy: Modern Fossil defaults to SHA3-256; legacy repos may use SHA1. Mixed environments introduce verification warnings or sync refusals until policy aligns.
- Integrated services: Tickets, wiki, forum, and unversioned content share the same repository. Misconfigured permissions or backups may inadvertently expose data.
- Self-hosted HTTP: Fossil can run as a standalone server, via CGI/SCGI, or behind a reverse proxy. Each deployment path has distinct timeout, buffering, and auth behaviors that affect sync; the sketch below shows the three invocation modes.
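For orientation, the three deployment paths look roughly like this; the binary location, ports, and repository paths are illustrative, not prescribed:
# Standalone HTTP server (illustrative paths and ports)
fossil server /var/lib/fossil/repo.fossil --port 8080
# SCGI backend behind a reverse proxy
fossil server /var/lib/fossil/repo.fossil --scgi --port 9000
# CGI: a two-line script executed by the web server
#!/usr/bin/fossil
repository: /var/lib/fossil/repo.fossil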
Architecture and Deployment Patterns: Where Enterprises Get Bitten
Load Balancers and Reverse Proxies
Many enterprises place Fossil behind NGINX/Apache/ATS or an L7 appliance that terminates TLS and forwards to an internal Fossil process or CGI. Problems arise when:
- Chunked encoding is rebuffered or disabled, delaying request bodies and causing timeouts during large syncs.
- Idle timeouts are too short for rebuild or heavy push operations.
- Sticky sessions are disabled, sending incremental sync requests to different backends and invalidating stateful assumptions.
Network Security Layers
Inline DLP/IDS systems sometimes rewrite or block POSTs with binary payloads, triggering sync errors that masquerade as authentication failures. TLS inspection can also break client certificate auth or downgrade ciphers that Fossil expects.
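One quick check for TLS interception is to compare the certificate issuer seen on a direct connection with the one seen through the corporate path; the hostnames below are illustrative:
# Issuer seen when connecting directly to the backend
openssl s_client -connect fossil-backend.internal:443 -servername fossil-backend.internal </dev/null 2>/dev/null | openssl x509 -noout -issuer
# Issuer seen through the inspected path
openssl s_client -connect fossil.example.com:443 -servername fossil.example.com </dev/null 2>/dev/null | openssl x509 -noout -issuer
# A corporate CA as issuer on the second connection usually indicates interception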
Storage, Snapshots, and Backups
Because a Fossil repo is an active SQLite DB, inconsistent snapshots or copy-on-write clones taken mid-transaction can produce integrity errors. VM or container snapshots without quiescing the file may require a rebuild on restore.
Diagnostics: From Symptom to Root Cause
Symptom 1: "database is locked" or long stalls under write load
When multiple writers push to the same server repo, short-duration locks are normal; persistent lock errors indicate contention or operational hazards (slow I/O, antivirus scans, or backup agents holding read handles). Look for these patterns:
- Lock only during certain cron windows (backup agent collision).
- Stalls correlated with large pushes from remote sites (WAN jitter).
- Locks triggered by web UI browsing of heavy timeline pages while a push is occurring (same SQLite file used for reads and writes).
fossil info
fossil timeline -n 20 -type ci
fossil dbstat
fossil setting autosync
# Server-side (CGI or standalone): enable verbose logs
# and inspect reverse proxy access/error logs for timeouts
Symptom 2: Sync errors through proxies (e.g., 502/504, "protocol error")
Autosync uses HTTP POST with custom payloads. Proxies with small buffers, short timeouts, or disabled chunked encoding induce failures. Confirm whether direct-to-server sync succeeds while proxied sync fails.
# Attempt a direct sync (bypassing the proxy)
fossil sync --verbose --httptrace https://fossil-backend.internal/repo
# Compare with the proxy URL
fossil sync --verbose --httptrace https://fossil.example.com/repo
Symptom 3: Hash mismatches or "unknown artifact" after upgrades or cross-site clones
Mixed hash policies (SHA1 vs SHA3-256) or a partially shunned artifact set can trigger verification failures. Diagnose by checking the hash policy and running an integrity check.
fossil hash-policy
fossil test-integrity
sqlite3 repo.fossil "SELECT uuid FROM shun;"   # list shunned artifacts
fossil whatis ARTIFACT_ID
Symptom 4: Private data unexpectedly present in clones or backups
Fossil supports private branches, private tickets, and unversioned content. A misstep during mirroring, export, or git-bridge operations can unintentionally publicize private artifacts. Audit with privacy-aware commands and scrub if needed.
# Audit before publishing; scrub is destructive, so work on a backup copy
fossil uv list
sqlite3 repo.fossil "SELECT uuid FROM shun;"   # review shunned artifacts
# Review private branches and tickets via the web UI with setup privileges
fossil scrub --private --verily                # removes private material if needed
Symptom 5: Web UI timeouts and "I/O error" during timeline or diff
Large diffs or massive timelines over slow disks can exceed proxy or server timeouts. Identify whether the backend or proxy is timing out first, and profile disk latency.
# Server logs (standalone)
fossil server repo.fossil --port 8080 --scgi --th-trace
# Reverse proxy logs: check proxy_read_timeout / scgi_read_timeout (or the equivalent)
# OS-level: iostat, vmstat to observe disk stalls
Common Pitfalls and Why They Occur
- Running Fossil on network file systems: NFS/SMB semantics and locking can corrupt or stall SQLite writes. Prefer local SSD or a robust block device with fsync guarantees.
- Inconsistent snapshotting: VM or storage snapshots taken without SQLite's backup API can capture a half-committed state. Restores then force expensive recovery or rebuilds.
- Proxy defaults: L7 devices ship with conservative body-size, buffering, and timeout settings; they are unsuitable for large binary sync payloads until tuned.
- Mixed client versions: Old clients speaking older sync dialects may not understand modern server responses, especially when hash policies differ.
- Multi-writer contention: Burst pushes from CI and developers to the same repo on spinning disks produce "database is locked" spikes.
Step-by-Step Fixes
1) Stabilize Storage and Concurrency
Ensure the repository resides on low-latency, durable storage. Avoid network file systems for the live repo; if mandated, configure strict POSIX locking and test aggressively.
# Move the repo to local SSD and vacuum to improve locality
systemctl stop fossil-server
cp /mnt/nas/repo.fossil /var/lib/fossil/repo.fossil
sqlite3 /var/lib/fossil/repo.fossil "VACUUM;"
systemctl start fossil-server
For heavy write contention, consider a read-replica pattern for the web UI by serving a hot backup copy updated on a schedule, keeping the writer repo isolated for sync traffic.
# Create a consistent backup using the sqlite3 backup API
sqlite3 /var/lib/fossil/repo.fossil ".backup /var/lib/fossil/repo-ro.fossil"
# Point the web UI at repo-ro.fossil (read-only), while writers push to repo.fossil
2) Repair and Reindex After Anomalies
After storage or snapshot incidents, run integrity checks and, if needed, rebuild derived content tables from canonical artifacts.
fossil test-integrity -R repo.fossil
# Rebuild derived tables (manifests, delta chains, etc.)
fossil rebuild --stats --cluster repo.fossil
# If rebuild warns about shunned artifacts, review the shun list and re-run
sqlite3 repo.fossil "SELECT uuid FROM shun;"
fossil rebuild --noverify repo.fossil
fossil rebuild is CPU- and I/O-intensive; schedule it during low-traffic windows and temporarily raise server/proxy timeouts.
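One way to enforce that window is a cron-driven maintenance job; the schedule, paths, and service name below are assumptions, not a prescribed setup:
# /etc/cron.d/fossil-maint (illustrative): Sundays at 02:00
0 2 * * 0 root /usr/local/bin/fossil-maint.sh
# fossil-maint.sh (sketch): quiesce writers, rebuild, verify, restart
#!/bin/sh
set -e
systemctl stop fossil-server
fossil rebuild --stats /var/lib/fossil/repo.fossil
fossil test-integrity -R /var/lib/fossil/repo.fossil
systemctl start fossil-server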
3) Align Hash Policy Across Fleet
Pick a single hash algorithm (prefer SHA3-256) and enforce it on server and clients. Mixed policy increases verification friction and confuses audits.
# On the server repo
fossil hash-policy sha3
# On clients
fossil hash-policy sha3
# Verify no legacy SHA1-only artifacts remain unreferenced
fossil test-integrity
4) Tune Reverse Proxies for Fossil Sync
Configure buffering and timeouts to match expected artifact sizes and latency. Ensure sticky sessions for multi-backend topologies.
# NGINX example (conceptual)
proxy_request_buffering off;
proxy_buffering off;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
client_max_body_size 1g;
proxy_set_header Connection "";
# Sticky sessions (via hash or cookie) to keep sync on one backend
On Apache, review ProxyTimeout, LimitRequestBody, and any modules that alter chunked requests.
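A conceptual Apache counterpart to the NGINX snippet above might look like the following; the values are starting points to tune, not a vetted configuration:
# Apache reverse proxy (conceptual)
ProxyPass        /repo http://127.0.0.1:8080/repo
ProxyPassReverse /repo http://127.0.0.1:8080/repo
ProxyTimeout     600
LimitRequestBody 1073741824
# Avoid modules that re-buffer or re-chunk POST bodies on this path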
5) Harden TLS and Auth Without Breaking Clients
Enterprises often enforce mutual TLS or SSO. Confirm Fossil's authentication realm matches reverse proxy headers and that TLS terminators pass client DN or headers consistently.
# Fossil CGI environment mapping (illustrative)
# Ensure REMOTE_USER or auth headers survive the proxy hop
RequestHeader set X-Remote-User %{REMOTE_USER}e env=REMOTE_USER
# In Fossil settings, choose login-group or header-based auth as appropriate
6) Control Autosync in Hostile Networks
Disable autosync by default for developers behind restrictive proxies, and provide a "sync-on-demand" script with verbose tracing for support.
fossil settings autosync off
# Team wrapper: sync on demand with tracing
fossil settings ssl-identity dev-cert.pem
fossil sync --verbose --httptrace https://vcs.company/repo
7) Prevent Private Data Leakage
Audit private branches, tickets, forum posts, and unversioned content before publishing mirrors or running exports. Use scrub and shun lists to remove sensitive artifacts.
# Remove private and unreferenced items aggressively
fossil scrub --private --verily --force
# Shun a leaked artifact by ID
fossil shun ABCDE12345...
fossil rebuild
8) Make Backups Atomic and Restores Predictable
Use SQLite's online backup to capture a consistent file while Fossil is running. Store offsite copies encrypted, and document restore drills that include a verify.
sqlite3 live.fossil ".backup repo-$(date +%F).fossil"
# On restore
cp repo-YYYY-MM-DD.fossil restored.fossil
fossil test-integrity -R restored.fossil
fossil rebuild restored.fossil
9) Segment Heavy Reads From Writes
Serve web UI browsing from a read-only copy while CI and developers push to the writer. Refresh the read-only copy on a schedule or via hooks post-push.
# Post-push hook (conceptual)
sqlite3 /srv/repo/live.fossil ".backup /srv/repo/ro.fossil"
# Web UI points at /srv/repo/ro.fossil
10) Throttle and Batch CI Pushes
Large CI artifacts and frequent commits can starve human pushes. Batch low-priority CI commits and publish build outputs as unversioned content rather than checking them into history.
# Example: publish build artifacts as unversioned content
fossil uv add dist/app-1.2.3.zip --as releases/app-1.2.3.zip
fossil uv sync
# Keep VCS history clean; avoid binary blobs in normal check-ins
Advanced Diagnostics Playbook
Trace Sync Protocol
Use verbose and HTTP trace flags to capture the sequence of Fossil protocol exchanges. Correlate with proxy logs to identify buffering or auth header loss.
fossil sync --verbose --httptrace 2>&1 | tee sync.trace
# Inspect for long gaps or aborted reads
Inspect and Tune SQLite Settings
Fossil configures SQLite aggressively for durability. For high-throughput servers on reliable storage, test write-ahead logging and page cache settings to reduce lock durations.
fossil sqlite3 repo.fossil
PRAGMA journal_mode;
PRAGMA synchronous;
PRAGMA page_size;
PRAGMA cache_size;
-- Adjust cautiously and benchmark before/after
Detect File-System Pathologies
Use OS tools to detect I/O stalls. SSDs with an exhausted write cache or RAID arrays rebuilding can manifest as "database is locked" at the application layer.
iostat -x 1
vmstat 1
dmesg | tail -200
Audit Permissions and Roles
Misconfigured anonymous or reader permissions can expose unversioned content or private tickets through the web UI even if the repo appears gated. Review capabilities and test with a non-privileged user.
fossil user list
fossil user capabilities USERNAME
# Validate caps: o,r,w,a,s,n and custom roles
# Test with a clean browser session
Design Decisions With Long-Term Impact
Choosing the Sync Topology
Fossil supports hub-and-spoke, full mesh, and tiered mirrors. Hubs concentrate writes and simplify backup but can bottleneck on a single repo file. Mesh reduces central bottlenecks but complicates conflict resolution and privacy. For regulated environments, a primary hub + read mirrors pattern with periodic, audited backup is often optimal.
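A minimal sketch of the hub-plus-read-mirrors pattern, assuming a cron-driven pull on each mirror; the hub URL and paths are illustrative:
# Crontab entry on each read mirror: pull from the hub every five minutes
*/5 * * * * fossil pull https://hub.internal/repo -R /srv/fossil/mirror.fossil
# Browsing traffic is served from the mirror; pushes go only to the hub
fossil server /srv/fossil/mirror.fossil --port 8081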
Repository Sizing and Sharding
Fossil handles large repos well, but gigantic binary histories inflate clone times and rebuild windows. Consider splitting repos by product or component, and publish public artifacts via unversioned slots instead of normal check-ins.
Policy for Private Data
Make privacy a first-class policy: tag private branches clearly, restrict private ticket usage to dedicated roles, and integrate scrub into pre-release pipelines.
Version Discipline
Standardize clients and servers per quarter or release train. Outliers should be blocked at login with a message guiding upgrades, reducing protocol drift.
Operational Best Practices
- Single-writer window for maintenance: Quiesce pushes before rebuild or schema upgrades.
- Immutable backups: Store daily SQLite backups in WORM storage; test restores quarterly.
- Proxy blueprints: Keep vetted NGINX/Apache configs with tuned timeouts and buffering; share across teams.
- Observability: Export Fossil server logs, proxy metrics, and OS I/O stats into a central system; set SLOs for push latency and web UI median render.
- Capacity headroom: Maintain 30–50% IOPS headroom on the storage hosting live repos.
- Documentation and runbooks: Codify rollback and disaster recovery; include "how to run rebuild safely" with expected durations.
Pitfall Deep Dives
Private Branch Leaks via Unvetted Mirroring
Symptom: A public mirror unexpectedly displays check-ins that should be private. Cause: A mirror job used credentials with private access and pushed to a public repo. Fix: Revoke credentials, shun leaked artifacts, and scrub private content on the mirror; rotate secrets and split publishing roles.
fossil shun LEAKED_ARTIFACT_ID
fossil scrub --private --verily --force
fossil rebuild
Stuck Autosync After Proxy Replacement
Symptom: Developers report "protocol error" on commit. Cause: New proxy now buffers request bodies and closes idle connections at 60s. Fix: Increase proxy_read_timeout, disable request buffering, enable sticky sessions, and set autosync off temporarily.
fossil settings autosync off
# After proxy change and validation, re-enable
fossil settings autosync on
Integrity Errors After Restoring From VM Snapshot
Symptom: "database disk image malformed" on first write. Cause: Snapshot captured mid-transaction. Fix: Run verify, then rebuild. Adopt SQLite backup API for future snapshots.
fossil test-integrity
fossil rebuild --stats
CI Overwhelms Server With Many Small Pushes
Symptom: Frequent lock errors during work hours. Cause: Multiple CI pipelines pushing per-commit artifacts and tags. Fix: Batch CI pushes, switch artifacts to unversioned content, and rate-limit write jobs.
fossil uv add build.tar.gz --as ci/builds/123/build.tar.gz
fossil uv sync
Security Considerations
Fossil's role-based access must be paired with sane network boundaries. Require HTTPS with modern ciphers, optional mutual TLS for admins, and short-lived tokens for automation. Avoid exposing the writer repo to the internet; put it behind a VPN or zero-trust gateway. For compliance, treat scrub and shun as change-controlled actions with peer review.
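As one illustration, mutual TLS for the administrative hostname can be enforced at the NGINX terminator before any request reaches Fossil; the hostnames and certificate paths are assumptions:
# NGINX: require client certificates on the admin hostname (conceptual)
server {
    listen 443 ssl;
    server_name fossil-admin.example.com;
    ssl_certificate        /etc/nginx/tls/server.crt;
    ssl_certificate_key    /etc/nginx/tls/server.key;
    ssl_client_certificate /etc/nginx/tls/corp-ca.pem;
    ssl_verify_client on;
    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}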
Performance Tuning Checklist
- Local SSD with high random write IOPS for the writer repo.
- Proxy buffering disabled for sync; generous timeouts for slow WAN clients.
- Hash policy unified; clients within one or two minor versions.
- Web UI reads served from a read-only backup to isolate query load.
- CI pushes batched; large binaries shipped as unversioned assets.
- Routine "test-integrity" checks and periodic "rebuild" runs in maintenance windows.
Concrete Runbooks
Runbook: Emergency Restore
# 1) Identify the last good backup
sqlite3 backup-2024-12-01.fossil "PRAGMA integrity_check;"
# 2) Restore
cp backup-2024-12-01.fossil live.fossil
# 3) Verify and rebuild
fossil test-integrity -R live.fossil
fossil rebuild live.fossil
# 4) Re-enable the service, then run a test push/pull
fossil sync --verbose
Runbook: Proxy Cutover Validation
# Pre-check: baseline against the direct backend
fossil sync --httptrace https://backend.internal/repo
# Switch DNS to the proxy, then re-test
fossil sync --httptrace https://vcs.company/repo
# Compare traces and latency; adjust proxy timeouts if gaps exceed 5s
Runbook: Hash Policy Migration
# Prep: ensure all clients are upgraded to a SHA3-capable Fossil
fossil hash-policy sha3
# Verify artifacts, then rebuild
fossil test-integrity
fossil rebuild --stats
Conclusion
Fossil's tight integration and single-file repository yield operational simplicity, but enterprise realities—reverse proxies, security appliances, heavy CI, and snapshot-based backups—introduce failure modes that demand intentional architecture. The surest path to stability is to isolate writes from reads, standardize hash policy and client versions, use SQLite's backup API for consistent copies, tune proxies for long-lived POSTs, and institutionalize privacy hygiene with scrub/shun workflows. Treat the writer repo as a critical database, observe it like any production service, and rehearse rebuild/restore procedures. With these practices, Fossil scales cleanly while retaining its hallmark transparency and auditability.
FAQs
1. How do I safely host Fossil behind a load balancer?
Enable sticky sessions so a sync stays on the same backend, increase read/send timeouts to accommodate large pushes, and disable request buffering for POST bodies. Health-check the backend Fossil process, and keep a single writer repo on fast local storage.
2. When should I run "fossil rebuild" vs "fossil test-integrity"?
"test-integrity" checks artifact hashes and references with minimal disruption, suitable for routine audits. Run "rebuild" after file-system anomalies, shun/scrub operations, or schema upgrades; schedule it during maintenance windows due to heavy I/O.
3. Can I use network file systems for the live repository?
It's risky. NFS/SMB locking and latency can cause "database is locked" or worse. If you must, use a local writer with periodic backups to NFS, and serve the web UI from a read-only copy on the network storage.
4. How do I prevent private data leaks during mirroring?
Restrict mirror credentials to read-only public data, audit private artifacts regularly, and integrate "scrub" before publishing mirrors. Consider a dedicated export repo that never receives private branches or tickets.
5. Why do older clients fail to sync after we changed the hash policy?
They likely do not understand SHA3-256 references. Standardize client versions and enforce an upgrade policy; meanwhile, set clear server messages guiding users to update before enabling strict SHA3-only operation.