Background: How Platform.sh Shapes Your Troubleshooting Model
Build, Deploy, Runtime: Three Distinct Phases
Platform.sh enforces a clean separation: the build phase assembles artifacts in an immutable image; the deploy phase runs hooks and connects services; the runtime phase serves traffic in read-only application containers with specific writable mounts. Most problems arise when assumptions bleed between phases, such as writing to the filesystem at runtime where only mounts are writable.
Composable Architecture: Apps, Services, Relationships
Applications are defined in .platform.app.yaml and connect to managed services (databases, Redis, Kafka, Solr) defined in .platform/services.yaml. Runtime credentials are injected via relationships and environment variables. Misaligned relationship names, version mismatches, and incorrect health checks are frequent root causes of failed deployments and timeouts.
Edge Router, Routes, and Caching
Traffic flows through a global edge layer based on routes.yaml. Caching, redirects, headers, and upstream timeouts live here. A single YAML typo can produce redirect loops, broken TLS termination, or bypassed cache. Since routes are applied atomically during deploy, mistakes affect the entire environment instantly.
Symptoms That Matter in Enterprise Contexts
- Long or stuck deployments associated with failing build/deploy hooks.
- 502/503/504 responses during traffic spikes or after a routes change.
- Write failures and "read-only file system" errors at runtime.
- Unexplained latency increases due to container memory pressure or worker starvation.
- Schedule drift or missed jobs because cron commands assume interactive shells.
- Data integrity incidents after schema changes during rolling releases.
Architecture-Aware Mental Models
Immutability and Mounts
The application filesystem is immutable after build. Only defined mounts are writable at runtime and persist across deployments. Build-time caches live outside the runtime and must be explicitly configured. Understand this model to avoid cache-invalidating deployments and runtime write errors.
Atomic, Branch-Driven Environments
Each Git branch maps to an environment with its own services and credentials. Promotions and merges trigger full build and deploy cycles. Diagnostics should always be correlated with the specific Git SHA and environment ID to avoid chasing phantom issues across branches.
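A minimal sketch of that correlation step, assuming the standard PLATFORM_BRANCH, PLATFORM_TREE_ID, and PLATFORM_ENVIRONMENT variables that Platform.sh injects into containers (the `:-unknown` defaults just keep the snippet runnable outside one):

```shell
# Capture the exact code and environment identifiers before any diagnosis,
# so every observation can be tied back to a specific build.
branch="${PLATFORM_BRANCH:-unknown}"
tree_id="${PLATFORM_TREE_ID:-unknown}"
environment="${PLATFORM_ENVIRONMENT:-unknown}"
printf 'branch=%s tree_id=%s env=%s\n' "$branch" "$tree_id" "$environment"
```

Prepending that one line to every log excerpt shared in an incident channel keeps cross-branch confusion out of the postmortem.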
Service Contracts via Relationships
Your app does not know hostnames or passwords a priori. It receives them through relationships exposed as environment variables or JSON. Schema or client library assumptions must respect the versions specified in services.yaml. Upgrades require coordinated changes in both service definitions and application code.
Diagnostics: A Systematic Playbook
1) Baseline the Environment
Start by enumerating the precise commit, environment, app container, and service versions. Retrieve effective configuration and variables. Validate mounts and their sizes.
```bash
# Identify environment and commit
echo $PLATFORM_BRANCH
echo $PLATFORM_TREE_ID

# Inspect relationships
echo $PLATFORM_RELATIONSHIPS | base64 -d | jq .

# Check mounts
df -h
mount

# Confirm runtime variables
env | sort
```
2) Inspect Activities and Logs
Deployment problems usually leave a detailed activity trail. Capture build and deploy hook outputs, then correlate timestamps with application logs and the edge router logs. Pay special attention to hook step boundaries.
```bash
# Tail recent application logs
platform log --app app

# Retrieve the latest activity output
platform activity:list --state complete --limit 5
platform activity:log ACTIVITY_ID
```
3) SSH for Runtime Forensics
Shell into the running container to verify process health, memory usage, open file descriptors, and lock files that can prevent rolling deploy success. Confirm your app binds to the expected port and that health endpoints respond.
```bash
# Open a shell
platform ssh

# Process and memory snapshot
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -20
free -m
ulimit -n

# Health and port checks
curl -sS localhost:$PORT/healthz -I
ss -ltnp
```
4) Route-Level Troubleshooting
A misconfigured routes.yaml manifests as redirect loops, forced downloads for static files, or sudden cache misses. Verify the computed route map and check whether upstream timeouts align with your app's p95 latency.
```bash
# Dump effective routes
cat .platform/routes.yaml

# Validate response behavior
curl -I https://YOUR-ROUTE/
curl -I https://YOUR-ROUTE/some/static.png
```
5) Service Connectivity and Credentials
When apps fail during boot, it's often due to misnamed relationships or wrong client URIs. Decode relationship JSON and construct DSNs carefully. Rotate credentials after service version bumps or restores.
```bash
# Example: build a PostgreSQL DSN from relationships.
# PLATFORM_RELATIONSHIPS is base64-encoded JSON, so decode it inside Python.
python3 - <<'PY'
import base64
import json
import os

rel = json.loads(base64.b64decode(os.environ["PLATFORM_RELATIONSHIPS"]))
pg = rel["database"][0]
dsn = f"postgresql://{pg['username']}:{pg['password']}@{pg['host']}:{pg['port']}/{pg['path']}"
print(dsn)
PY
```
Common Pitfalls and Their Root Causes
Runtime Writes to Immutable Paths
Writing to project root or build-time paths causes intermittent failures that only appear after deployment, when the filesystem becomes read-only. This often hides behind libraries that expect /tmp or user directories. The fix is to declare mounts and point libraries to those paths via environment variables.
Bloated Builds and Disk Pressure
Large node_modules, vendor directories, or oversized Docker-style assets can exceed the build or disk quota, leading to incomplete deployments, slow cold builds, and frequent GC pressure. Artifact pruning and dependency deduplication are mandatory at scale.
Relationship Name Drift
Renaming a relationship in services.yaml without updating code breaks boot. Because the failure appears as a generic connection error, teams can misdiagnose it as a network or credential issue. Always change code and YAML in a single PR with a coordinated deploy plan.
Routes: Redirect Loops and Overlapping Prefixes
Multiple routes with overlapping prefixes and conflicting redirects produce loops that surface only after the atomic switch. These errors evade local tests because router behavior is tightly coupled to Platform.sh's edge layer.
Workers and Cron: Undersized or Implicit
Background processing built into the web container can starve HTTP workers and trigger 502s under load. Cron tasks that rely on login shells or interactive environment setup can fail silently. Separate workers and make cron commands explicit.
Step-by-Step Fixes With Production Rigor
1) Define Mounts Explicitly and Redirect Writes
Declare durable, writable paths in .platform.app.yaml. Point libraries and frameworks to these mounts via env vars. Audit app code for implicit writes.
```yaml
# .platform.app.yaml (snippet)
name: app
type: "php:8.3"
disk: 2048

mounts:
    "/var":
        source: "local"
        source_path: "var"
    "/public/uploads":
        source: "local"
        source_path: "uploads"

runtime:
    extensions: ["redis", "pdo_pgsql"]

variables:
    env:
        APP_TMP: "/var/tmp"
        APP_UPLOADS: "/public/uploads"
```
2) Optimize Build Hooks for Repeatability and Speed
Move expensive steps to build hooks and cache aggressively. Use deterministic flags for Composer and npm to avoid network variance and non-reproducible lockfiles.
```yaml
# .platform.app.yaml (hooks)
hooks:
    build: |
        set -euxo pipefail
        composer install --no-dev --prefer-dist --no-interaction --optimize-autoloader
        npm ci
        npm run build
    deploy: |
        set -euxo pipefail
        php bin/console cache:warmup
        php bin/console doctrine:migrations:migrate --no-interaction
```
3) Composer and npm Caching
Cache directories across builds to prevent redundant downloads. For very large monorepos, consider partial installs and build artifacts to keep runtime lean.
```yaml
# .platform.app.yaml (caches)
variables:
    env:
        COMPOSER_CACHE_DIR: "/var/cache/composer"
        NPM_CONFIG_CACHE: "/var/cache/npm"

mounts:
    "/var/cache":
        source: "local"
        source_path: "cache"
```
4) Right-Size Concurrency
Balance worker count with CPU and memory quotas to reduce tail latency and OOM churn. Tune PHP-FPM, Node.js clustering, or Python WSGI workers with explicit limits.
```yaml
# PHP-FPM tuning via environment variables
variables:
    env:
        PHP_FPM_PM: "dynamic"
        PHP_FPM_MAX_CHILDREN: "12"
        PHP_FPM_START_SERVERS: "3"
        PHP_FPM_MIN_SPARE_SERVERS: "2"
        PHP_FPM_MAX_SPARE_SERVERS: "6"
```
```javascript
// Node.js cluster (index.js)
const cluster = require("cluster");
const http = require("http");
const os = require("os");

const cpus = Math.max(2, Math.min(8, os.cpus().length));

if (cluster.isPrimary) {
  for (let i = 0; i < cpus; i++) cluster.fork();
} else {
  const port = process.env.PORT || 8080;
  http.createServer((req, res) => { res.end("ok"); }).listen(port);
}
```
5) Separate Web and Worker Roles
Run workers as a second app to isolate CPU and memory from the web tier. Declare explicit relationships and queues, and render a minimal runtime image for each role.
```yaml
# .platform/applications.yaml (two apps)
applications:
    - name: web
      type: "php:8.3"
      relationships:
          redis: "cache:redis"
    - name: worker
      type: "php:8.3"
      relationships:
          redis: "cache:redis"
      web: false

# .platform/services.yaml
cache:
    type: redis:7.2
```
6) Make Cron Deterministic
Define cron jobs with explicit commands, no reliance on shell dotfiles, and idempotent logic. Add application-level locking to avoid duplicate runs after deploys or wake-ups.
```yaml
# .platform.app.yaml (crons)
crons:
    nightly:
        spec: "0 2 * * *"
        cmd: "php bin/console app:report:generate --no-interaction"
    queue:
        spec: "*/2 * * * *"
        cmd: "php bin/console messenger:consume async --time-limit=110"
```
7) Bulletproof Database Migrations
Never run destructive migrations inline with a traffic switch. Use feature toggles, online schema change tools, or phased rollouts with backward-compatible schemas. Place migrations in deploy hooks but gate them with environment checks.
```yaml
# Safe migration wrapper
hooks:
    deploy: |
        set -euo pipefail
        if [ "${PLATFORM_ENVIRONMENT_TYPE:-}" = "production" ]; then
            php bin/console doctrine:migrations:migrate --allow-no-migration --no-interaction
        else
            php bin/console doctrine:migrations:migrate --no-interaction
        fi
```
8) Route Hygiene: Timeouts, Headers, and Caching
Set realistic upstream timeouts and explicit caching behavior. Prevent redirect loops by centralizing canonical host redirects into a single highest-priority route entry.
```yaml
# .platform/routes.yaml
https://www.{default}/:
    type: upstream
    upstream: "app:http"
    cache:
        enabled: true
        headers: ["Accept", "Authorization"]
        default_ttl: 600
    upstream_timeout: 30
    redirects:
        insecure:
            strict: true

https://{default}/:
    type: redirect
    to: "https://www.{default}/"
```
9) Slim Down the Runtime Image
Ship only the assets needed to serve requests. Exclude tests, docs, and dev dependencies from the runtime. Large images degrade deploy time and increase memory pressure.
```json
// composer.json excerpt
{
    "scripts": {
        "post-install-cmd": ["composer dump-autoload -o"]
    },
    "config": {
        "platform": { "php": "8.3.0" },
        "preferred-install": "dist"
    },
    "require-dev": {}
}
```
10) Observability: Labels and Golden Signals
Instrument latency, throughput, error rate, and saturation. Add app-level labels with environment and commit info so you can correlate metrics with deployments. Anomalies often trace back to configuration churn.
```php
# Example log label injection (PHP)
$context = [
    "env" => getenv("PLATFORM_ENVIRONMENT"),
    "commit" => getenv("PLATFORM_TREE_ID"),
];
$logger->info("request_end", $context);
```
Deep Dives Into Tricky Failures
Failure: Deployment Stuck on Deploy Hook
Symptoms: Activity log hangs at a deploy hook; traffic not switched. Root cause: Non-idempotent scripts, missing exit codes, or interactive prompts. Fix: Make scripts non-interactive, enable set -euo pipefail, and redirect verbose output to logs.
```yaml
# Hardened deploy hook
hooks:
    deploy: |
        set -euo pipefail
        php bin/console app:warmup --no-interaction --verbose || { echo "Warmup failed"; exit 1; }
```
Failure: 502s After Route Changes
Symptoms: Users see 502/504 immediately after deploying new routes. Root cause: Upstream not reachable on the expected port, timeout too strict, or circular redirects. Fix: Verify the app listens on $PORT, ensure a single canonical redirect, and raise upstream_timeout above p99 latency.
```javascript
// Ensure app binds to $PORT (Node)
const port = process.env.PORT || 8080;
app.listen(port, () => console.log(`listening ${port}`));
```
Failure: Write Errors in Production
Symptoms: Exceptions citing read-only filesystem when generating thumbnails or cache files. Root cause: Libraries write to default OS paths. Fix: Redirect to declared mounts via env vars and runtime config.
```php
# Laravel example (config/filesystems.php)
"disks" => [
    "local" => [
        "driver" => "local",
        "root" => env("APP_UPLOADS", "/public/uploads"),
    ],
],
```
Failure: Random Connection Resets to DB
Symptoms: Intermittent DB connection failures under bursty load. Root cause: Excessive connection pools or long-lived idle connections exceeding service limits. Fix: Cap pool size, lower idle timeouts, and reuse connections.
```yaml
# Doctrine DBAL (Symfony)
doctrine:
    dbal:
        connections:
            default:
                driver: pdo_pgsql
                server_version: "16"
                url: "%env(resolve:DATABASE_URL)%"
                options:
                    pool_size: 10
```
Failure: Excessive Build Times After Adding a Frontend
Symptoms: Build time jumps from 4 to 20 minutes. Root cause: No caching and large artifact generation in the app container. Fix: Use npm ci, cache directories, and move the frontend to a separate build step that outputs a minimal artifact.
```yaml
# Build frontend into /public only
hooks:
    build: |
        npm ci
        npm run build
        rm -rf node_modules
        find public/assets -type f -name "*.map" -delete
```
Performance Engineering on Platform.sh
Control the Tail: p95 and p99
Upstream timeouts and autoscaling rules are ineffective if the application consistently pushes p99 latency beyond the edge timeout. Apply back-pressure and circuit breakers in code, and bleed off non-critical work to workers. Measure cold paths separately from hot cache hits.
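A quick way to make those tail percentiles concrete is a nearest-rank calculation over per-request latencies; the `seq` line below is stand-in data, and in practice you would feed in the latency field extracted from your access logs:

```shell
# Stand-in data: 100 requests with latencies 1..100 ms.
seq 1 100 > /tmp/latencies.txt

# Sort numerically, then take the nearest-rank 95th/99th percentile entries.
summary=$(sort -n /tmp/latencies.txt | awk '
  { a[NR] = $1 }
  END { printf "p95=%sms p99=%sms n=%d", a[int(NR * 95 / 100)], a[int(NR * 99 / 100)], NR }')
echo "$summary"  # → p95=95ms p99=99ms n=100
```

If the computed p99 sits near your route's upstream timeout, raising the timeout or shedding non-critical work matters more than tuning the average.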
Memory Pressure and GC
In memory-constrained containers, GC churn ruins latency. Reduce object allocation in hot loops, tune JIT or PHP opcache accordingly, and eliminate duplicate caches. A quick ps plus smem snapshot during peak reveals whether workers or background tasks dominate RSS.
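That snapshot might look like the following sketch; smem is often not preinstalled in application containers, so the second step is guarded:

```shell
# Top resident-memory consumers, largest first.
ps -eo pid,rss,comm --sort=-rss | head -10

# Proportional set size per process, if smem happens to be available.
if command -v smem >/dev/null 2>&1; then
  smem -r | head -10
else
  echo "smem not installed; RSS from ps above is the fallback signal"
fi
```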
Static Asset Strategy
Serve immutable, hashed assets with long TTLs via routes configuration and build-time hashing. Avoid sending cache-busting query strings; rely on content hashes in filenames.
```yaml
# Example static route with long TTL
https://www.{default}/assets/:
    type: upstream
    upstream: "app:http"
    cache:
        enabled: true
        default_ttl: 31536000
    headers:
        cache-control: "public, max-age=31536000, immutable"
```
Security and Compliance Considerations Affecting Operations
Secrets and Principle of Least Privilege
Use Platform.sh variables and relationships rather than embedding secrets in code or build logs. Rotate credentials during service upgrades or incident response. Ensure logs never echo secrets via set -x.
```yaml
# Define sensitive variables (redacted in logs)
variables:
    env:
        APP_KEY: "<generated>"
        MAILER_DSN: "smtp://user:pass@host:587"
```
Transport Security and Headers
Enforce HTTPS, HSTS, and modern security headers at the route level. This reduces application bloat and centralizes policy. Incorrect header ordering can break caching; set them intentionally.
```yaml
# Security headers in routes
https://www.{default}/:
    type: upstream
    upstream: "app:http"
    headers:
        strict-transport-security: "max-age=31536000; includeSubDomains; preload"
        x-frame-options: "SAMEORIGIN"
        x-content-type-options: "nosniff"
```
Testing and Release Practices That Prevent Firefighting
Environment Parity
Keep dev, staging, and prod aligned on service versions and route policies. Use branch-based environments to validate migrations and routes before merging. Drift is the enemy of predictable releases.
Contract Tests for Relationships
Write tests that assert the presence and structure of relationship JSON. Break the build if required keys are missing. This prevents late surprises during deploy.
```bash
# Assert relationship keys (bash + jq)
test -n "$PLATFORM_RELATIONSHIPS"
echo "$PLATFORM_RELATIONSHIPS" | base64 -d | jq -e '.database[0] | has("host") and has("password") and has("port")'
```
Feature Flags and Dark Launches
Gate risky features with flags so you can decouple deploy from release. Dark launch endpoints, then progressively enable features to targeted cohorts. Rollback becomes a configuration flip, not a redeploy.
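A flag gate can be as small as one environment variable read. FEATURE_NEW_CHECKOUT below is a hypothetical flag name; on Platform.sh you would typically manage it as an environment-level variable, so flipping it requires no redeploy:

```shell
# Read the hypothetical FEATURE_NEW_CHECKOUT flag, defaulting to off.
if [ "${FEATURE_NEW_CHECKOUT:-false}" = "true" ]; then
  flag_state="enabled"
else
  flag_state="disabled"
fi
echo "new checkout: $flag_state"
```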
Operational Runbooks and SRE Alignment
Golden Path Runbook
Define a minimal set of commands for triage: activity list, last deploy log, routes dump, health checks, service stats. Keep this runbook versioned with the repo and require on-call engineers to know it cold.
```bash
# quick.sh
platform activity:list --limit 3
platform activity:log $(platform activity:list --limit 1 --property id)
curl -sS https://YOUR-ROUTE/healthz -I
platform log --app app
```
Error Budgets and SLIs
Define SLIs for availability, latency, and deploy success rate. Tie error budgets to release velocity. If deploy success rate dips below threshold, freeze feature merges and fix build/deploy reliability first.
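The budget arithmetic is simple enough to keep in the runbook itself; as a sketch, assuming a 99.9% availability SLO over a 30-day window:

```shell
# A 99.9% SLO leaves 0.1% of the window as error budget.
# 30 days * 24 h * 60 min = 43200 minutes; 0.1% of that is the allowance.
budget_min=$(awk 'BEGIN { printf "%.1f", 30 * 24 * 60 * 0.001 }')
echo "monthly error budget: ${budget_min} minutes"  # → 43.2 minutes
```

When the current burn rate projects the budget exhausting before the window ends, that is the trigger for the merge freeze.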
Long-Term Best Practices
- Keep YAML small and explicit: Fewer surprises, easier reviews, faster incident response.
- Separate concerns: Distinct apps for web, workers, and scheduled jobs.
- Make hooks idempotent: Rerunning should never corrupt data or leave partial states.
- Observability by default: Include environment and commit labels in all logs and metrics.
- Guardrails: Pre-merge checks for routes, relationships, and cache sizes.
- Capacity planning: Review disk and memory usage monthly; tune worker counts based on real traffic.
Conclusion
Platform.sh provides powerful guardrails for repeatable builds and atomic deploys, but those same guardrails surface unique failure modes when teams misapply phases, misconfigure routes, or let relationship contracts drift. Senior practitioners can eliminate most firefighting by internalizing Platform.sh's immutable model, codifying mounts and caching, separating roles for web and workers, and hardening hooks. Pair that with production-grade observability and disciplined release practices, and Platform.sh becomes a reliable substrate for complex portfolios instead of an operational enigma.
FAQs
1. How do I safely run database migrations without causing downtime?
Make migrations backward-compatible, run them in deploy hooks with explicit non-interactive flags, and guard destructive changes behind feature toggles. Test the exact SQL against a staging environment with the same service versions and data shape before production.
2. Why do I get read-only filesystem errors after a successful deploy?
Because the application image is immutable at runtime; only declared mounts are writable. Redirect temp files, caches, and uploads to mounts via env vars and library configs to prevent sporadic failures.
3. What causes sudden 502s after changing routes.yaml?
Common culprits include circular redirects, upstream not listening on $PORT, or upstream timeouts set below p99 latency. Validate route precedence, ensure a single canonical redirect target, and raise timeouts to match measured service behavior.
4. How do I reduce build times for a polyglot monorepo?
Enable per-language caches, split web and worker apps, and produce minimal artifacts by pruning dev dependencies and source maps. Consider building frontend assets in a dedicated step and copying only hashed assets into the app image.
5. What's the fastest way to debug a failed deploy?
Fetch the activity log for the failed step, SSH into the environment to verify process health and mounts, and decode relationships to confirm credentials. Use a hardened runbook that collects these signals consistently to avoid ad hoc investigations.