Background: How Platform.sh Shapes Your Troubleshooting Model
Build, Deploy, Runtime: Three Distinct Phases
Platform.sh enforces a clean separation: the build phase assembles artifacts in an immutable image; the deploy phase runs hooks and connects services; the runtime phase serves traffic in read-only application containers with specific writable mounts. Most problems arise when assumptions bleed between phases, such as writing to the filesystem at runtime where only mounts are writable.
Composable Architecture: Apps, Services, Relationships
Applications are defined in .platform.app.yaml and connect to managed services (databases, Redis, Kafka, Solr) defined in .platform/services.yaml. Runtime credentials are injected via relationships and environment variables. Misaligned relationship names, version mismatches, and incorrect health checks are frequent root causes of failed deployments and timeouts.
Edge Router, Routes, and Caching
Traffic flows through a global edge layer based on routes.yaml. Caching, redirects, headers, and upstream timeouts live here. A single YAML typo can produce redirect loops, broken TLS termination, or bypassed cache. Since routes are applied atomically during deploy, mistakes affect the entire environment instantly.
Symptoms That Matter in Enterprise Contexts
- Long or stuck deployments associated with failing build/deploy hooks.
- 502/503/504 responses during traffic spikes or after a routes change.
- Write failures and "read-only file system" errors at runtime.
- Unexplained latency increases due to container memory pressure or worker starvation.
- Schedule drift or missed jobs because cron commands assume interactive shells.
- Data integrity incidents after schema changes during rolling releases.
Architecture-Aware Mental Models
Immutability and Mounts
The application filesystem is immutable after build. Only defined mounts are writable at runtime and persist across deployments. Build-time caches live outside the runtime and must be explicitly configured. Understand this model to avoid cache-invalidating deployments and runtime write errors.
Atomic, Branch-Driven Environments
Each Git branch maps to an environment with its own services and credentials. Promotions and merges trigger full build and deploy cycles. Diagnostics should always be correlated with the specific Git SHA and environment ID to avoid chasing phantom issues across branches.
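A minimal sketch of that correlation step, assuming the standard PLATFORM_BRANCH, PLATFORM_TREE_ID, and PLATFORM_ENVIRONMENT variables that Platform.sh injects into containers (the `:-unknown` defaults just keep the snippet runnable outside one):

```shell
# Capture the exact code and environment identifiers before any diagnosis,
# so every observation can be tied back to a specific build.
branch="${PLATFORM_BRANCH:-unknown}"
tree_id="${PLATFORM_TREE_ID:-unknown}"
environment="${PLATFORM_ENVIRONMENT:-unknown}"
printf 'branch=%s tree_id=%s env=%s\n' "$branch" "$tree_id" "$environment"
```

Prepending that one line to every log excerpt shared in an incident channel keeps cross-branch confusion out of the postmortem.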
Service Contracts via Relationships
Your app does not know hostnames or passwords a priori. It receives them through relationships exposed as environment variables or JSON. Schema or client library assumptions must respect the versions specified in services.yaml. Upgrades require coordinated changes in both service definitions and application code.
Diagnostics: A Systematic Playbook
1) Baseline the Environment
Start by enumerating the precise commit, environment, app container, and service versions. Retrieve effective configuration and variables. Validate mounts and their sizes.
```bash
# Identify environment and commit
echo $PLATFORM_BRANCH
echo $PLATFORM_TREE_ID

# Inspect relationships
echo $PLATFORM_RELATIONSHIPS | base64 -d | jq .

# Check mounts
df -h
mount

# Confirm runtime variables
env | sort
```
2) Inspect Activities and Logs
Deployment problems usually leave a detailed activity trail. Capture build and deploy hook outputs, then correlate timestamps with application logs and the edge router logs. Pay special attention to hook step boundaries.
```bash
# Tail recent application logs
platform log --app app

# Retrieve the latest activity output
platform activity:list --state complete --limit 5
platform activity:log ACTIVITY_ID
```
3) SSH for Runtime Forensics
Shell into the running container to verify process health, memory usage, open file descriptors, and lock files that can prevent rolling deploy success. Confirm your app binds to the expected port and that health endpoints respond.
```bash
# Open a shell
platform ssh

# Process and memory snapshot
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -20
free -m
ulimit -n

# Health and port checks
curl -sS localhost:$PORT/healthz -I
ss -ltnp
```
4) Route-Level Troubleshooting
A misconfigured routes.yaml manifests as redirect loops, forced downloads for static files, or sudden cache misses. Verify the computed route map and check whether upstream timeouts align with your app's p95 latency.
```bash
# Dump effective routes
cat .platform/routes.yaml

# Validate response behavior
curl -I https://YOUR-ROUTE/
curl -I https://YOUR-ROUTE/some/static.png
```
5) Service Connectivity and Credentials
When apps fail during boot, it's often due to misnamed relationships or wrong client URIs. Decode relationship JSON and construct DSNs carefully. Rotate credentials after service version bumps or restores.
```bash
# Example: build a PostgreSQL DSN from relationships.
# PLATFORM_RELATIONSHIPS is base64-encoded JSON, so decode it inside Python.
python3 - <<'PY'
import base64
import json
import os

rel = json.loads(base64.b64decode(os.environ["PLATFORM_RELATIONSHIPS"]))
pg = rel["database"][0]
dsn = f"postgresql://{pg['username']}:{pg['password']}@{pg['host']}:{pg['port']}/{pg['path']}"
print(dsn)
PY
```
Common Pitfalls and Their Root Causes
Runtime Writes to Immutable Paths
Writing to project root or build-time paths causes intermittent failures that only appear after deployment, when the filesystem becomes read-only. This often hides behind libraries that expect /tmp or user directories. The fix is to declare mounts and point libraries to those paths via environment variables.
Bloated Builds and Disk Pressure
Large node_modules, vendor directories, or oversized Docker-style assets can exceed the build or disk quota, leading to incomplete deployments, slow cold builds, and frequent GC pressure. Artifact pruning and dependency deduplication are mandatory at scale.
Relationship Name Drift
Renaming a relationship in services.yaml without updating code breaks boot. Because the failure appears as a generic connection error, teams can misdiagnose it as a network or credential issue. Always change code and YAML in a single PR with a coordinated deploy plan.
Routes: Redirect Loops and Overlapping Prefixes
Multiple routes with overlapping prefixes and conflicting redirects produce loops that surface only after the atomic switch. These errors evade local tests because router behavior is tightly coupled to Platform.sh's edge layer.
Workers and Cron: Undersized or Implicit
Background processing built into the web container can starve HTTP workers and trigger 502s under load. Cron tasks that rely on login shells or interactive environment setup can fail silently. Separate workers and make cron commands explicit.
Step-by-Step Fixes With Production Rigor
1) Define Mounts Explicitly and Redirect Writes
Declare durable, writable paths in .platform.app.yaml. Point libraries and frameworks to these mounts via env vars. Audit app code for implicit writes.
```yaml
# .platform.app.yaml (snippet)
name: app
type: "php:8.3"
disk: 2048

mounts:
    "/var":
        source: "local"
        source_path: "var"
    "/public/uploads":
        source: "local"
        source_path: "uploads"

runtime:
    extensions: ["redis", "pdo_pgsql"]

variables:
    env:
        APP_TMP: "/var/tmp"
        APP_UPLOADS: "/public/uploads"
```
2) Optimize Build Hooks for Repeatability and Speed
Move expensive steps to build hooks and cache aggressively. Use deterministic flags for Composer and npm to avoid network variance and non-reproducible lockfiles.
```yaml
# .platform.app.yaml (hooks)
hooks:
    build: |
        set -euxo pipefail
        composer install --no-dev --prefer-dist --no-interaction --optimize-autoloader
        npm ci
        npm run build
    deploy: |
        set -euxo pipefail
        php bin/console cache:warmup
        php bin/console doctrine:migrations:migrate --no-interaction
```
3) Composer and npm Caching
Cache directories across builds to prevent redundant downloads. For very large monorepos, consider partial installs and build artifacts to keep runtime lean.
```yaml
# .platform.app.yaml (caches)
variables:
    env:
        COMPOSER_CACHE_DIR: "/var/cache/composer"
        NPM_CONFIG_CACHE: "/var/cache/npm"

mounts:
    "/var/cache":
        source: "local"
        source_path: "cache"
```
4) Right-Size Concurrency
Balance worker count with CPU and memory quotas to reduce tail latency and OOM churn. Tune PHP-FPM, Node.js clustering, or Python WSGI workers with explicit limits.
```yaml
# PHP-FPM tuning via environment variables
variables:
    env:
        PHP_FPM_PM: "dynamic"
        PHP_FPM_MAX_CHILDREN: "12"
        PHP_FPM_START_SERVERS: "3"
        PHP_FPM_MIN_SPARE_SERVERS: "2"
        PHP_FPM_MAX_SPARE_SERVERS: "6"
```
```javascript
// Node.js cluster (index.js)
const cluster = require("cluster");
const http = require("http");
const os = require("os");

const cpus = Math.max(2, Math.min(8, os.cpus().length));

if (cluster.isPrimary) {
  for (let i = 0; i < cpus; i++) cluster.fork();
} else {
  const port = process.env.PORT || 8080;
  http.createServer((req, res) => { res.end("ok"); }).listen(port);
}
```
5) Separate Web and Worker Roles
Run workers as a second app to isolate CPU and memory from the web tier. Declare explicit relationships and queues, and render a minimal runtime image for each role.
```yaml
# .platform/applications.yaml (two apps)
applications:
    - name: web
      type: "php:8.3"
      relationships:
          redis: "cache:redis"
    - name: worker
      type: "php:8.3"
      relationships:
          redis: "cache:redis"
      web: false

# .platform/services.yaml
cache:
    type: redis:7.2
```
6) Make Cron Deterministic
Define cron jobs with explicit commands, no reliance on shell dotfiles, and idempotent logic. Add application-level locking to avoid duplicate runs after deploys or wake-ups.
```yaml
# .platform.app.yaml (crons)
crons:
    nightly:
        spec: "0 2 * * *"
        cmd: "php bin/console app:report:generate --no-interaction"
    queue:
        spec: "*/2 * * * *"
        cmd: "php bin/console messenger:consume async --time-limit=110"
```
7) Bulletproof Database Migrations
Never run destructive migrations inline with a traffic switch. Use feature toggles, online schema change tools, or phased rollouts with backward-compatible schemas. Place migrations in deploy hooks but gate them with environment checks.
```yaml
# Safe migration wrapper
hooks:
    deploy: |
        set -euo pipefail
        if [ "${PLATFORM_ENVIRONMENT_TYPE:-}" = "production" ]; then
            php bin/console doctrine:migrations:migrate --allow-no-migration --no-interaction
        else
            php bin/console doctrine:migrations:migrate --no-interaction
        fi
```
8) Route Hygiene: Timeouts, Headers, and Caching
Set realistic upstream timeouts and explicit caching behavior. Prevent redirect loops by centralizing canonical host redirects into a single highest-priority route entry.
```yaml
# .platform/routes.yaml
https://www.{default}/:
    type: upstream
    upstream: "app:http"
    cache:
        enabled: true
        headers: ["Accept", "Authorization"]
        default_ttl: 600
    upstream_timeout: 30
    redirects:
        insecure:
            strict: true

https://{default}/:
    type: redirect
    to: "https://www.{default}/"
```
9) Slim Down the Runtime Image
Ship only the assets needed to serve requests. Exclude tests, docs, and dev dependencies from the runtime. Large images degrade deploy time and increase memory pressure.
```json
// composer.json excerpt
{
    "scripts": {
        "post-install-cmd": ["composer dump-autoload -o"]
    },
    "config": {
        "platform": { "php": "8.3.0" },
        "preferred-install": "dist"
    },
    "require-dev": {}
}
```
10) Observability: Labels and Golden Signals
Instrument latency, throughput, error rate, and saturation. Add app-level labels with environment and commit info so you can correlate metrics with deployments. Anomalies often trace back to configuration churn.
```php
# Example log label injection (PHP)
$context = [
    "env" => getenv("PLATFORM_ENVIRONMENT"),
    "commit" => getenv("PLATFORM_TREE_ID"),
];
$logger->info("request_end", $context);
```
Deep Dives Into Tricky Failures
Failure: Deployment Stuck on Deploy Hook
Symptoms: Activity log hangs at a deploy hook; traffic not switched. Root cause: Non-idempotent scripts, missing exit codes, or interactive prompts. Fix: Make scripts non-interactive, enable set -euo pipefail, and redirect verbose output to logs.
```yaml
# Hardened deploy hook
hooks:
    deploy: |
        set -euo pipefail
        php bin/console app:warmup --no-interaction --verbose || { echo "Warmup failed"; exit 1; }
```
Failure: 502s After Route Changes
Symptoms: Users see 502/504 immediately after deploying new routes. Root cause: Upstream not reachable on the expected port, timeout too strict, or circular redirects. Fix: Verify the app listens on $PORT, ensure a single canonical redirect, and raise upstream_timeout above p99 latency.
```javascript
// Ensure app binds to $PORT (Node)
const port = process.env.PORT || 8080;
app.listen(port, () => console.log(`listening ${port}`));
```
Failure: Write Errors in Production
Symptoms: Exceptions citing read-only filesystem when generating thumbnails or cache files. Root cause: Libraries write to default OS paths. Fix: Redirect to declared mounts via env vars and runtime config.
```php
# Laravel example (config/filesystems.php)
"disks" => [
    "local" => [
        "driver" => "local",
        "root" => env("APP_UPLOADS", "/public/uploads"),
    ],
],
```
Failure: Random Connection Resets to DB
Symptoms: Intermittent DB connection failures under bursty load. Root cause: Excessive connection pools or long-lived idle connections exceeding service limits. Fix: Cap pool size, lower idle timeouts, and reuse connections.
```yaml
# Doctrine DBAL (Symfony)
doctrine:
    dbal:
        connections:
            default:
                driver: pdo_pgsql
                server_version: "16"
                url: "%env(resolve:DATABASE_URL)%"
                options:
                    pool_size: 10
```
Failure: Excessive Build Times After Adding a Frontend
Symptoms: Build time jumps from 4 to 20 minutes. Root cause: No caching and large artifact generation in the app container. Fix: Use npm ci, cache directories, and move the frontend to a separate build step that outputs a minimal artifact.
```yaml
# Build frontend into /public only
hooks:
    build: |
        npm ci
        npm run build
        rm -rf node_modules
        find public/assets -type f -name "*.map" -delete
```
Performance Engineering on Platform.sh
Control the Tail: p95 and p99
Upstream timeouts and autoscaling rules are ineffective if the application consistently pushes p99 latency beyond the edge timeout. Apply back-pressure and circuit breakers in code, and bleed off non-critical work to workers. Measure cold paths separately from hot cache hits.
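A quick way to make those tail percentiles concrete is a nearest-rank calculation over per-request latencies; the `seq` line below is stand-in data, and in practice you would feed in the latency field extracted from your access logs:

```shell
# Stand-in data: 100 requests with latencies 1..100 ms.
seq 1 100 > /tmp/latencies.txt

# Sort numerically, then take the nearest-rank 95th/99th percentile entries.
summary=$(sort -n /tmp/latencies.txt | awk '
  { a[NR] = $1 }
  END { printf "p95=%sms p99=%sms n=%d", a[int(NR * 95 / 100)], a[int(NR * 99 / 100)], NR }')
echo "$summary"  # → p95=95ms p99=99ms n=100
```

If the computed p99 sits near your route's upstream timeout, raising the timeout or shedding non-critical work matters more than tuning the average.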
Memory Pressure and GC
In memory-constrained containers, GC churn ruins latency. Reduce object allocation in hot loops, tune JIT or PHP opcache accordingly, and eliminate duplicate caches. A quick ps plus smem snapshot during peak reveals whether workers or background tasks dominate RSS.
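That snapshot might look like the following sketch; smem is often not preinstalled in application containers, so the second step is guarded:

```shell
# Top resident-memory consumers, largest first.
ps -eo pid,rss,comm --sort=-rss | head -10

# Proportional set size per process, if smem happens to be available.
if command -v smem >/dev/null 2>&1; then
  smem -r | head -10
else
  echo "smem not installed; RSS from ps above is the fallback signal"
fi
```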
Static Asset Strategy
Serve immutable, hashed assets with long TTLs via routes configuration and build-time hashing. Avoid sending cache-busting query strings; rely on content hashes in filenames.
```yaml
# Example static route with long TTL
https://www.{default}/assets/:
    type: upstream
    upstream: "app:http"
    cache:
        enabled: true
        default_ttl: 31536000
    headers:
        cache-control: "public, max-age=31536000, immutable"
```
Security and Compliance Considerations Affecting Operations
Secrets and Principle of Least Privilege
Use Platform.sh variables and relationships rather than embedding secrets in code or build logs. Rotate credentials during service upgrades or incident response. Ensure logs never echo secrets via set -x.
```yaml
# Define sensitive variables (redacted in logs)
variables:
    env:
        APP_KEY: "<generated>"
        MAILER_DSN: "smtp://user:pass@host:587"
```
Transport Security and Headers
Enforce HTTPS, HSTS, and modern security headers at the route level. This reduces application bloat and centralizes policy. Incorrect header ordering can break caching; set them intentionally.
```yaml
# Security headers in routes
https://www.{default}/:
    type: upstream
    upstream: "app:http"
    headers:
        strict-transport-security: "max-age=31536000; includeSubDomains; preload"
        x-frame-options: "SAMEORIGIN"
        x-content-type-options: "nosniff"
```
Testing and Release Practices That Prevent Firefighting
Environment Parity
Keep dev, staging, and prod aligned on service versions and route policies. Use branch-based environments to validate migrations and routes before merging. Drift is the enemy of predictable releases.
Contract Tests for Relationships
Write tests that assert the presence and structure of relationship JSON. Break the build if required keys are missing. This prevents late surprises during deploy.
```bash
# Assert relationship keys (bash + jq)
test -n "$PLATFORM_RELATIONSHIPS"
echo "$PLATFORM_RELATIONSHIPS" | base64 -d | jq -e '.database[0] | has("host") and has("password") and has("port")'
```
Feature Flags and Dark Launches
Gate risky features with flags so you can decouple deploy from release. Dark launch endpoints, then progressively enable features to targeted cohorts. Rollback becomes a configuration flip, not a redeploy.
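A flag gate can be as small as one environment variable read. FEATURE_NEW_CHECKOUT below is a hypothetical flag name; on Platform.sh you would typically manage it as an environment-level variable, so flipping it requires no redeploy:

```shell
# Read the hypothetical FEATURE_NEW_CHECKOUT flag, defaulting to off.
if [ "${FEATURE_NEW_CHECKOUT:-false}" = "true" ]; then
  flag_state="enabled"
else
  flag_state="disabled"
fi
echo "new checkout: $flag_state"
```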
Operational Runbooks and SRE Alignment
Golden Path Runbook
Define a minimal set of commands for triage: activity list, last deploy log, routes dump, health checks, service stats. Keep this runbook versioned with the repo and require on-call engineers to know it cold.
```bash
# quick.sh
platform activity:list --limit 3
platform activity:log $(platform activity:list --limit 1 --property id)
curl -sS https://YOUR-ROUTE/healthz -I
platform log --app app
```
Error Budgets and SLIs
Define SLIs for availability, latency, and deploy success rate. Tie error budgets to release velocity. If deploy success rate dips below threshold, freeze feature merges and fix build/deploy reliability first.
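The budget arithmetic is simple enough to keep in the runbook itself; as a sketch, assuming a 99.9% availability SLO over a 30-day window:

```shell
# A 99.9% SLO leaves 0.1% of the window as error budget.
# 30 days * 24 h * 60 min = 43200 minutes; 0.1% of that is the allowance.
budget_min=$(awk 'BEGIN { printf "%.1f", 30 * 24 * 60 * 0.001 }')
echo "monthly error budget: ${budget_min} minutes"  # → 43.2 minutes
```

When the current burn rate projects the budget exhausting before the window ends, that is the trigger for the merge freeze.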
Long-Term Best Practices
- Keep YAML small and explicit: Fewer surprises, easier reviews, faster incident response.
- Separate concerns: Distinct apps for web, workers, and scheduled jobs.
- Make hooks idempotent: Rerunning should never corrupt data or leave partial states.
- Observability by default: Include environment and commit labels in all logs and metrics.
- Guardrails: Pre-merge checks for routes, relationships, and cache sizes.
- Capacity planning: Review disk and memory usage monthly; tune worker counts based on real traffic.
Conclusion
Platform.sh provides powerful guardrails for repeatable builds and atomic deploys, but those same guardrails surface unique failure modes when teams misapply phases, misconfigure routes, or let relationship contracts drift. Senior practitioners can eliminate most firefighting by internalizing Platform.sh's immutable model, codifying mounts and caching, separating roles for web and workers, and hardening hooks. Pair that with production-grade observability and disciplined release practices, and Platform.sh becomes a reliable substrate for complex portfolios instead of an operational enigma.
FAQs
1. How do I safely run database migrations without causing downtime?
Make migrations backward-compatible, run them in deploy hooks with explicit non-interactive flags, and guard destructive changes behind feature toggles. Test the exact SQL against a staging environment with the same service versions and data shape before production.
2. Why do I get read-only filesystem errors after a successful deploy?
Because the application image is immutable at runtime; only declared mounts are writable. Redirect temp files, caches, and uploads to mounts via env vars and library configs to prevent sporadic failures.
3. What causes sudden 502s after changing routes.yaml?
Common culprits include circular redirects, upstream not listening on $PORT, or upstream timeouts set below p99 latency. Validate route precedence, ensure a single canonical redirect target, and raise timeouts to match measured service behavior.
4. How do I reduce build times for a polyglot monorepo?
Enable per-language caches, split web and worker apps, and produce minimal artifacts by pruning dev dependencies and source maps. Consider building frontend assets in a dedicated step and copying only hashed assets into the app image.
5. What's the fastest way to debug a failed deploy?
Fetch the activity log for the failed step, SSH into the environment to verify process health and mounts, and decode relationships to confirm credentials. Use a hardened runbook that collects these signals consistently to avoid ad hoc investigations.