Background and Context

What makes Platform.sh unique in enterprise settings

Platform.sh coordinates a full lifecycle: build containers, deploy hooks, immutable runtime images, and service clusters declared in YAML. This deterministic model reduces configuration drift and helps enforce reproducibility. At scale, a single mis-specified relationship, missing route upstream, or resource ceiling can cascade into widespread outages across many projects.

Typical enterprise failure signatures

  • Deploy hooks that succeed in small test branches but fail under production artifact sizes.
  • 502 or 503 at edge due to route misalignment, health probe timeouts, or process boot delays.
  • Persistent connection errors between application containers and backing services because of incorrect relationships or TLS modes.
  • Spiky CPU or memory use caused by worker mis-sizing, cron contention, or cache stampedes.
  • Slow builds triggered by redundant dependency resolution, cache invalidation, or NPM lockfile drift.

Architecture and Control Plane Fundamentals

The build→deploy→runtime pipeline

Platform.sh executes a build phase to assemble an immutable image, then a deploy phase to run post-build migrations and warmups, and finally a runtime phase that serves traffic. Understanding what is permitted in each stage is essential: service relationships are not available during build, while deploy has live relationships but still runs before traffic cutover. At runtime the container image is read-only except for the declared mounts.
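
A minimal sketch of how that separation looks in .platform.app.yaml (the commands are illustrative, not part of the examples later in this guide):

# .platform.app.yaml (illustrative fragment)
hooks:
  # Build: no service relationships; whatever this produces is baked into the image.
  build: |
    set -e
    composer install --no-dev --prefer-dist --no-interaction
  # Deploy: relationships are live, but traffic has not yet been cut over.
  deploy: |
    set -e
    php bin/console doctrine:migrations:migrate --no-interaction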

Key YAML descriptors

Three files govern most behavior:

  • .platform.app.yaml: application container definition, hooks, relationships, mounts, web process, crons.
  • .platform/services.yaml: managed service definitions (e.g., MySQL, PostgreSQL, Redis, Elasticsearch, RabbitMQ).
  • .platform/routes.yaml: inbound HTTP routing, caching, redirects, and upstream mapping to an app's web section.

Resource isolation and scaling implications

Each app container has explicit CPU, RAM, and persistent disk allocations. For multi-tenant or microfrontend architectures, plans must reflect peak concurrency plus headroom for deploy hooks, message workers, and batched crons. Under-provisioned memory or disk surfaces as OOM kills, failed composer installs, or migrations that halt mid-flight.

Diagnostics: A Systematic Workflow

Step 1: Capture environment state and recent activities

Use the CLI to snapshot state before making changes. This provides a reproducible trail that informs root cause analysis.

platform environment:info
platform activity:list --state=in_progress,completed,failed --limit=20
platform variable:list
platform services
platform routes
platform ssh -- php -v
platform ssh -- printenv | sort

Step 2: Inspect logs across layers

Aggregate logs by service and app. Focus on deploy hook stderr, web access logs, and database connection errors that correlate with traffic spikes or configuration changes.

platform log access
platform log error
platform log deploy
platform log app
platform ssh -- tail -n 200 /var/log/app.log

Step 3: Validate relationships and credentials

Incorrect relationship keys or endpoints are a top cause of broken connections. Confirm relationships exist and verify the injected environment variables inside the running container.

platform relationships
platform ssh -- cat /run/config.json | jq .relationships
platform ssh -- env | grep PLATFORM_RELATIONSHIPS

Step 4: Profile resource contention

When errors are load-related, measure CPU, memory, and disk IOPS. Look for OOM kills, swap thrash, or builds exceeding disk quotas.

platform ssh -- free -m
platform ssh -- df -h
platform ssh -- ps -eo pid,comm,%mem,%cpu --sort=-%mem | head
platform ssh -- dmesg | tail -n 100

Step 5: Reproduce and isolate

Recreate the failure in a temporary branch using the same routes and services. Use platform mount:download to reproduce with real data, and rerun hooks with verbose output.

git checkout -b hotfix/repro
git push platform HEAD:hotfix/repro
platform environment:activate
platform redeploy --verbose
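
If the failure depends on real data, the contents of a mount can be pulled down into the reproduction branch; the mount path below is an assumption taken from the examples later in this guide:

platform mount:download --mount public/files --target ./repro-data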

Common Failure Modes and Root Causes

1. Deploy hook fails with composer or npm out of memory

Symptoms: deploy activity fails, logs show ENOMEM, JavaScript heap out of memory, or Composer killed by OOM. Often coincides with larger lockfiles or monorepo growth.

Root cause: insufficient memory during deploy; mixing build-time dependency resolution into deploy; unnecessary dev dependencies installed in production.

Detection: inspect the hooks in .platform.app.yaml. If the deploy hook runs yarn install or composer install, memory demand spikes at the same moment the application boots.
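
A quick first check, assuming the hook output lands in the deploy log:

platform activity:list --limit=5
platform log deploy | grep -iE "killed|out of memory|enomem"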

2. 502/503 responses after successful deploy

Symptoms: edge returns 502/503, while activity shows success. Access logs reveal health probe failures or app not listening on expected port.

Root cause: mismatch between routes upstream and web: commands; long warmups in deploy delaying readiness; application binding to localhost instead of the provided socket.
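
To confirm the process is listening where the router expects, check the listening sockets and the port or socket the platform provides (PORT or SOCKET, depending on the configured socket_family):

platform ssh -- 'ss -ltn'
platform ssh -- 'echo "PORT=$PORT SOCKET=$SOCKET"'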

3. Cannot connect to database or cache

Symptoms: ECONNREFUSED, SSL routines errors, or timeouts during migrations.

Root cause: wrong relationship key, unhandled TLS requirement, exceeding connection limits, or migrations running before service is ready.

4. Slow builds on feature branches

Symptoms: build phase takes minutes longer than main branch, even for small diffs.

Root cause: cache busting due to moving lockfiles, unpinned base images, unnecessary asset pipelines executed on every build, or large mounts copied into build.

5. Cron or worker starvation

Symptoms: jobs miss SLAs, backlogs grow in queues, or batch windows overrun into peak traffic.

Root cause: crons scheduled with overlapping runtimes, long-lived workers sharing the same container resources as the web process, or insufficient concurrency.

Hands-on Diagnostics and Fixes

Fixing deploy OOM for Node or Composer

Shift dependency resolution to build, reduce dev artifacts, and tune memory flags. Ensure production installs exclude development dependencies and leverage deterministic lockfiles.

# .platform.app.yaml
build:
  flavor: none
hooks:
  build: |
    set -e
    corepack enable
    yarn install --frozen-lockfile --network-timeout 600000
    yarn build
    rm -rf node_modules/.cache
  deploy: |
    php artisan migrate --force || true

For Node builds that still exceed memory, raise the V8 memory ceiling.

export NODE_OPTIONS=--max-old-space-size=4096
yarn build

For Composer, avoid dev dependencies and increase process limits.

composer install --no-dev --prefer-dist --no-progress --no-interaction
php -d memory_limit=-1 bin/console cache:warmup

Eliminating 502/503 after deploy

Confirm the app binds to the provided socket and that routes target the correct upstream. Validate health checks by observing app startup time under deploy pressure.

# .platform.app.yaml (PHP example; PHP-FPM is managed by the platform, so no custom start command is needed)
web:
  locations:
    "/":
      root: "public"
      passthru: "/index.php"
      index: ["index.php"]
      scripts: true

# .platform/routes.yaml
https://{default}/:
  type: upstream
  upstream: "app:http"

For Node, ensure your HTTP server listens on 0.0.0.0:$PORT and not a hardcoded port.

// server.js
const http = require("http");
const port = process.env.PORT || 8080;
const host = "0.0.0.0";
http.createServer((req, res) => {
  res.end("ok");
}).listen(port, host, () => console.log(`listening on ${host}:${port}`));

Repairing broken relationships

Align relationship keys across the app and services. Test TLS and credentials by reading the injected JSON payload.

# .platform/services.yaml
db:
  type: postgresql:14
  disk: 2048
cache:
  type: redis:6
  configuration:
    maxmemory-policy: volatile-lru

# .platform.app.yaml
relationships:
  database: "db:postgresql"
  redis: "cache:redis"

Inside the container, decode the injected relationships and attempt a direct connection. PLATFORM_RELATIONSHIPS is base64-encoded JSON; the DATABASE_URL and REDIS_* variables below are assumptions that hold only if your application exports them (for example via a .environment file).

platform ssh -- 'echo "$PLATFORM_RELATIONSHIPS" | base64 --decode | jq .'
platform ssh -- 'psql "$DATABASE_URL" -c "SELECT now();"'
platform ssh -- 'redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" ping'

Accelerating slow builds

Pin versions, cache dependencies, and ensure build steps are idempotent. Move expensive tasks into artifacts that do not rebuild on every small change.

# Node: cache and deterministic installs
yarn install --frozen-lockfile --prefer-offline

# PHP: deterministic install that reuses the persistent build cache
export COMPOSER_CACHE_DIR="$PLATFORM_CACHE_DIR/composer"
composer install --no-dev --prefer-dist --no-progress --no-interaction

# .platform.app.yaml artifacts
variables:
  env:
    APP_ENV: "prod"
hooks:
  build: |
    set -e
    yarn install --frozen-lockfile
    yarn build
    mv build public/build
mounts:
  "public/files":
    source: local
    source_path: files

Unblocking cron and worker starvation

Separate workers into dedicated app containers or scale concurrency with environment variables. Stagger crons to avoid thundering herds.

# .platform.app.yaml (workers)
workers:
  queue-worker:
    commands:
      start: "php artisan queue:work --sleep=3 --tries=1"
crons:
  nightly-maintenance:
    spec: "0 2 * * *"
    commands:
      start: "php artisan app:maint --no-interaction"
  staggered-warmup:
    spec: "*/7 * * * *"
    commands:
      # {default} is only expanded in routes.yaml; substitute the real domain
      # or derive the URL from the PLATFORM_ROUTES variable at runtime.
      start: "curl -fsS https://{default}/health || true"

Advanced Pitfalls and How to Avoid Them

Immutable image misconceptions

Teams sometimes try to write to the application filesystem at runtime; outside declared mounts the image is read-only, so those writes fail or do not survive the next deploy. Declare explicit mounts for anything requiring writes, such as user uploads or runtime caches. Keep mounts lean to reduce I/O and backup volume.
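
A minimal sketch of declaring writable paths (the paths are illustrative):

# .platform.app.yaml
mounts:
  "var/cache":
    source: local
    source_path: cache
  "public/uploads":
    source: local
    source_path: uploads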

Route caching hazards

Advanced routing rules with long cache TTLs can serve stale content after a hotfix. Prefer short TTLs during incident response and add purge paths for high-churn endpoints like search results or personalized dashboards.
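
During incident response, a short TTL on the HTML route keeps hotfixes visible quickly; the values below are illustrative:

# .platform/routes.yaml
https://{default}/:
  type: upstream
  upstream: "frontend:http"
  cache:
    enabled: true
    default_ttl: 60
    cookies: ["session"]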

Service plan drift

As data grows, PostgreSQL and Elasticsearch plans must be revisited. If vacuum, checkpoint, or segment merges saturate IOPS, query latency spikes. Monitor bloat and reindex schedules; allocate more RAM to accommodate working sets.
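
Two quick checks before changing a plan, assuming a single PostgreSQL relationship as in the examples in this guide:

platform db:size
platform sql "SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"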

Database migrations in deploy

Migrations run in deploy before traffic cutover. Long migrations extend deploy time and risk lock contention. Break large migrations into multiple steps, add online migration patterns, and back up before schema changes.
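
One common online pattern is building indexes without long locks; a sketch run outside a transaction (table and index names are illustrative):

platform sql "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_created_at ON orders (created_at);"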

Multi-application routing

When routing between multiple apps (e.g., frontend and api), ensure each upstream advertises a valid health endpoint and termination behavior. Cross-app assumptions about cookies, headers, and HSTS can produce redirect loops.

Step-by-Step Incident Playbooks

Playbook A: 503s after redeploy

  1. Run platform activity:list and confirm the last deploy activity and duration.
  2. Check platform log error and platform log deploy for stack traces and boot delays (steps 1-4 are condensed in the sketch after this list).
  3. Validate routes.yaml and ensure upstream points to app:http.
  4. Open an SSH session and verify the process is listening on $PORT and responds to /health.
  5. Redeploy with increased verbosity; if warmup is slow, move it from web start to a background worker or cron.
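
A condensed pass through steps 1-4, for a runtime that listens on a TCP port (the /health path is an assumption):

platform activity:list --limit=5
platform log error
platform log deploy
grep -n "upstream" .platform/routes.yaml
platform ssh -- 'curl -s -o /dev/null -w "%{http_code}\n" "http://127.0.0.1:$PORT/health"'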

Playbook B: Database timeouts during traffic spikes

  1. Inspect connection counts and slow query logs (probes are sketched after this list); reduce ORM eager loading and batch writes.
  2. Enable application-level connection pooling and ensure idle timeouts match traffic behavior.
  3. Add Redis caching for hot reads; tune TTLs to balance freshness and load.
  4. Increase DB service plan temporarily; schedule a capacity review to right-size long term.
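
Connection-count and slow-query probes, assuming a PostgreSQL relationship and that the pg_stat_statements extension is enabled:

platform sql "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
platform sql "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"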

Playbook C: Build times balloon on feature branches

  1. Compare lockfiles and base images; pin versions and prune dev-only steps.
  2. Cache dependencies; avoid copying large mounts into build context.
  3. Split monolithic build into stages and reuse artifacts across apps where possible.

Configuration Patterns and Anti-patterns

Pattern: Least-privilege relationships

Expose only the services an app needs. This reduces credential surface area and simplifies rotation. For microservices, resist the urge to centralize all relationships in a single app.

Pattern: Explicit health endpoints

Provide a fast, dependency-light /health that exercises minimal code paths and reports readiness. Avoid querying databases for health unless a strong requirement exists.
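
A minimal, dependency-light readiness handler in the style of the earlier server.js sketch:

// server.js (excerpt)
const http = require("http");
http.createServer((req, res) => {
  if (req.url === "/health") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    return res.end("ok"); // no database or cache calls on the readiness path
  }
  // ...normal request handling...
  res.end("app");
}).listen(process.env.PORT || 8080, "0.0.0.0");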

Anti-pattern: Doing heavy work in web start

Heavy cache warmups or migrations in web: commands.start block readiness and degrade reliability. Move these duties into deploy hooks or crons, and cap their runtime.

Anti-pattern: Overloading a single app container

Combining web, workers, and scheduled jobs in one container complicates capacity planning and increases incident blast radius. Prefer separate app containers for independent scaling and isolation.

Security, Secrets, and Compliance Considerations

Secret management via environment variables

Secrets are injected via variables and relationships. Avoid echoing them in logs or storing them in mounts. Rotate regularly and provide application-level reload mechanics that do not require full redeploys.
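
Sensitive values can be created through the CLI so they never enter Git; the variable name below is illustrative:

platform variable:create --level environment -e production --name env:API_TOKEN --value "<token>" --sensitive true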

TLS and internal service connections

Some services require TLS for internal connections. Validate driver options and certificate handling during startup; fall back to explicit parameters when drivers guess incorrectly. Test via direct CLI connections inside the container.

Audit trails and change governance

Use activities and Git history as your compliance trail. Require peer review for YAML changes, enforce semantic commit messages that reference tickets, and tag releases to accelerate forensic analysis.

Performance Optimization Strategies

Right-size plans by workload profile

Measure p95 latency, queue depth, and GC time. Map these to CPU, memory, and IOPS characteristics rather than generic small/medium/large labels. Periodically re-baseline after schema or code changes.

Cache design

Introduce multi-tier caches: in-process for microsecond access, Redis for distributed sharing, and CDN directives in routes.yaml for static assets. Guard against stampedes with request coalescing and per-key locking.
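
A sketch of per-key locking with Redis to stop a stampede, assuming the phpredis extension and a recompute() callable (both are assumptions, not part of the earlier examples):

<?php
// Rebuild a hot key under a short lock so only one worker recomputes it.
function getWithLock(Redis $redis, string $key, callable $recompute, int $ttl = 300)
{
    $value = $redis->get($key);
    if ($value !== false) {
        return $value;
    }
    // NX + EX: only one caller acquires the lock; it auto-expires after 30 s.
    if ($redis->set("lock:$key", "1", ["nx", "ex" => 30])) {
        $value = $recompute();
        $redis->setex($key, $ttl, $value);
        $redis->del("lock:$key");
        return $value;
    }
    usleep(100000); // 100 ms backoff, then read whatever the winner wrote
    return $redis->get($key) ?: $recompute();
}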

Asset delivery and compression

Use long-lived static caching for versioned assets and short TTLs for HTML. Pre-compress assets at build time to reduce CPU load during spikes.
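
Pre-compression can be a single line at the end of the build hook (paths are illustrative):

# at the end of the build hook, after the asset build
find public/build -type f \( -name "*.js" -o -name "*.css" -o -name "*.svg" \) -exec gzip -k -9 {} \;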

Connection reuse and keep-alive

Enable HTTP keep-alive on upstream calls and monitor connection pools. For database connections, reuse rather than reconnect on every request.

End-to-End Example: Polyglot App with PHP API and Node Frontend

Service topology

A Node frontend serves static assets and proxies API calls to a PHP backend. PostgreSQL and Redis back the API; workers process queues. The goal is fast startup, isolated failure domains, and predictable deployments.

Minimal configuration

# .platform/services.yaml
db:
  type: postgresql:14
  disk: 4096
cache:
  type: redis:6

# .platform/routes.yaml
https://{default}/:
  type: upstream
  upstream: "frontend:http"
  cache:
    enabled: true
    default_ttl: 600
https://{default}/api:
  type: upstream
  upstream: "api:http"
  cache:
    enabled: false

# .platform.app.yaml (frontend)
name: frontend
type: nodejs:18
disk: 1024
build:
  flavor: none
hooks:
  build: |
    set -e
    corepack enable
    yarn install --frozen-lockfile
    yarn build
web:
  commands:
    start: "node server.js"
mounts:
  "public/uploads":
    source: local
    source_path: uploads

# .platform.app.yaml (api)
name: api
type: php:8.2
disk: 2048
relationships:
  database: "db:postgresql"
  redis: "cache:redis"
build:
  flavor: none
hooks:
  build: |
    set -e
    composer install --no-dev --prefer-dist --no-progress --no-interaction
  deploy: |
    set -e
    php bin/console doctrine:migrations:migrate --no-interaction --allow-no-migration
web:
  locations:
    "/":
      root: "public"
      passthru: "/index.php"
      scripts: true
workers:
  queue:
    commands:
      start: "php bin/console messenger:consume async --time-limit=3600 --memory-limit=192M"
crons:
  clear-cache:
    spec: "0 3 * * *"
    commands:
      start: "php bin/console app:cache:prune"

Verification checklist

  • Confirm each upstream's web start command is non-blocking and quick.
  • Ensure /health endpoints return within 100 ms and do not hit external dependencies.
  • Validate Redis and DB credentials via direct CLI probes during deploy.
  • Run controlled load tests post-deploy to confirm no OOM or CPU saturation.

Observability and SLO Guardrails

SLOs and burn-rate alerts

Define SLOs for availability and latency. Use burn-rate alerts that project SLO exhaustion windows, enabling early throttling, feature flags, or read-only modes during incidents.

Golden signals

Track latency, traffic, errors, and saturation. Couple Platform.sh logs with application metrics to correlate deploys with performance regressions. Keep dashboards per environment: production, staging, and hotfix branches.

Runbook automation

Codify repetitive recovery actions: cache purge, worker restarts, and route toggles. Store commands in version control so that incident commanders have one source of truth.

Long-term Remediation and Governance

Design for failure domains

Split frontends, APIs, and workers into separate app containers with independent scaling. Introduce circuit breakers between services. Use feature flags to decouple deployment from release.

Dependency and image hygiene

Pin base images and language runtimes. Periodically roll base images to pick up security patches and rebuild caches. Use lockfiles consistently; avoid mixing package managers per language.

Capacity planning and cost controls

Model peak load plus growth. Schedule load tests quarterly and after major feature changes. Mitigate cost spikes by rightsizing mounts, compressing logs, and avoiding unnecessarily large service plans.

Conclusion

Enterprise success on Platform.sh depends on strict separation of concerns across build, deploy, and runtime; precise relationships and routes; and observability that ties activities to end-user outcomes. Most hard issues trace back to a few classes of mistakes: doing heavy work in the wrong phase, underestimating resource ceilings, or mismatching upstream configuration. By applying the diagnostics and patterns outlined here, teams can recover quickly from incidents, eliminate repeat offenders, and lay down a governance foundation that scales across many applications and tenants.

FAQs

1. How do I quickly determine whether a failure is build, deploy, or runtime?

Check the latest activity log to classify the phase. Build errors reference dependency or artifact creation, deploy errors mention hooks and migrations, while runtime issues correlate with edge 502/503 and process readiness failures.

2. What is the safest way to run long database migrations?

Break migrations into smaller steps, favor online or concurrent variants, and run during low traffic windows. Add preflight checks, database backups, and strict timeouts so deploy does not block cutover indefinitely.

3. How can I verify service credentials without exposing secrets?

SSH into the container and use the injected environment variables or decoded relationships JSON for one-time probes. Avoid printing full URIs to shared logs; use targeted CLI checks and redact sensitive components.

4. When should I split workers from the web application?

Split whenever workers can consume substantial CPU or memory, or when backlogs can starve the web process. Isolating workers enables independent scaling, clearer SLOs, and simplified incident response.

5. How do I prevent route misconfigurations from causing outages?

Introduce a canary environment with the same routes and upstreams and automate regression tests against it. Require peer review for routes.yaml changes and validate health endpoints and cache directives before promoting to production.