Background and Context
What makes Platform.sh unique in enterprise settings
Platform.sh coordinates a full lifecycle: build containers, deploy hooks, immutable runtime images, and service clusters declared in YAML. This deterministic model reduces configuration drift and helps enforce reproducibility. At scale, a single mis-specified relationship, missing route upstream, or resource ceiling can cascade into widespread outages across many projects.
Typical enterprise failure signatures
- Deploy hook failures that succeed in small test branches but fail under production artifact sizes.
- 502 or 503 at edge due to route misalignment, health probe timeouts, or process boot delays.
- Persistent connection errors between application containers and backing services because of incorrect relationships or TLS modes.
- Spiky CPU or memory use caused by worker mis-sizing, cron contention, or cache stampedes.
- Slow builds triggered by redundant dependency resolution, cache invalidation, or NPM lockfile drift.
Architecture and Control Plane Fundamentals
The build→deploy→runtime pipeline
Platform.sh executes a build phase to assemble an immutable image, then a deploy phase to run post-build migrations and warmups, and finally a runtime phase that serves traffic. Understanding what is permitted in each stage is essential: network egress is constrained in build, while deploy has live service relationships but still runs before traffic cutover. Runtime is read-only within the container image except for the declared mounts.
Key YAML descriptors
Three files govern most behavior:
- .platform.app.yaml: application container definition, hooks, relationships, mounts, web process, and crons.
- .platform/services.yaml: managed service definitions (e.g., MySQL, PostgreSQL, Redis, Elasticsearch, RabbitMQ).
- .platform/routes.yaml: inbound HTTP routing, caching, redirects, and upstream mapping to an app's web section.
Resource isolation and scaling implications
Each app container has explicit CPU, RAM, and persistent disk allocations. For multi-tenant or microfrontend architectures, plans must reflect peak concurrency plus headroom for deploy hooks, message workers, and batched crons. Under-provisioned memory or disk surfaces as OOM kills, failed composer installs, or migrations that halt mid-flight.
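As a rough sketch, the allocation knobs live in .platform.app.yaml; the size key and the values below are illustrative, and their availability and meaning depend on your plan and Platform.sh version.

```yaml
# .platform.app.yaml -- illustrative allocation sketch; adjust to your plan
name: app
type: "php:8.2"
disk: 2048              # persistent disk in MB; undersizing shows up as failed installs or halted migrations
size: L                 # relative CPU/RAM share where supported (assumption: varies by plan and version)
mounts:
  "var/cache":
    source: storage     # writable runtime data lives on mounts, not in the immutable image
    source_path: cache
```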
Diagnostics: A Systematic Workflow
Step 1: Capture environment state and recent activities
Use the CLI to snapshot state before making changes. This provides a reproducible trail that informs root cause analysis.
```bash
platform environment:info
platform activity:list --state=in_progress,completed,failed --limit=20
platform env:var:list
platform services
platform routes
platform ssh -- php -v
platform ssh -- printenv | sort
```
Step 2: Inspect logs across layers
Aggregate logs by service and app. Focus on deploy hook stderr, web access logs, and database connection errors that correlate with traffic spikes or configuration changes.
```bash
platform logs --type=access
platform logs --type=error
platform logs --type=deploy
platform logs --type=php
platform logs --type=app
platform ssh -- tail -n 200 /var/log/app.log
```
Step 3: Validate relationships and credentials
Incorrect relationship keys or endpoints are a top cause of broken connections. Confirm relationships exist and verify the injected environment variables inside the running container.
```bash
platform relationships
platform ssh -- cat /run/config.json | jq .relationships
platform ssh -- env | grep PLATFORM_RELATIONSHIPS
```
Step 4: Profile resource contention
When errors are load-related, measure CPU, memory, and disk IOPS. Look for OOM kills, swap thrash, or builds exceeding disk quotas.
```bash
platform ssh -- free -m
platform ssh -- df -h
platform ssh -- ps -eo pid,comm,%mem,%cpu --sort=-%mem | head
platform ssh -- dmesg | tail -n 100
```
Step 5: Reproduce and isolate
Recreate the failure in a temporary branch using the same routes and services. Use platform mount:download to reproduce with real data, and rerun hooks with verbose output.
```bash
git checkout -b hotfix/repro
git push platform HEAD:hotfix/repro
platform environment:activate
platform redeploy --verbose
```
Common Failure Modes and Root Causes
1. Deploy hook fails with composer or npm out of memory
Symptoms: deploy activity fails, logs show ENOMEM, JavaScript heap out of memory, or Composer killed by OOM. Often coincides with larger lockfiles or monorepo growth.
Root cause: insufficient memory during deploy; mixing build-time dependency resolution into deploy; unnecessary dev dependencies installed in production.
Detection: compare the .platform.app.yaml hooks. If deploy includes yarn or composer install, memory demand spikes alongside app boot.
2. 502/503 responses after successful deploy
Symptoms: edge returns 502/503, while activity shows success. Access logs reveal health probe failures or app not listening on expected port.
Root cause: mismatch between the routes upstream and the web commands; long warmups in deploy delaying readiness; or the application binding to localhost instead of the provided socket.
3. Cannot connect to database or cache
Symptoms: ECONNREFUSED, SSL routines errors, or timeouts during migrations.
Root cause: wrong relationship key, unhandled TLS requirement, exceeding connection limits, or migrations running before service is ready.
4. Slow builds on feature branches
Symptoms: build phase takes minutes longer than main branch, even for small diffs.
Root cause: cache busting due to moving lockfiles, unpinned base images, unnecessary asset pipelines executed on every build, or large mounts copied into build.
5. Cron or worker starvation
Symptoms: jobs miss SLAs, backlogs grow in queues, or batch windows overrun into peak traffic.
Root cause: crons scheduled with overlapping runtimes, long-lived workers sharing the same container resources as the web process, or insufficient concurrency.
Hands-on Diagnostics and Fixes
Fixing deploy OOM for Node or Composer
Shift dependency resolution to build, reduce dev artifacts, and tune memory flags. Ensure production installs exclude development dependencies and leverage deterministic lockfiles.
```yaml
# .platform.app.yaml
build:
  flavor: none            # dependencies are installed explicitly in the build hook below
hooks:
  build: |
    corepack enable
    yarn install --frozen-lockfile --network-timeout 600000
    yarn build
    rm -rf node_modules/.cache
  deploy: |
    php artisan migrate --force || true
```
For Node builds that still exceed memory, raise the V8 memory ceiling.
```bash
export NODE_OPTIONS=--max-old-space-size=4096
yarn build
```
For Composer, avoid dev dependencies and increase process limits.
```bash
composer install --no-dev --prefer-dist --no-progress --no-interaction
php -d memory_limit=-1 bin/console cache:warmup
```
Eliminating 502/503 after deploy
Confirm the app binds to the provided socket and that routes target the correct upstream. Validate health checks by observing app startup time under deploy pressure.
```yaml
# .platform.app.yaml (PHP example)
web:
  commands:
    start: |
      php-fpm -F
  locations:
    "/":
      root: "public"
      index: ["index.php"]
      scripts: true

# routes.yaml
https://{default}/:
  type: upstream
  upstream: "app:http"
```
For Node, ensure your HTTP server listens on 0.0.0.0:$PORT and not on a hardcoded port.
```javascript
// server.js
const http = require("http");
const port = process.env.PORT || 8080;
const host = "0.0.0.0";

http
  .createServer((req, res) => {
    res.end("ok");
  })
  .listen(port, host, () => console.log(`listening on ${host}:${port}`));
```
Repairing broken relationships
Align relationship keys across the app and services. Test TLS and credentials by reading the injected JSON payload.
```yaml
# .platform/services.yaml
db:
  type: postgresql:14
  disk: 2048
cache:
  type: redis:6
  configuration:
    maxmemory-policy: volatile-lru

# .platform.app.yaml
relationships:
  database: "db:postgresql"
  redis: "cache:redis"
```
Inside the container, check the decoded relationships and attempt a direct connection.
```bash
# PLATFORM_RELATIONSHIPS is base64-encoded JSON, so decode it before parsing
platform ssh -- php -r 'echo json_encode(json_decode(base64_decode(getenv("PLATFORM_RELATIONSHIPS")), true), JSON_PRETTY_PRINT);'
# single-quote the remote command so variables expand inside the container, not locally
platform ssh -- 'psql "$DATABASE_URL" -c "select now();"'
platform ssh -- 'redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASSWORD" ping'
```
Accelerating slow builds
Pin versions, cache dependencies, and ensure build steps are idempotent. Move expensive tasks into artifacts that do not rebuild on every small change.
```bash
# Node: cache and deterministic installs
yarn install --frozen-lockfile --prefer-offline

# PHP: Composer cache
composer install --no-dev --prefer-dist --no-progress --no-interaction
composer clear-cache
```

```yaml
# .platform.app.yaml artifacts
variables:
  env:
    APP_ENV: "prod"
hooks:
  build: |
    yarn install --frozen-lockfile
    yarn build
    mv build public/build
mounts:
  "public/files":
    source: storage
    source_path: files
```
Unblocking cron and worker starvation
Separate workers into dedicated app containers or scale concurrency with environment variables. Stagger crons to avoid thundering herds.
```yaml
# .platform.app.yaml (workers)
workers:
  queue-worker:
    commands:
      start: "php artisan queue:work --sleep=3 --tries=1"

crons:
  nightly-maintenance:
    spec: "0 2 * * *"
    commands:
      start: "php artisan app:maint --no-interaction"
  staggered-warmup:
    spec: "*/7 * * * *"
    commands:
      start: "curl -fsS https://{default}/health || true"
```
Advanced Pitfalls and How to Avoid Them
Immutable image misconceptions
Teams sometimes write to the application filesystem at runtime, which silently fails after redeploy because those changes are not persisted. Declare explicit mounts for anything requiring writes, such as user uploads or runtime caches. Keep mounts lean to reduce I/O and backup volume.
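A minimal sketch of such mounts, with illustrative paths:

```yaml
# .platform.app.yaml -- illustrative mounts for runtime writes
mounts:
  "public/uploads":     # user-generated files that must survive redeploys
    source: storage
    source_path: uploads
  "var/cache":          # disposable runtime cache, but it still needs a writable location
    source: storage
    source_path: cache
```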
Route caching hazards
Advanced routing rules with long cache TTLs can serve stale content after a hotfix. Prefer short TTLs during incident response and add purge paths for high-churn endpoints like search results or personalized dashboards.
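A hedged sketch of what that looks like in routes.yaml; the /search path and TTL values are illustrative:

```yaml
# .platform/routes.yaml -- incident-response caching sketch (paths and TTLs illustrative)
https://{default}/:
  type: upstream
  upstream: "app:http"
  cache:
    enabled: true
    default_ttl: 30       # short TTL while a hotfix settles
https://{default}/search:
  type: upstream
  upstream: "app:http"
  cache:
    enabled: false        # high-churn, personalized responses bypass the edge cache
```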
Service plan drift
As data grows, PostgreSQL and Elasticsearch plans must be revisited. If vacuum, checkpoint, or segment merges saturate IOPS, query latency spikes. Monitor bloat and reindex schedules; allocate more RAM to accommodate working sets.
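Disk is declared per service in services.yaml, so growth is at least a visible, reviewable change; CPU and memory come from the plan or size settings rather than this file. The values below are illustrative.

```yaml
# .platform/services.yaml -- revisiting the database allocation as data grows (values illustrative)
db:
  type: postgresql:14
  disk: 8192        # raised as the data and index working set expanded
```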
Database migrations in deploy
Migrations run in deploy before traffic cutover. Long migrations extend deploy time and risk lock contention. Break large migrations into multiple steps, add online migration patterns, and back up before schema changes.
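One hedged pattern: keep only fast, reversible schema steps in the deploy hook and push long backfills into a bounded cron; the app:backfill command below is a hypothetical placeholder.

```yaml
# .platform.app.yaml -- split fast schema changes from slow backfills (app:backfill is hypothetical)
hooks:
  deploy: |
    php bin/console doctrine:migrations:migrate --no-interaction --allow-no-migration
crons:
  backfill:
    spec: "*/15 * * * *"
    commands:
      start: "timeout 600 php bin/console app:backfill --batch-size=500"   # bounded so it never overruns into deploys
```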
Multi-application routing
When routing between multiple apps (e.g., frontend and api), ensure each upstream advertises a valid health endpoint and termination behavior. Cross-app assumptions about cookies, headers, and HSTS can produce redirect loops.
Step-by-Step Incident Playbooks
Playbook A: 503s after redeploy
- Run platform activity:list and confirm the last deploy activity and duration.
- Check platform logs --type=error --type=deploy for stack traces and boot delays.
- Validate routes.yaml and ensure the upstream points to app:http.
- Open an SSH session, verify the process is listening on $PORT, and confirm it responds to /health.
- Redeploy with increased verbosity; if warmup is slow, move it from web start to a background worker or cron.
Playbook B: Database timeouts during traffic spikes
- Inspect connection counts and slow query logs; reduce ORM eager loading and batch writes.
- Enable application-level connection pooling and ensure idle timeouts match traffic behavior.
- Add Redis caching for hot reads; tune TTLs to balance freshness and load.
- Increase DB service plan temporarily; schedule a capacity review to right-size long term.
Playbook C: Build times balloon on feature branches
- Compare lockfiles and base images; pin versions and prune dev-only steps.
- Cache dependencies; avoid copying large mounts into build context.
- Split monolithic build into stages and reuse artifacts across apps where possible.
Configuration Patterns and Anti-patterns
Pattern: Least-privilege relationships
Expose only the services an app needs. This reduces credential surface area and simplifies rotation. For microservices, resist the urge to centralize all relationships in a single app.
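A sketch using the topology from the end-to-end example later in this article: only the API declares database and cache relationships, and the frontend reaches the API over an app-to-app HTTP relationship (verify support for app-to-app relationships on your Platform.sh version).

```yaml
# .platform.app.yaml (api) -- only the API holds database and cache credentials
relationships:
  database: "db:postgresql"
  redis: "cache:redis"

# .platform.app.yaml (frontend) -- no backing-service credentials at all
relationships:
  api: "api:http"     # app-to-app HTTP relationship; availability is an assumption to verify
```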
Pattern: Explicit health endpoints
Provide a fast, dependency-light /health endpoint that exercises minimal code paths and reports readiness. Avoid querying databases for health unless a strong requirement exists.
Anti-pattern: Doing heavy work in web start
Heavy cache warmups or migrations in web: commands.start block readiness and degrade reliability. Move this work into deploy hooks or crons, and cap its runtime.
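A sketch of the shift, assuming a Symfony-style console command for warmup: the work moves into the deploy hook so the web start command stays non-blocking.

```yaml
# .platform.app.yaml -- warmup runs before cutover, not in the request path
hooks:
  deploy: |
    php bin/console cache:warmup    # assumes a Symfony-style console; substitute your framework's warmup
web:
  commands:
    start: "php-fpm -F"             # starts immediately; nothing heavy blocks readiness
```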
Anti-pattern: Overloading a single app container
Combining web, workers, and scheduled jobs in one container complicates capacity planning and increases incident blast radius. Prefer separate app containers for independent scaling and isolation.
Security, Secrets, and Compliance Considerations
Secret management via environment variables
Secrets are injected via variables and relationships. Avoid echoing them in logs or storing them in mounts. Rotate them regularly and provide application-level reload mechanics that do not require full redeploys.
TLS and internal service connections
Some services require TLS for internal connections. Validate driver options and certificate handling during startup; fall back to explicit parameters when drivers guess incorrectly. Test via direct CLI connections inside the container.
Audit trails and change governance
Use activities and Git history as your compliance trail. Require peer review for YAML changes, enforce semantic commit messages that reference tickets, and tag releases to accelerate forensic analysis.
Performance Optimization Strategies
Right-size plans by workload profile
Measure p95 latency, queue depth, and GC time. Map these to CPU, memory, and IOPS characteristics rather than generic small/medium/large labels. Periodically re-baseline after schema or code changes.
Cache design
Introduce multi-tier caches: in-process for microsecond access, Redis for distributed sharing, and CDN directives in routes.yaml for static assets. Guard against stampedes with request coalescing and per-key locking.
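For the edge tier, a hedged routes.yaml sketch that caches versioned assets aggressively and keys cached HTML on a single cookie; the path and cookie names are illustrative.

```yaml
# .platform/routes.yaml -- tiered edge caching sketch (path and cookie names illustrative)
https://{default}/assets/:
  type: upstream
  upstream: "app:http"
  cache:
    enabled: true
    default_ttl: 86400      # versioned static assets can live for a day or more
    cookies: []             # ignore cookies entirely for static assets
https://{default}/:
  type: upstream
  upstream: "app:http"
  cache:
    enabled: true
    default_ttl: 60
    cookies: ["session"]    # vary cached HTML only on the session cookie
```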
Asset delivery and compression
Use long-lived static caching for versioned assets and short TTLs for HTML. Pre-compress assets at build time to reduce CPU load during spikes.
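A minimal build-hook sketch that emits gzip siblings at build time; whether the runtime serves the precompressed files automatically depends on your web configuration, and the public/build path is illustrative.

```yaml
# .platform.app.yaml -- pre-compress built assets during build (path illustrative)
hooks:
  build: |
    yarn install --frozen-lockfile
    yarn build
    # emit .gz siblings so compression cost is paid once, at build time
    find public/build -type f \( -name '*.js' -o -name '*.css' \) -exec gzip -k -9 {} \;
```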
Connection reuse and keep-alive
Enable HTTP keep-alive on upstream calls and monitor connection pools. For database connections, reuse rather than reconnect on every request.
End-to-End Example: Polyglot App with PHP API and Node Frontend
Service topology
A Node frontend serves static assets and proxies API calls to a PHP backend. PostgreSQL and Redis back the API; workers process queues. The goal is fast startup, isolated failure domains, and predictable deployments.
Minimal configuration
```yaml
# .platform/services.yaml
db:
  type: postgresql:14
  disk: 4096
cache:
  type: redis:6

# .platform/routes.yaml
https://{default}/:
  type: upstream
  upstream: "frontend:http"
  cache:
    enabled: true
    default_ttl: 600
https://{default}/api:
  type: upstream
  upstream: "api:http"
  cache:
    enabled: false

# .platform.app.yaml (frontend)
name: frontend
type: nodejs:18
disk: 1024
hooks:
  build: |
    corepack enable
    yarn install --frozen-lockfile
    yarn build
web:
  commands:
    start: "node server.js"
mounts:
  "public/uploads":
    source: storage
    source_path: uploads

# .platform.app.yaml (api)
name: api
type: php:8.2
disk: 2048
relationships:
  database: "db:postgresql"
  redis: "cache:redis"
hooks:
  build: |
    composer install --no-dev --prefer-dist --no-progress --no-interaction
  deploy: |
    php bin/console doctrine:migrations:migrate --no-interaction --allow-no-migration
web:
  commands:
    start: "php-fpm -F"
workers:
  queue:
    commands:
      start: "php bin/console messenger:consume async --time-limit=3600 --memory-limit=192M"
crons:
  clear-cache:
    spec: "0 3 * * *"
    commands:
      start: "php bin/console app:cache:prune"
```
Verification checklist
- Confirm each upstream's web start command is non-blocking and quick.
- Ensure /health endpoints return within 100 ms and do not hit external dependencies.
- Validate Redis and DB credentials via direct CLI probes during deploy.
- Run controlled load tests post-deploy to confirm no OOM or CPU saturation.
Observability and SLO Guardrails
SLOs and burn-rate alerts
Define SLOs for availability and latency. Use burn-rate alerts that project SLO exhaustion windows, enabling early throttling, feature flags, or read-only modes during incidents.
Golden signals
Track latency, traffic, errors, and saturation. Couple Platform.sh logs with application metrics to correlate deploys with performance regressions. Keep dashboards per environment: production, staging, and hotfix branches.
Runbook automation
Codify repetitive recovery actions: cache purge, worker restarts, and route toggles. Store commands in version control so that incident commanders have one source of truth.
Long-term Remediation and Governance
Design for failure domains
Split frontends, APIs, and workers into separate app containers with independent scaling. Introduce circuit breakers between services. Use feature flags to decouple deployment from release.
Dependency and image hygiene
Pin base images and language runtimes. Periodically roll base images to pick up security patches and rebuild caches. Use lockfiles consistently; avoid mixing package managers per language.
Capacity planning and cost controls
Model peak load plus growth. Schedule load tests quarterly and after major feature changes. Mitigate cost spikes by rightsizing mounts, compressing logs, and avoiding unnecessarily large service plans.
Conclusion
Enterprise success on Platform.sh depends on strict separation of concerns across build, deploy, and runtime; precise relationships and routes; and observability that ties activities to end-user outcomes. Most hard issues trace back to a few classes of mistakes: doing heavy work in the wrong phase, underestimating resource ceilings, or mismatching upstream configuration. By applying the diagnostics and patterns outlined here, teams can recover quickly from incidents, eliminate repeat offenders, and lay down a governance foundation that scales across many applications and tenants.
FAQs
1. How do I quickly determine whether a failure is build, deploy, or runtime?
Check the latest activity log to classify the phase. Build errors reference dependency or artifact creation, deploy errors mention hooks and migrations, while runtime issues correlate with edge 502/503 and process readiness failures.
2. What is the safest way to run long database migrations?
Break migrations into smaller steps, favor online or concurrent variants, and run during low traffic windows. Add preflight checks, database backups, and strict timeouts so deploy does not block cutover indefinitely.
3. How can I verify service credentials without exposing secrets?
SSH into the container and use the injected environment variables or decoded relationships JSON for one-time probes. Avoid printing full URIs to shared logs; use targeted CLI checks and redact sensitive components.
4. When should I split workers from the web application?
Split whenever workers can consume substantial CPU or memory, or when backlogs can starve the web process. Isolating workers enables independent scaling, clearer SLOs, and simplified incident response.
5. How do I prevent route misconfigurations from causing outages?
Introduce a canary environment with the same routes and upstreams and automate regression tests against it. Require peer review for routes.yaml changes and validate health endpoints and cache directives before promoting to production.