The problem in context: long lived Yii workers that bloat and stall
Typical symptoms
Operations dashboards show a steady upward memory trend for queue listeners, occasionally followed by OOM kills. Database pools reveal dozens or hundreds of idle yet open connections from the same worker hosts. Meanwhile, job latency increases over hours of uptime, then drops after a process restart. Error logs intermittently show transaction deadlocks, lock wait timeouts, or transient cache misses that cascade into thundering herds.
Why this matters at scale
In enterprise settings, one failed worker can back up Kafka topics, saturate SQS/Redis/AMQP queues, and ultimately impact SLAs for downstream services. Memory bloat reduces node density and increases cost. Leaked connections degrade database throughput and jeopardize failover. Repeated deadlocks force retry storms that amplify the original incident.
Background: how Yii behaves outside the web request lifecycle
Application lifecycle differences
In web requests, Yii’s application object and most components are created, used, and destroyed per request. In a long lived CLI worker, the same application instance and its components persist across many jobs. Anything you retain in static properties, singletons, or closures can therefore accumulate state across hours or days.
DI container and singleton scoping
Yii’s DI container supports transient definitions and singletons. In a daemon, singletons live as long as the process. Accidentally registering request scoped collaborators—for example a repository that holds references to ActiveRecord models or an external client with large buffers—as singletons produces slow leaks.
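A minimal sketch of that scoping distinction in Yii's container; the service class names are illustrative:

```php
// Transient: a fresh instance per resolution, safe for request scoped collaborators.
\Yii::$container->set(\app\services\InvoiceExporter::class);

// Singleton: one instance that lives as long as the daemon; reserve for stateless, process wide services.
\Yii::$container->setSingleton(\app\services\MetricsClient::class);
```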
ActiveRecord, schema metadata, and PDO
ActiveRecord caches schema metadata to avoid repeated introspection. That is good for performance, yet it also means AR retains class level state. If you couple that with long running queries and many instantiated models per job, object graphs linger in memory until the garbage collector can reclaim them. Unclosed transactions and PDO statements can also keep connections pinned.
Queues, isolation, and forks
The yii2-queue extension can run jobs in the master process or isolate each job in a forked PHP process. Isolation trades a small startup cost for deterministic memory reclamation and clean component state. For truly heavy jobs, isolation is often the single most effective containment strategy.
Reproduce and measure: establish a clean baseline
Minimal controlled environment
Before changing production, reproduce the issue in a staging cluster with the same PHP version, opcache settings, and queue backend. Use production like database sizes and representative payloads. Disable nonessential noise so you can observe clear cause and effect.
Collect the right telemetry
- Per process memory: RSS and heap via ps, smem, and periodic memory_get_usage(true) logging.
- Open connections: database server views and lsof on the worker PID.
- Per job timings: queue start/finish timestamps, downstream RPC/HTTP timings, and SQL duration.
- GC activity: count of collected cycles and roots after each job.
- Deadlocks/retries: error codes and retry counts per job type.
Synthetic load generator
Create a synthetic producer that enqueues representative jobs at a controlled rate. Vary payload size, batch sizes, and concurrency. The goal is to map how memory and latency respond to volume and to specific job types.
Diagnostics playbook
Step 1: visualize per job growth
Instrument the worker to print memory and connection counts at the end of every job. The simplest graph—memory delta per job—quickly reveals which job classes exhibit monotonic growth versus sawtooth patterns.
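One way to get that graph, assuming the yii2-queue cli driver and a component named queue, is to log memory deltas from the worker's after-exec event; a rough sketch:

```php
use yii\queue\ExecEvent;
use yii\queue\Queue;

// Log memory after every job; connection counts can be appended from a server side query if needed.
\Yii::$app->queue->on(Queue::EVENT_AFTER_EXEC, function (ExecEvent $event) {
    static $lastMem = 0;
    $mem = memory_get_usage(true);
    \Yii::info(sprintf(
        'job=%s attempt=%d mem=%dKB delta=%+dKB',
        get_class($event->job),
        $event->attempt,
        $mem >> 10,
        ($mem - $lastMem) >> 10
    ), 'queue.profile');
    $lastMem = $mem;
});
```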
Step 2: watch database connections
If your pool of server side connections grows with worker uptime, you likely have unclosed transactions or orphaned PDO handles. Verify that every transaction path leaves the connection in a clean state and that you are not keeping AR models or Command objects alive across jobs.
Step 3: isolate leaks with GC and snapshots
Force a garbage collection cycle after each job and snapshot heap size. If the heap still increases monotonically, then references are retained somewhere. Inspect global singletons, static caches, and listeners that capture closures with large variables.
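A rough sketch of that snapshot, run at the end of each job (gc_status requires PHP 7.3 or later):

```php
gc_collect_cycles();          // force a cycle collection between jobs
$gc = gc_status();            // runs, collected, threshold, roots
\Yii::info(sprintf(
    'gc runs=%d collected=%d roots=%d heap=%dKB',
    $gc['runs'],
    $gc['collected'],
    $gc['roots'],
    memory_get_usage(true) >> 10
), 'queue.profile');
```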
Step 4: examine queries and object churn
Profiles that show millions of created objects per job usually point to AR heavy loops loading rows one by one. Replace eager AR creation with batch/each iterators and streaming transformations to keep live object counts low.
Step 5: identify long tail retries and deadlocks
Deadlocks and lock wait timeouts appear sporadically and then cluster under load. Add structured logs that capture SQLSTATE codes and number of attempts. This tells you whether retries are the cause of latency spikes and whether backoff is effective.
Common root causes in the wild
AR model retention across jobs
Repositories or services that cache AR instances in static properties or singletons keep entire object graphs alive. A few hundred models per job across thousands of jobs is enough to exhaust memory.
Unreleased PDO statements and transactions
Leaving a transaction open pins server resources and prevents connection reuse. Likewise, keeping a Command object with a large result set in scope prolongs memory retention and socket lifetime.
Mis scoped singletons
Registering connectors as singletons when they should be transient leads to stale sockets, stuck TLS sessions, or giant buffered responses that never shrink. In a web request, this would not matter; in a daemon, it persists indefinitely.
Event listeners and global state that accumulate
Adding listeners to global events on every job run without removing them duplicates callbacks. After hours of uptime, each event triggers N handlers, every one of them holding references to prior jobs’ data. Detaching before attaching, as sketched below, prevents the pile up.
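A minimal sketch of that guard; the handler object and method are illustrative:

```php
use yii\base\Event;
use yii\db\ActiveRecord;

// Remove any previously attached copy of this handler before attaching it again,
// so repeated job runs do not stack duplicate callbacks.
Event::off(ActiveRecord::class, ActiveRecord::EVENT_AFTER_INSERT, [$auditLogger, 'onInsert']);
Event::on(ActiveRecord::class, ActiveRecord::EVENT_AFTER_INSERT, [$auditLogger, 'onInsert']);
```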
Retry storms from deadlocks
Workers that catch deadlock exceptions and immediately retry without jitter synchronize across replicas. What begins as a single conflict turns into a burst of retries that keeps the hot rows hot.
Cache stampedes and stale on miss
When a frequently used key expires, dozens of concurrent jobs rebuild it simultaneously, all hammering the database. Without mutexes or soft TTLs, cache is not a protective layer; it amplifies load.
Step by step fixes
1) Contain the blast radius with process isolation
Run job isolation so each job executes in a separate PHP process. The master listener becomes a lightweight supervisor; memory and file descriptors are reclaimed by the OS after every job.
```bash
$ php yii queue/listen --isolate=1 --verbose=1 --sleep=1
```
For cron style execution, prefer queue/run with a bounded number of jobs per invocation to enforce periodic process renewal.
```bash
$ php yii queue/run --verbose=1
```
2) Repair database and transaction lifecycles
Ensure every transaction, even under exceptions, closes promptly. Close and reopen the connection at safe boundaries to guarantee fresh state. Avoid keeping AR instances with lazy relations alive after commit.
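A minimal sketch of a clean transaction boundary, plus recycling the connection between jobs:

```php
$db = \Yii::$app->db;
$transaction = $db->beginTransaction();
try {
    // ... business writes ...
    $transaction->commit();
} catch (\Throwable $e) {
    $transaction->rollBack();   // never leave the connection inside an open transaction
    throw $e;                   // let the queue record the failure and decide on retry
}

// At a safe job boundary, drop the server side session and any lingering statement state.
$db->close();
```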
3) Switch to batching and streaming in hot paths
Replace find()->all() with batch() or each(). Operate on scalar projections when possible to avoid AR overhead. Defer relationship loading; fetch only what you need.
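A rough sketch, with an illustrative Order model and a plain-array scan as the AR-free alternative:

```php
// Stream models in chunks instead of hydrating the whole result set at once.
foreach (Order::find()->where(['status' => 'pending'])->batch(500) as $orders) {
    foreach ($orders as $order) {
        // process one model, then let it go out of scope
    }
}

// Or skip ActiveRecord for the hottest scans and work on scalar rows.
foreach ((new \yii\db\Query())->from('order')->select(['id', 'total'])->each(1000) as $row) {
    // $row is a plain array; far less object churn
}
```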
4) Implement deadlock safe transactions with bounded retries
Wrap transaction bodies with a helper that recognizes deadlock/serialization errors and retries with exponential backoff and jitter. Keep the maximum attempts small and emit metrics for visibility.
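A sketch of such a helper; the function name, retry budget, and error code list are assumptions to adapt to your drivers:

```php
// Retries the transaction body on deadlock/serialization failures
// (SQLSTATE 40001/40P01, MySQL driver codes 1213/1205).
function withDeadlockRetry(callable $body, int $maxAttempts = 3)
{
    $attempt = 0;
    while (true) {
        try {
            return \Yii::$app->db->transaction($body);
        } catch (\yii\db\Exception $e) {
            $retryable = in_array($e->errorInfo[0] ?? null, ['40001', '40P01'], true)
                || in_array($e->errorInfo[1] ?? null, [1213, 1205], true);
            if (!$retryable || ++$attempt >= $maxAttempts) {
                throw $e;
            }
            \Yii::warning("Retryable conflict, attempt $attempt", 'queue');
            // Exponential backoff with full jitter so replicas do not retry in lockstep.
            usleep(random_int(0, (2 ** $attempt) * 100_000));
        }
    }
}

// Usage: withDeadlockRetry(function ($db) { /* conflicting writes */ });
```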
5) Prevent cache stampedes
Use getOrSet with a per key mutex so that only one worker rebuilds an expired item. Add jitter to TTLs to spread expirations over time. Consider soft TTLs with background refresh for hot keys.
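A sketch assuming a configured mutex component (for example yii\mutex\MysqlMutex); the function and lock names are illustrative:

```php
function rebuildOnce(string $key, callable $builder, int $ttl)
{
    $cache = \Yii::$app->cache;
    $value = $cache->get($key);
    if ($value !== false) {
        return $value;
    }
    // Only one worker rebuilds; the rest wait briefly, then re-read or build locally.
    if (\Yii::$app->mutex->acquire("rebuild:$key", 5)) {
        try {
            // Jittered TTL spreads expirations so hot keys do not all expire together.
            return $cache->getOrSet($key, $builder, $ttl + random_int(0, 60));
        } finally {
            \Yii::$app->mutex->release("rebuild:$key");
        }
    }
    return $builder();
}
```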
6) Redesign pagination for high volume scans
Offset based pagination on large tables grows slower over time and exacerbates deadlocks. Prefer keyset pagination that uses a stable ordered cursor, which is trivial to implement with an indexed column like id or a timestamp.
```php
// Keyset pagination: fetch the next window strictly after the last seen id
// (the table name and the start of the query are reconstructed; treat them as illustrative).
$rows = (new \yii\db\Query())
    ->from('events')
    ->where(['>', 'id', $cursor])
    ->orderBy(['id' => SORT_ASC])
    ->limit(1000)
    ->all();
$lastId = end($rows)['id'] ?? $cursor;
```
7) Make state disposable
Do not register request scoped services as singletons. Avoid static caches inside repositories/services used by workers. Where you must keep a singleton, give it an explicit reset() method called after each job to release references and close clients.
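A sketch of that shape; the class, its fields, and the after-exec hook are illustrative:

```php
final class ReportBuilder
{
    /** Per job caches that must not outlive the job. */
    private array $loadedModels = [];
    private ?object $httpClient = null;

    public function reset(): void
    {
        $this->loadedModels = [];   // release AR object graphs
        $this->httpClient = null;   // drop idle sockets and large buffered responses
    }
}

// In the worker bootstrap, after every job (assuming the service is registered as a singleton):
// \Yii::$app->queue->on(\yii\queue\Queue::EVENT_AFTER_EXEC, function () {
//     \Yii::$container->get(ReportBuilder::class)->reset();
// });
```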
8) Operational guardrails: health, rotation, and back pressure
Add a watchdog that exits the worker when RSS exceeds a configured ceiling or after N jobs. Rely on systemd/Supervisor/Kubernetes to restart cleanly. Implement queue back pressure by lowering concurrency when downstream errors rise.
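A sketch of such a watchdog for the in process listener; the job and memory budgets are placeholders:

```php
use yii\queue\Queue;

\Yii::$app->queue->on(Queue::EVENT_AFTER_EXEC, function () {
    static $jobs = 0;
    $overJobBudget = ++$jobs >= 500;
    $overMemBudget = memory_get_usage(true) > 256 * 1024 * 1024;
    if ($overJobBudget || $overMemBudget) {
        \Yii::warning("Rotating worker after $jobs jobs", 'queue');
        exit(0);   // clean exit; Supervisor/systemd/Kubernetes restarts a fresh process
    }
});
```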
Configuration hardening
Enable schema cache and tune durations
Ensure enableSchemaCache is on in both web and console configs. Use a shared cache component. Set a sensible schemaCacheDuration so metadata is stable during a deploy window but refreshes after migrations.
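A minimal console config sketch; the DSN and durations are placeholders:

```php
'components' => [
    'cache' => [
        'class' => \yii\caching\FileCache::class,   // or a shared Redis/Memcached cache
    ],
    'db' => [
        'class' => \yii\db\Connection::class,
        'dsn' => 'mysql:host=db;dbname=app',
        'enableSchemaCache' => true,
        'schemaCacheDuration' => 3600,              // stable across a deploy window
        'schemaCache' => 'cache',                   // the shared cache component above
    ],
],
```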
Queue listener flags and supervisors
Prefer --isolate=1 for heavy jobs. Under Supervisor or systemd, configure immediate restarts on exit code 0 to support intentional rotation. Add a short sleep to avoid hot spin if the queue is empty.
```ini
[program:yii-queue]
command=php /app/yii queue/listen --isolate=1 --sleep=1 --verbose=1
numprocs=4
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/queue.log
```
Graceful deployments around migrations
Pause workers before applying database migrations that alter hot tables, then resume with the new code. This prevents shape mismatches that would otherwise throw exceptions and leak partially initialized objects.
Performance optimization checklist
Use this quick list after stabilizing the worker:
- Turn on opcache for CLI if startup time dominates and code is immutable on the node.
- Replace AR with Query builders for large scans; hydrate DTOs instead of full models.
- Use partial indexes and covering indexes for the worker’s critical paths.
- Compress payloads in the queue if network dominates, but measure CPU impact.
- Batch external API calls; prefer idempotent bulk endpoints.
- Keep concurrency modest; parallelism amplifies deadlocks on the same hot rows.
- Persist idempotency keys to avoid duplicate side effects during retries.
- Emit cardinality bounded metrics per job type and error code for SLOs.
Pitfalls and anti patterns
- Global state in helpers that caches the last processed job or request context.
- Static registries of models for convenience debugging that never get cleared.
- Retry on any exception without classification or backoff.
- Offset pagination on write heavy tables, causing table scans and lock contention.
- Using a single massive cache key as a cross service rendezvous point.
- Running workers during schema migrations that add/drop columns used by hot queries.
Putting it all together: a resilient Yii worker skeleton
This skeleton shows key ideas: isolation friendly design, explicit cleanup, deadlock aware transactions, metrics, and bounded resource usage.
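A hedged sketch of such a skeleton follows; the job class, table names, and limits are illustrative, and withDeadlockRetry refers to the helper sketched in step 4.

```php
use yii\base\BaseObject;
use yii\db\Query;
use yii\queue\JobInterface;

final class SyncInvoicesJob extends BaseObject implements JobInterface
{
    // All job state lives in small serializable properties, so --isolate=1 works cleanly.
    public int $customerId;

    public function execute($queue)
    {
        $started = microtime(true);

        // Deadlock aware transaction around the write path.
        withDeadlockRetry(function ($db) {
            // Stream rows to keep the live object count low.
            foreach ((new Query())
                ->from('invoice')
                ->where(['customer_id' => $this->customerId, 'status' => 'pending'])
                ->each(500, $db) as $row) {
                // ... transform the row, accumulate bulk DML ...
            }
        });

        // Explicit cleanup so nothing lingers when the job runs in the master process.
        \Yii::$app->db->close();
        gc_collect_cycles();

        // Metrics per job type for SLOs.
        \Yii::info(sprintf(
            'job=%s duration=%.3fs mem=%dMB',
            static::class,
            microtime(true) - $started,
            memory_get_usage(true) >> 20
        ), 'queue.metrics');

        // Bounded resource usage: rotation is enforced by the guardrail 8 watchdog,
        // which exits the worker when memory or job count budgets are exceeded.
    }
}
```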
Long term design choices
Choose isolation by default for heavy jobs
If a job touches many rows, calls multiple services, or builds large DTO/AR object graphs, treat isolation as the default. Use in process execution only for very small and frequent tasks where startup cost is the bottleneck.
Segment queue topics by contention domain
Place jobs that write to the same tables or rows into the same topic and run them with limited concurrency. This reduces cross topic lock contention and makes back pressure easier.
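One way to realize this in yii2-queue is one component per contention domain; the driver and channel names below are illustrative:

```php
'components' => [
    'ordersQueue' => [
        'class' => \yii\queue\redis\Queue::class,
        'channel' => 'orders',     // jobs that write to the same hot order rows, low concurrency
    ],
    'reportingQueue' => [
        'class' => \yii\queue\redis\Queue::class,
        'channel' => 'reporting',  // read heavy jobs, safe to run with more workers
    ],
],
```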
Design idempotence from the start
Make job handlers safe to run multiple times. Store idempotency keys with expiry, use natural keys where possible, and avoid non deterministic side effects like random coupon assignment without a stable seed.
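A sketch of the key check inside a job handler; the table, column, and key format are illustrative:

```php
try {
    \Yii::$app->db->createCommand()->insert('idempotency_keys', [
        'idem_key'   => "invoice-sync:{$this->customerId}:{$this->day}",
        'expires_at' => time() + 86400,   // purge expired keys with a periodic cleanup job
    ])->execute();
} catch (\yii\db\IntegrityException $e) {
    return;   // a previous attempt already performed the side effect
}
// ... perform the side effect exactly once ...
```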
Prefer query builders and raw SQL for bulk paths
AR is ideal for complex business rules on single aggregates but imposes object overhead on bulk processing. For the 20 percent of paths that process 80 percent of the volume, use Query and bulk DML.
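For example, a sketch using the command layer; table and column names are illustrative:

```php
// Bulk insert rows accumulated from a streaming Query instead of saving models one by one.
\Yii::$app->db->createCommand()->batchInsert(
    'daily_totals',
    ['customer_id', 'day', 'total'],
    $rows
)->execute();

// Bulk update with a single statement rather than per-model save().
\Yii::$app->db->createCommand()->update(
    'invoice',
    ['status' => 'archived'],
    ['<', 'created_at', $cutoff]
)->execute();
```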
Bounded caches and explicit invalidation
Favor many small keys with targeted TTLs over a few giant aggregates. Provide explicit invalidation paths triggered by writes to keep read repair simple and predictable.
Testing and verification
Load tests that mimic diurnal patterns
Generate load with the same peaks, valleys, and payload mix you see in production. Many leaks show only after specific job orderings or when a rare path is executed.
Chaos drills for downstream dependencies
Introduce fail slow and fail fast modes in downstream services and databases. Verify that backoff, idempotence, and circuit breakers behave correctly and that workers shed load gracefully.
Canary and dark launch workers
Run a small percentage of jobs through a new worker version and compare metrics side by side with the stable cohort. Promote only when memory slope and latency distributions match or improve.
Conclusion
Memory growth, connection leaks, deadlocks, and cache stampedes in long lived Yii workers are not random acts. They are predictable outcomes of process persistence, mis scoped singletons, AR heavy loops, and insufficient isolation. By measuring per job resource usage, closing transactional boundaries, adopting isolation, batching, and deadlock aware retries, you can make Yii workers boringly reliable. Wrap that with operational guardrails—bounded RSS, rotation, graceful deploys, and canaries—and your queue backed services will scale cleanly with traffic and complexity.
FAQs
1. Should I always run yii2-queue with --isolate=1?
No. Isolation is the safest default for heavy/complex jobs, but for very small CPU bound tasks the fork cost may dominate. Measure both modes under realistic load and consider a hybrid approach by topic.
2. How do I find which service or singleton is leaking?
Add a reset() method to suspect services and call it after each job while logging memory deltas. Binary search by disabling one reset at a time; the one that changes the slope is your prime suspect.
3. Why does offset pagination make deadlocks worse?
Offset pagination forces the database to scan and skip growing numbers of rows, extending lock lifetimes and amplifying contention. Keyset pagination touches only the next contiguous window and keeps locks localized.
4. Is enabling opcache for CLI a good idea?
Often yes if code changes are deployed via new containers or if you restart workers on each deploy. It reduces startup overhead per isolated job. Avoid it if you hot edit code on the node during debugging.
5. What is the simplest production safe stop/start strategy?
Use a supervisor (systemd/Supervisor/Kubernetes) with health probes that fail on high RSS or repeated errors. Exit voluntarily on those conditions and let the supervisor restart a clean process with exponential backoff.