Understanding Spinnaker Architecture
Microservice-Oriented Design
Spinnaker is composed of multiple microservices, including Orca (orchestration engine), Clouddriver (cloud provider integration), Front50 (pipeline and application metadata storage), Echo (eventing and triggers), Gate (API gateway), and Fiat (authorization), all backed by Redis for state caching and work queues. These services communicate over HTTP and Redis-backed queues, so a failure or latency spike in any one component can cause cascading delays.
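On a Kubernetes-based install (Halyard's "distributed" deployment type), each of these services runs as its own pod, which makes it easy to see the moving parts at a glance. A minimal sketch, assuming the default spinnaker namespace and the spin-* naming Halyard uses:

```bash
# List the Spinnaker microservices and confirm they are all running.
# Assumes a distributed (Kubernetes) install in the "spinnaker" namespace.
# The label selector may vary by install method; drop "-l app=spin" if it
# returns nothing.
kubectl -n spinnaker get pods -l app=spin

# Show the backing Redis service the components point at (names vary by install).
kubectl -n spinnaker get svc | grep -i redis
```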
Pipeline Execution Flow
When a pipeline is triggered, Orca orchestrates stage execution, interacts with Clouddriver and Echo, and tracks state in Redis. Orca depends heavily on cached data, so stale or corrupted cache entries can cause incorrect behavior or frozen pipelines.
Common Symptoms
- Pipelines stuck in "Starting" or mid-stage execution
- UI shows incomplete or empty execution history
- Unexpected timeouts during deployment stages
- Manual execution works but automated triggers fail
- Clouddriver or Redis memory usage spikes
Root Causes
1. Redis Overload or Expired Pipeline State
Redis stores pipeline execution state and trigger events. High volume or improper TTLs can lead to expired keys or blocked queues, stalling pipelines.
2. Misaligned Microservice Versions
Upgrading Spinnaker services independently may introduce API incompatibilities (e.g., Orca expects a newer Clouddriver API). This leads to silent errors or malformed payloads.
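A quick way to spot drift on a Kubernetes install is to compare the container image tags each service is actually running. A hedged sketch (deployment names assume Halyard's spin-* convention):

```bash
# Print the image, and therefore the version tag, each Spinnaker service runs.
# Deployment names assume Halyard's spin-* naming; adjust for your install.
kubectl -n spinnaker get deploy \
  -o custom-columns='SERVICE:.metadata.name,IMAGE:.spec.template.spec.containers[0].image'
```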
3. Stale or Corrupted Cache in Clouddriver
Clouddriver maintains a cache of cloud resources. If cache refresh fails or lags, Orca may receive outdated data, causing delays in deployments or artifact resolution.
4. Broken Webhook or Trigger Configurations
Malformed trigger configurations can prevent pipelines from starting. Inconsistent webhook headers, payload mismatches, or missing authorization scopes cause silent trigger drops.
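You can often reproduce a silent trigger drop by replaying the webhook by hand against Gate while watching Echo's logs. A sketch, assuming Gate's standard /webhooks/webhook/{source} endpoint; the Gate URL, trigger source, and payload are placeholders for your own values:

```bash
# Replay a webhook trigger manually so you can watch Echo and Orca react.
# GATE_URL and the "my-source" trigger source are placeholders.
GATE_URL=https://spinnaker-gate.example.com
curl -i -X POST "$GATE_URL/webhooks/webhook/my-source" \
  -H "Content-Type: application/json" \
  -d '{"parameters": {"image_tag": "1.2.3"}}'
# A 200 response with no pipeline start usually points at a payload
# constraint mismatch or a missing permission rather than a transport issue.
```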
5. External Provider Rate Limits
Frequent polling (e.g., Kubernetes, AWS APIs) may hit rate limits, especially in large environments. This delays Clouddriver cache refresh and pipeline progress.
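Throttling usually shows up first in Clouddriver's logs. A rough sketch for spotting it on a Kubernetes install; the exact error strings vary by cloud provider, so treat the grep patterns as starting points:

```bash
# Look for signs of provider-side throttling in Clouddriver.
# Error strings differ per provider (e.g. AWS "Rate exceeded", generic HTTP 429),
# so extend the pattern list for your environment.
kubectl -n spinnaker logs deploy/spin-clouddriver --since=1h \
  | grep -Ei 'rate exceeded|throttl|too many requests|429'
```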
Diagnostics and Monitoring
1. Check Orca and Clouddriver Logs
Search for "Execution not found", "Stage failed to start", or "queue blocked" errors, and inspect log timestamps for gaps that indicate delays.
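For a Kubernetes install, a quick way to pull and filter these logs (deployment names assume Halyard's spin-* convention):

```bash
# Scan Orca and Clouddriver logs for the common failure signatures.
kubectl -n spinnaker logs deploy/spin-orca --since=2h \
  | grep -Ei 'execution not found|failed to start|queue'
kubectl -n spinnaker logs deploy/spin-clouddriver --since=2h \
  | grep -Ei 'error|timeout'
```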
2. Monitor Redis Metrics
Use Redis CLI or Prometheus exporters to watch memory usage, blocked clients, and key TTLs. Look for key evictions and latency spikes.
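A minimal sketch using the Redis CLI directly; it assumes network access to the Redis instance backing Spinnaker, with host and port as placeholders:

```bash
# Point these at the Redis instance backing Orca/Clouddriver.
REDIS_HOST=spinnaker-redis.example.com
REDIS_PORT=6379

# Memory pressure and eviction/expiry counters.
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO memory | grep -E 'used_memory_human|maxmemory_human'
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO stats  | grep -E 'evicted_keys|expired_keys'

# Clients blocked on queue operations, plus a rolling latency sample (Ctrl-C to stop).
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" INFO clients | grep blocked_clients
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" --latency-history -i 5
```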
3. Use Spinnaker's /health and /metrics Endpoints
Check the health status of each microservice and review internal metrics such as orca.queue.depth or clouddriver.cache.errors.
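A sketch for polling each service's health endpoint, assuming the default ports of a standard install and Kubernetes service DNS in the spinnaker namespace (run it from inside the cluster or port-forward first):

```bash
# Default ports on a standard install: Orca 8083, Clouddriver 7002,
# Echo 8089, Front50 8080, Gate 8084; adjust if yours differ.
for svc in orca:8083 clouddriver:7002 echo:8089 front50:8080 gate:8084; do
  name=${svc%%:*}
  port=${svc##*:}
  echo "== ${name} =="
  curl -s "http://spin-${name}.spinnaker:${port}/health"
  echo
done

# Pulling the internal metrics mentioned above; the exact metrics path can
# differ by Spinnaker version and monitoring setup.
curl -s "http://spin-orca.spinnaker:8083/metrics" | head
```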
4. Validate Cache State
Query Clouddriver's cache-backed endpoints (e.g., /applications/{app}/serverGroups) and compare the results with the actual cloud state.
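A sketch comparing Clouddriver's cached view with the live cluster, assuming a Kubernetes account, a port-forwarded Clouddriver on localhost:7002, and placeholder application/namespace names:

```bash
# 1. What Clouddriver's cache thinks exists for the application.
curl -s http://localhost:7002/applications/myapp/serverGroups | jq '.[].name'

# 2. What actually exists in the cluster (for a Kubernetes account).
kubectl -n myapp-namespace get replicasets,deployments

# Differences between the two lists usually mean the cache is lagging and a
# forced refresh (see the fix section below) is worth trying.
```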
5. Debug Failed Triggers via Echo
Echo logs show trigger events and webhook handling. Look for payload parse errors or authorization failures.
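A sketch for tailing Echo while you replay a webhook from another terminal (deployment name assumes Halyard's spin-* convention):

```bash
# Follow Echo's log and filter for trigger handling, parse errors,
# and authorization failures.
kubectl -n spinnaker logs -f deploy/spin-echo \
  | grep -Ei 'webhook|trigger|parse|unauthorized|forbidden'
```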
Step-by-Step Fix Strategy
1. Increase Redis Capacity and Set TTLs
Ensure Redis has sufficient memory and configure TTLs on execution state keys using orchestration.executionRepository.redis.ttlSeconds in Orca's configuration.
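With Halyard, this kind of setting goes into a custom profile for Orca. A minimal sketch, assuming the default ~/.hal/default/profiles path; the TTL value is illustrative, not a recommendation:

```bash
# Add an execution-state TTL to Orca via a Halyard custom profile.
# The property name mirrors the setting referenced above; tune the value
# to your retention needs.
cat >> ~/.hal/default/profiles/orca-local.yml <<'EOF'
orchestration:
  executionRepository:
    redis:
      ttlSeconds: 2592000   # 30 days (illustrative)
EOF
hal deploy apply
```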
2. Align Service Versions and Upgrade Cohesively
Always upgrade Spinnaker services in lockstep using Halyard or validated GitOps pipelines. Avoid partial upgrades unless testing in isolation.
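With a Halyard-managed install, upgrading everything together is a three-command operation. A sketch (the version number is a placeholder):

```bash
# See which Spinnaker releases are available, then move every service
# to the same release in one coordinated deploy.
hal version list
hal config version edit --version 1.30.1   # placeholder version
hal deploy apply
```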
3. Manually Refresh Clouddriver Cache
Force-refresh a stale cloud cache when discrepancies are suspected:

```
POST /cache/invalidate
{
  "provider": "kubernetes",
  "type": "serverGroups",
  "account": "my-account",
  "region": "my-region"
}
```
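As a curl sketch against a port-forwarded Clouddriver on localhost:7002; the endpoint and body mirror the example above, with the account and region values as placeholders:

```bash
# Ask Clouddriver to invalidate and re-fetch the cached server groups
# for one account/region. Values are placeholders for your own.
curl -s -X POST http://localhost:7002/cache/invalidate \
  -H "Content-Type: application/json" \
  -d '{
        "provider": "kubernetes",
        "type": "serverGroups",
        "account": "my-account",
        "region": "my-region"
      }'
```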
4. Fix or Recreate Broken Triggers
Delete and recreate misfiring webhooks. Test payload formats and use echo.debug.enabled=true to trace event reception.
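The debug flag above can be set through a Halyard custom profile for Echo. A sketch, assuming the default profiles path and translating the property into its YAML form:

```bash
# Turn on Echo trigger/event debugging via a custom profile.
# Property name is the one referenced above; remove it again once the
# investigation is done, since debug logging is verbose.
cat >> ~/.hal/default/profiles/echo-local.yml <<'EOF'
echo:
  debug:
    enabled: true
EOF
hal deploy apply
```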
5. Scale Out Clouddriver and Orca Replicas
Horizontal scaling of bottleneck services helps distribute queue load and reduce delay in large deployments.
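On Kubernetes you can scale the deployments directly, but for a Halyard-managed install the replica counts should also live in Halyard's configuration so the next deploy does not undo them. A sketch of the quick, imperative version (deployment names assume spin-* naming; replica counts are placeholders):

```bash
# Temporarily add replicas to the usual bottlenecks.
kubectl -n spinnaker scale deploy/spin-clouddriver --replicas=3
kubectl -n spinnaker scale deploy/spin-orca --replicas=3

# Note: with a Halyard-managed install, persist the sizing in Halyard's
# deployment configuration as well, or a later "hal deploy apply"
# will reset the replica counts.
```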
Best Practices
- Use Redis in dedicated, monitored clusters with alerts on eviction and memory usage
- Define sane TTLs for pipelines to prevent Redis saturation
- Keep Spinnaker service versions aligned and tracked via Git
- Use built-in cache refresh APIs instead of restarting services
- Regularly audit triggers and webhook integrations for format drift
Conclusion
Spinnaker provides powerful deployment automation, but its microservice complexity requires precise coordination and monitoring. Pipeline freezes, execution gaps, and stale UI states often trace back to Redis exhaustion, service version drift, or cache latency. By systematically inspecting logs, metrics, and configuration, teams can stabilize Spinnaker pipelines and maintain high delivery velocity across environments.
FAQs
1. Why is my pipeline stuck in "Starting"?
Orca may be unable to fetch the execution context from Redis, or the pipeline trigger may have expired. Check Orca logs and Redis health.
2. Can Clouddriver cache be refreshed without restarting?
Yes. Use Clouddriver's cache invalidate API to refresh specific resources like server groups or load balancers.
3. How do I monitor failed webhook triggers?
Enable debug logging in Echo and inspect incoming event logs. Verify webhook payload format and authentication headers.
4. What’s the ideal Redis setup for Spinnaker?
Use a dedicated Redis instance (preferably Redis Sentinel or clustered) with persistence disabled and metrics monitoring enabled.
5. Is it safe to scale Spinnaker services independently?
Yes, but upgrade all core services together. You can horizontally scale services like Orca and Clouddriver to handle load better.