GoCD Architecture Overview
Key Components
- GoCD Server: Central coordination point that schedules jobs, manages pipelines, and communicates with agents.
- GoCD Agents: Execute build/test/deploy tasks. Can be elastic (dynamically provisioned) or static.
- Artifacts Repository: Stores build outputs shared across pipeline stages.
- Plugins: Extend functionality (Docker, Kubernetes, SCMs, Secrets Management).
Pipeline Modeling
Pipelines in GoCD are explicitly modeled with materials (SCMs), stages, jobs, and tasks. This declarative modeling is powerful but sensitive to configuration changes, especially in dynamic environments.
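To make the hierarchy concrete, here is a minimal Python sketch of how materials, stages, jobs, and tasks nest. The class names, fields, and the repository URL are illustrative assumptions, not GoCD's internal model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    tasks: List[str]  # commands run in order on a single agent
    resources: List[str] = field(default_factory=list)  # tags an agent must offer


@dataclass
class Stage:
    name: str
    jobs: List[Job]  # jobs within a stage are independent of each other


@dataclass
class Pipeline:
    name: str
    materials: List[str]  # e.g. SCM URLs that trigger the pipeline
    stages: List[Stage]   # stages run one after another


def job_paths(pipeline: Pipeline) -> List[str]:
    """Return 'pipeline/stage/job' identifiers for every job in the pipeline."""
    return [
        f"{pipeline.name}/{stage.name}/{job.name}"
        for stage in pipeline.stages
        for job in stage.jobs
    ]


example = Pipeline(
    name="pipeline1",
    materials=["https://example.com/repo.git"],  # hypothetical repository URL
    stages=[
        Stage("build", [Job("compile", ["make"], ["docker", "build"])]),
        Stage("test", [Job("unit", ["make test"]), Job("lint", ["make lint"])]),
    ],
)
```

Listing the job paths this way shows why a change to one stage or job name ripples into every downstream reference that uses the same pipeline/stage/job coordinates.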
Common Production-Level Issues in GoCD
1. Stuck or Blocked Pipelines
Often caused by agent starvation, upstream stage failures, or manual approval stages left unattended. In highly parallelized systems, queued jobs may wait indefinitely for compatible agents.
2. Elastic Agent Auto-Scaling Failures
Elastic agents (e.g., EC2, Kubernetes) may fail to register due to IAM permission issues, plugin misconfiguration, or resource limits on the underlying cloud platform.
3. Artifact Fetch Failures
GoCD passes build outputs between jobs and stages as artifacts. Incorrect paths, misconfigured volume mounts, or artifacts that have already been purged can cause downstream stages to fail.
4. Plugin State Corruption
Faulty upgrades or improper shutdowns can leave plugin metadata in inconsistent states, leading to cryptic errors in secrets or SCM integrations.
Diagnostics and Troubleshooting
Step 1: Analyze Server and Agent Logs
Key files to inspect:
/var/log/go-server/go-server.log
/var/log/go-agent/go-agent.log
/var/lib/go-server/plugins/logs/
Look for patterns like:
[ERROR] Job hung due to missing artifact at path: artifacts/pipeline1/job1/output.zip
[WARN] Elastic plugin did not provision agent within timeout
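Scanning for these patterns by hand gets tedious on large logs. The following is a minimal sketch that pulls ERROR and WARN entries out of a sequence of log lines. It assumes the bracketed severity format used in the sample lines; the actual layout depends on the server's logging configuration, so the regular expression may need adjusting. The INFO line in the sample is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches the bracketed severity shown in the sample lines. The pattern
# searches anywhere in the line, since real entries may start with a
# timestamp or other fields (an assumption about the log layout).
LEVEL_RE = re.compile(r"\[(ERROR|WARN)\]\s*(.*)")


def scan_log_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (level, message) pairs for every ERROR or WARN line."""
    hits = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


def summarize(hits: List[Tuple[str, str]]) -> Counter:
    """Count hits per severity level."""
    return Counter(level for level, _ in hits)


sample_lines = [
    "[ERROR] Job hung due to missing artifact at path: artifacts/pipeline1/job1/output.zip",
    "[WARN] Elastic plugin did not provision agent within timeout",
    "[INFO] Agent ping received",  # hypothetical line, ignored by the scanner
]
```

To run it against a real file, pass an open file handle, for example `scan_log_lines(open("/var/log/go-server/go-server.log"))`.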
Step 2: Verify Agent Registration and Status
GoCD Admin UI → Agents
-- Check for missing heartbeat or unknown status
-- Ensure agent resources match job requirements
Also validate connectivity using:
telnet gocd-server 8153
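Where telnet is not installed on the agent host, the same reachability check can be scripted. This is a small sketch using only Python's standard library; it confirms that a TCP connection can be opened and says nothing about whether the GoCD server behind the port is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket;
        # DNS failures, refusals, and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("gocd-server", 8153)` mirrors the telnet command, with the host name and port taken from that example.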
Step 3: Audit Plugin Integrity
Check plugin health via:
GoCD Admin UI → Plugins
-- Look for red status or version mismatches
-- Inspect plugin descriptor in plugin.xml
Or query plugin JSON endpoints:
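curl http://localhost:8153/go/api/admin/plugin_info -H "Accept: application/vnd.go.cd.v4+json"
The version in the Accept header depends on the GoCD release, and servers with authentication enabled also need credentials (for example via curl's -u option).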
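The JSON returned by the plugin endpoint can be filtered for anything that is not healthy. The sketch below assumes the response nests plugins under `_embedded.plugin_info` and that each entry carries a `status` object with a `state` and optional `messages`; treat that shape as an assumption and check it against the response from your own server. The plugin ids and message in the sample are hypothetical.

```python
import json
from typing import Dict, List


def unhealthy_plugins(payload: Dict) -> List[Dict[str, str]]:
    """List plugins whose reported state is anything other than 'active'."""
    plugins = payload.get("_embedded", {}).get("plugin_info", [])
    problems = []
    for plugin in plugins:
        status = plugin.get("status", {})
        state = status.get("state", "unknown")
        if state != "active":
            problems.append({
                "id": plugin.get("id", "<no id>"),
                "state": state,
                "messages": "; ".join(status.get("messages", [])),
            })
    return problems


# Hypothetical response body, trimmed to the fields this sketch reads.
sample = json.loads("""
{"_embedded": {"plugin_info": [
  {"id": "example.scm", "status": {"state": "active"}},
  {"id": "example.secrets",
   "status": {"state": "invalid",
              "messages": ["Plugin requires a newer GoCD version"]}}
]}}
""")
```

To use it with the curl command, save the response to a file and load it with `json.load` before passing it to `unhealthy_plugins`.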
Common Pitfalls to Avoid
- Overloading pipelines with too many downstream dependencies without fan-in/fan-out control.
- Hardcoding credentials into pipeline YAML instead of using secrets plugins.
- Misusing elastic profiles without resource tagging, leading to agent mismatch errors.
- Not version-locking plugins between environments (dev/stage/prod).
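The last pitfall, plugin versions drifting between environments, is easy to detect once each environment's plugin list has been collected. The following sketch compares such inventories; the environment names come from the list above, while the plugin ids and version strings are hypothetical.

```python
from typing import Dict, List, Tuple


def version_drift(
    envs: Dict[str, Dict[str, str]]
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (plugin_id, {env: version}) for plugins whose version differs
    between environments or that are missing from at least one of them."""
    all_ids = sorted({pid for plugins in envs.values() for pid in plugins})
    drift = []
    for pid in all_ids:
        versions = {
            env: plugins.get(pid, "missing") for env, plugins in envs.items()
        }
        if len(set(versions.values())) > 1:
            drift.append((pid, versions))
    return drift


# Hypothetical inventory: environment -> {plugin id: version}.
inventory = {
    "dev":   {"example.k8s": "4.0.0", "example.secrets": "1.2.0"},
    "stage": {"example.k8s": "4.0.0", "example.secrets": "1.2.0"},
    "prod":  {"example.k8s": "3.9.1", "example.secrets": "1.2.0"},
}
```

With this inventory, `version_drift(inventory)` reports only `example.k8s`, because prod runs a different version from dev and stage.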
Step-by-Step Fixes
1. Resolve Agent Mismatch
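-- Check job resource tags:
resources: ["docker", "build"]
-- Ensure the agent registers with matching tags. For auto-registered agents these are set in autoregister.properties:
agent.auto.register.resources=docker,build
-- For agents that are already registered, edit their resources on the Agents page instead.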
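The matching rule behind this fix is a subset check: a job can only run on an agent that offers every resource tag the job asks for. The sketch below applies that rule to an agent inventory. The agent names and the `gpu` tag are hypothetical, and tags are compared as exact strings after trimming whitespace.

```python
from typing import Dict, Iterable, List, Set


def compatible_agents(job_resources: Iterable[str],
                      agents: Dict[str, Set[str]]) -> List[str]:
    """Return agents whose resource tags include every tag the job requires."""
    required = {r.strip() for r in job_resources}
    return sorted(
        name for name, tags in agents.items()
        if required <= {t.strip() for t in tags}
    )


def missing_tags(job_resources: Iterable[str], agent_tags: Set[str]) -> Set[str]:
    """Return the required tags that a given agent does not provide."""
    return {r.strip() for r in job_resources} - {t.strip() for t in agent_tags}


# Hypothetical agent inventory: agent name -> resource tags.
agents = {
    "agent-01": {"docker", "build"},
    "agent-02": {"docker"},
    "agent-03": {"build", "docker", "gpu"},
}
```

An empty result from `compatible_agents` for a queued job is the signature of the indefinite wait described earlier, and `missing_tags` shows which tag to add to a given agent.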
2. Recover from Plugin Failures
-- Remove corrupted plugin files:
rm -rf /var/lib/go-server/plugins/bundled/
-- Re-download from official source and restart server
3. Fix Artifact Fetch Errors
-- Ensure correct publish step (schematic; the exact keys depend on the config format in use):
publish_artifact:
  source: target/output.zip
  destination: output
-- Fetch using the path relative to the upstream job's artifact root:
fetch_artifact:
  pipeline: pipeline1
  stage: build
  job: compile
  source: output/output.zip
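The relationship between the publish and fetch paths can be checked mechanically. This sketch assumes the published source names a single file (no wildcards) and that the file is stored under the destination directory with its original basename, as in the example above; the function names are hypothetical.

```python
import posixpath


def published_path(source: str, destination: str = "") -> str:
    """Path of a published file relative to the job's artifact root.

    Assumes `source` is a single file that is stored under `destination`
    with its original basename.
    """
    return posixpath.normpath(
        posixpath.join(destination, posixpath.basename(source))
    )


def fetch_matches(publish_source: str, publish_destination: str,
                  fetch_source: str) -> bool:
    """True when the fetch source points at the published file."""
    expected = published_path(publish_source, publish_destination)
    return posixpath.normpath(fetch_source) == expected
```

A common mistake this catches is fetching with the build-time path (`target/output.zip`) rather than the published one (`output/output.zip`).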
Best Practices for CI/CD Resilience in GoCD
- Keep configuration DRY by using pipeline templates and by storing pipeline definitions as YAML in config repositories.
- Monitor server and agent metrics via Prometheus + GoCD exporter plugins.
- Use elastic agents with strict idle timeout controls to avoid resource leaks.
- Isolate pipeline groups by team or service boundary for better governance.
- Run regular plugin compatibility audits before GoCD upgrades.
Conclusion
While GoCD offers high flexibility and visual traceability, its enterprise-level complexity demands strong operational discipline. Many CI/CD issues stem not from bugs but from configuration drift, plugin misalignment, and lack of observability. With structured diagnostics, logging discipline, and architectural clarity around pipelines and agents, teams can maintain a robust delivery pipeline that scales with growing application ecosystems.
FAQs
1. Why do some jobs hang indefinitely in GoCD?
This often results from no compatible agents being available. Ensure agent resources match the job's resource tags.
2. How can I debug plugin failures in GoCD?
Review plugin logs under /var/lib/go-server/plugins/logs/ and inspect plugin compatibility in the Admin UI.
3. Can GoCD integrate with Kubernetes?
Yes, using the Kubernetes Elastic Agent Plugin, which provisions ephemeral agents in K8s pods with resource templates.
4. How do I prevent artifact-related pipeline failures?
Always verify artifact source and destination paths, and configure artifact cleanup so that artifacts still needed by downstream stages are not purged before they are fetched.
5. What is the best way to scale GoCD pipelines?
Use pipeline templates, split pipelines into smaller units, and horizontally scale agents based on job concurrency patterns.