GoCD Architecture Overview

Key Components

  • GoCD Server: Central coordination point that schedules jobs, manages pipelines, and communicates with agents.
  • GoCD Agents: Execute build/test/deploy tasks. Can be elastic (dynamically provisioned) or static.
  • Artifacts Repository: Stores build outputs shared across pipeline stages.
  • Plugins: Extend functionality (Docker, Kubernetes, SCMs, Secrets Management).

Pipeline Modeling

Pipelines in GoCD are explicitly modeled with materials (SCMs), stages, jobs, and tasks. This declarative modeling is powerful but sensitive to configuration changes, especially in dynamic environments.

Common Production-Level Issues in GoCD

1. Stuck or Blocked Pipelines

Often caused by agent starvation, upstream stage failures, or manual approval stages left unattended. In highly parallelized systems, queued jobs may wait indefinitely for compatible agents.

2. Elastic Agent Auto-Scaling Failures

Elastic agents (e.g., EC2, Kubernetes) may fail to register due to IAM permission issues, plugin misconfiguration, or resource limits on the underlying cloud platform.

3. Artifact Fetch Failures

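GoCD passes data between jobs and stages as artifacts, which are uploaded to the server (or to an external artifact store through a plugin) and fetched by downstream jobs. Incorrect paths, missing volume mounts, or purged artifacts cause the downstream fetch task to fail, and the stage that contains it fails with it.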

4. Plugin State Corruption

Faulty upgrades or improper shutdowns can leave plugin metadata in inconsistent states, leading to cryptic errors in secrets or SCM integrations.

Diagnostics and Troubleshooting

Step 1: Analyze Server and Agent Logs

Key files to inspect:

/var/log/go-server/go-server.log
/var/log/go-agent/go-agent.log
/var/log/go-server/plugin-<plugin-id>.log

Look for entries along these lines (exact wording varies by GoCD version and plugin):

[ERROR] Job hung due to missing artifact at path: artifacts/pipeline1/job1/output.zip
[WARN] Elastic plugin did not provision agent within timeout
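
Scanning a large log by eye is slow, so a short script can pull out the warning and error lines first. The sketch below is Python and assumes only that each relevant line contains a level token such as ERROR or WARN, bare or in brackets. The function name and the pattern are illustrative, and the regular expression may need adjusting to the layout of your own logs.

```python
import re
from collections import Counter

# Assumption: relevant lines contain the token ERROR or WARN.
# Adjust the pattern if your log layout differs.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN)\b")


def summarize_levels(lines):
    """Count ERROR/WARN lines and collect them for review."""
    counts = Counter()
    matches = []
    for line in lines:
        found = LEVEL_PATTERN.search(line)
        if found:
            counts[found.group(1)] += 1
            matches.append(line.rstrip("\n"))
    return counts, matches
```

To use it, pass an open file handle for go-server.log or go-agent.log as `lines`, then print the counts and the last few matches.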

Step 2: Verify Agent Registration and Status

GoCD Admin UI → Agents
-- Check for agents shown as LostContact, Missing, or Pending (registered but not yet enabled)
-- Ensure agent resources match job requirements

Also validate connectivity using:

telnet gocd-server 8153
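
Where telnet is not installed on the agent host, the same reachability check can be scripted. This is a minimal Python sketch using only the standard library. The function name is illustrative, and the host and port are the ones used in the command above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

For example, `can_connect("gocd-server", 8153)` returning False points to a network, DNS, or firewall problem rather than a GoCD configuration issue. A True result only shows that the port is open, not that the agent is registered.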

Step 3: Audit Plugin Integrity

Check plugin health via:

GoCD Admin UI → Plugins
-- Look for red status or version mismatches
-- Inspect plugin descriptor in plugin.xml

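Or query the plugin info API. The version in the Accept header depends on your GoCD release, and the request needs credentials with sufficient privileges when authentication is enabled:

curl http://localhost:8153/go/api/admin/plugin_info -H "Accept: application/vnd.go.cd.v4+json"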
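
The JSON returned by the plugin info endpoint can be filtered for plugins that did not load cleanly. The Python sketch below assumes the response lists plugins under `_embedded.plugin_info`, each with an `id` and a `status.state` field. That shape is an assumption to verify against the API documentation for your GoCD version, and the sample payload is hypothetical, not real server output.

```python
import json


def unhealthy_plugins(payload):
    """Return (plugin_id, state, messages) for every plugin not reported as active."""
    # Assumed response shape: {"_embedded": {"plugin_info": [...]}}
    plugins = payload.get("_embedded", {}).get("plugin_info", [])
    problems = []
    for plugin in plugins:
        status = plugin.get("status", {})
        state = status.get("state", "unknown")
        if state != "active":
            problems.append(
                (plugin.get("id", "<no id>"), state, status.get("messages", []))
            )
    return problems


# Hypothetical sample shaped like the assumed response.
sample = json.loads("""
{"_embedded": {"plugin_info": [
  {"id": "example.scm", "status": {"state": "active"}},
  {"id": "example.secrets", "status": {"state": "invalid", "messages": ["example message"]}}
]}}
""")
print(unhealthy_plugins(sample))
```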

Common Pitfalls to Avoid

  • Overloading pipelines with too many downstream dependencies without fan-in/fan-out control.
  • Hardcoding credentials into pipeline YAML instead of using secrets plugins.
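  • Mixing up resources and elastic profiles: a job targets static agents through resource tags or elastic agents through an elastic profile ID, not both, and confusing the two leads to jobs that no agent can pick up.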
  • Not version-locking plugins between environments (dev/stage/prod).
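
Version drift between environments is easy to detect once each environment's plugin list is available as a mapping of plugin ID to version. The Python sketch below shows one way to compare them. The environment names, plugin IDs, and version numbers are hypothetical, and how the lists are collected (for example from each server's plugin info API) is left open.

```python
def plugin_version_drift(environments):
    """Map plugin id -> {env: version} for plugins whose versions differ or are missing."""
    all_ids = set()
    for versions in environments.values():
        all_ids.update(versions)
    drift = {}
    for plugin_id in sorted(all_ids):
        # None marks a plugin that is absent from an environment.
        per_env = {env: versions.get(plugin_id) for env, versions in environments.items()}
        if len(set(per_env.values())) > 1:
            drift[plugin_id] = per_env
    return drift


# Hypothetical inventories for two environments.
environments = {
    "dev": {"example.k8s-elastic-agent": "4.0.0", "example.secrets": "1.2.0"},
    "prod": {"example.k8s-elastic-agent": "3.9.0", "example.secrets": "1.2.0"},
}
print(plugin_version_drift(environments))
```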

Step-by-Step Fixes

1. Resolve Agent Mismatch

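-- Check job resource tags:
resources: ["docker", "build"]

-- Ensure the agent advertises matching resources. For auto-registered static agents, set them in config/autoregister.properties before the agent first registers:
agent.auto.register.resources=docker,build

-- For agents that are already registered, edit their resources on the Agents page instead.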
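
Whether a static agent can pick up a job comes down to a set comparison: the agent must carry every resource the job requires. The Python sketch below illustrates that rule only. It ignores environments and elastic profiles, the case-insensitive comparison is an assumption, and the agent names are hypothetical.

```python
def compatible_agents(job_resources, agents):
    """Return names of agents whose resources cover everything the job requires."""
    # Assumption: resource names are compared case-insensitively.
    required = {r.strip().casefold() for r in job_resources}
    return [
        name
        for name, resources in agents.items()
        if required <= {r.strip().casefold() for r in resources}
    ]


# Hypothetical agents and their resource tags.
agents = {
    "agent-1": ["docker", "build"],
    "agent-2": ["docker"],
    "agent-3": ["Docker", "build", "linux"],
}
print(compatible_agents(["docker", "build"], agents))
```

An empty result for a job's resource list means the job will stay scheduled until an agent with the full set of tags appears.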

2. Recover from Plugin Failures

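-- Stop the server, then move the suspect plugin JAR out of the external plugins directory (the bundled directory is managed by the server itself):
mv /var/lib/go-server/plugins/external/<plugin>.jar /tmp/

-- Re-download the plugin from its official source, place it in plugins/external/, and restart the server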

3. Fix Artifact Fetch Errors

-- Ensure the publishing job declares the artifact (YAML config plugin syntax):
artifacts:
  - build:
      source: target/output.zip
      destination: output

-- Fetch it downstream using the path relative to the upstream job's artifact root:
tasks:
  - fetch:
      pipeline: pipeline1
      stage: build
      job: compile
      is_file: yes
      source: output/output.zip
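
The relationship between the publish source and destination and the fetch source can be checked mechanically. The Python sketch below assumes a plain (non-wildcard) source whose last path component is placed under the destination directory, as in the example above. The helper names are illustrative and not part of GoCD.

```python
import posixpath


def published_path(source, destination=""):
    """Path of the artifact inside the job's artifact area.

    Assumption: a plain (non-wildcard) source keeps its last path
    component and is placed under `destination`.
    """
    name = posixpath.basename(source.rstrip("/"))
    return posixpath.join(destination, name) if destination else name


def fetch_matches(publish_source, publish_destination, fetch_source):
    """True if the fetch source points exactly at the published artifact path."""
    expected = published_path(publish_source, publish_destination)
    return posixpath.normpath(fetch_source) == posixpath.normpath(expected)


print(published_path("target/output.zip", "output"))
```

A common mistake is to fetch with the build-time path (`target/output.zip`) rather than the published one. `fetch_matches` returns False for that case.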

Best Practices for CI/CD Resilience in GoCD

  • Implement pipeline templates and DRY configurations using the YAML config plugin and config repositories (pipelines as code).
  • Monitor server and agent metrics via Prometheus + GoCD exporter plugins.
  • Use elastic agents with strict idle timeout controls to avoid resource leaks.
  • Isolate pipeline groups by team or service boundary for better governance.
  • Run regular plugin compatibility audits before GoCD upgrades.

Conclusion

While GoCD offers high flexibility and visual traceability, its enterprise-level complexity demands strong operational discipline. Many CI/CD issues stem not from bugs but from configuration drift, plugin misalignment, and lack of observability. With structured diagnostics, logging discipline, and architectural clarity around pipelines and agents, teams can maintain a robust delivery pipeline that scales with growing application ecosystems.

FAQs

1. Why do some jobs hang indefinitely in GoCD?

This often results from no compatible agents being available. Ensure agent resources match the job's resource tags.

2. How can I debug plugin failures in GoCD?

Review the per-plugin log files (plugin-<plugin-id>.log in the server log directory, /var/log/go-server/ on Linux package installs) and check each plugin's status on the Plugins page in the Admin UI.

3. Can GoCD integrate with Kubernetes?

Yes, using the Kubernetes Elastic Agent Plugin, which provisions ephemeral agents in K8s pods with resource templates.

4. How do I prevent artifact-related pipeline failures?

Always verify artifact source and destination paths, and tune the server's artifact cleanup settings so that artifacts still needed by downstream pipelines are not purged.

5. What is the best way to scale GoCD pipelines?

Use pipeline templates, split pipelines into smaller units, and horizontally scale agents based on job concurrency patterns.