Resolving Build Queue Congestion and Agent Starvation in TeamCity CI/CD

Category: CI/CD (Continuous Integration/Continuous Deployment). By Mindful Chase, 31 Jul.

In enterprise DevOps pipelines, TeamCity is a widely adopted CI/CD tool valued for its flexibility, plugin ecosystem, and integrations. As organizations scale and build configurations grow more complex, however, subtle problems emerge that can cripple deployment velocity. One frequently overlooked but critical issue is build queue congestion and agent starvation in high-concurrency pipelines. It is easy to mistake for general performance degradation, but it usually stems from architectural misconfiguration, insufficient resource planning, or floods of VCS-triggered builds. This article covers how to diagnose, mitigate, and permanently resolve the problem at scale.

Background: Understanding TeamCity Build Queuing

How TeamCity Handles Concurrent Builds

TeamCity assigns builds to agents via a build queue. A queued build starts only when an agent is idle, enabled, authorized, in a pool assigned to the build's project, and compatible with the build's agent requirements. If no such agent is available, the build stays in the queue, often for long periods. Compatibility checks, agent pools, build priorities, and triggers all influence queue dynamics. In large environments, queue buildup can lead to:

Inconsistent build times
Blocked deployments
Increased infrastructure costs from agents that sit idle because they cannot run the queued builds

Root Causes of Build Queue Congestion

Insufficient build agents or poorly distributed agent pools
Builds requiring exclusive agents or special capabilities
VCS polling or trigger floods due to frequent commits
Unbounded or recursive trigger chains
Resource leaks in long-running builds

Architectural Considerations

Agent Pools and Specialization

Assigning all builds to a generic agent pool often leads to starvation when specialized builds require capabilities (e.g., Docker, Android SDK) that only a subset of agents provide: those builds wait for the few capable agents while generic builds occupy them. Define explicit agent requirements in each build configuration and segment agents into pools by capability.
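To make the starvation mechanism concrete, here is a minimal Python model of requirement matching. It is a sketch under stated assumptions, not TeamCity's implementation: the `Agent` and `Requirement` classes, the agent names, and the parameter values are hypothetical, and only the "exists" and "equals" requirement types are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    # Agent parameters as a flat name -> value map (hypothetical values).
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Requirement:
    param: str
    kind: str        # "exists" or "equals" in this simplified model
    value: str = ""


def satisfies(agent, req):
    """Return True if the agent meets a single requirement."""
    if req.kind == "exists":
        return req.param in agent.params
    if req.kind == "equals":
        return agent.params.get(req.param) == req.value
    raise ValueError(f"unsupported requirement kind: {req.kind}")


def compatible_agents(agents, requirements):
    """Names of agents that meet every requirement of a build."""
    return [a.name for a in agents
            if all(satisfies(a, r) for r in requirements)]


agents = [
    Agent("linux-docker-1", {"teamcity.agent.jvm.os.name": "Linux",
                             "env.DOCKER_HOST": "unix:///var/run/docker.sock"}),
    Agent("linux-plain-1", {"teamcity.agent.jvm.os.name": "Linux"}),
    Agent("windows-1", {"teamcity.agent.jvm.os.name": "Windows Server 2019"}),
]
docker_build = [
    Requirement("env.DOCKER_HOST", "exists"),
    Requirement("teamcity.agent.jvm.os.name", "equals", "Linux"),
]

print(compatible_agents(agents, docker_build))  # ['linux-docker-1']
```

With three agents in one pool, the Docker build can run on only one of them. If generic builds are free to occupy that agent, the Docker build starves even though the pool looks underused.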

Trigger Optimization

Unchecked VCS triggers—especially in monorepos—cause cascading build initiations that quickly saturate the queue. Using branch filters, file path rules, and quiet periods can help prevent this.
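The effect of a quiet period can be illustrated with a small Python model. It assumes, as a simplification, that without a quiet period every detected commit would queue its own build, and that with one a build is queued only after no new commit has arrived for the whole period. The timestamps are made up, and the real trigger also depends on the VCS polling interval and queue optimization.

```python
def builds_triggered(commit_times, quiet_period):
    """Count builds queued for commit timestamps (in seconds).

    Simplified model: commits separated by gaps of at most
    `quiet_period` seconds are coalesced into a single build.
    """
    builds = 0
    last = None
    for t in sorted(commit_times):
        # A new build is needed only when the gap since the previous
        # commit exceeds the quiet period.
        if last is None or t - last > quiet_period:
            builds += 1
        last = t
    return builds


# Hypothetical burst of commits: seconds since the first one.
commits = [0, 5, 12, 20, 90, 95, 300]

print(builds_triggered(commits, quiet_period=0))   # 7
print(builds_triggered(commits, quiet_period=30))  # 3
```

In this example a 30-second quiet period turns seven builds into three, at the cost of delaying each build by up to the quiet period.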

Diagnostics

Identifying Build Starvation

Open the Build Queue page in the TeamCity UI and look for builds marked "Waiting for Compatible Agent". The following sources help explain why a build is stuck:

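Agents > select an agent > Compatible Configurations tab, which lists the configurations the agent cannot run and the requirements it fails to meet
Agent side: logs/teamcity-agent.log for runtime errors, and conf/buildAgent.properties for the agent's configured parameters (a configuration file, not a log)
Server logs: teamcity-server.log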

CLI and REST API Debugging

curl -u username:token https://teamcity.example.com/httpAuth/app/rest/buildQueue
curl -u username:token https://teamcity.example.com/httpAuth/app/rest/agents

These endpoints return the queued builds and the registered agents, so queue depth and agent availability can be inspected from scripts.
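As a sketch of what such a script might look like, the Python snippet below summarizes queue depth per build configuration. It assumes the buildQueue endpoint was called with an `Accept: application/json` header (the REST API returns XML by default) and that the response holds a `build` array whose items carry a `buildTypeId`. The sample payload and its IDs are hypothetical.

```python
import json
from collections import Counter


def queue_depth_by_build_type(queue_payload):
    """Count queued builds per build configuration ID."""
    builds = queue_payload.get("build", [])
    return Counter(b.get("buildTypeId", "unknown") for b in builds)


# Hypothetical response body from the buildQueue endpoint.
sample = json.loads("""
{
  "count": 3,
  "build": [
    {"id": 101, "buildTypeId": "Backend_Build", "state": "queued"},
    {"id": 102, "buildTypeId": "Mobile_AndroidBuild", "state": "queued"},
    {"id": 103, "buildTypeId": "Backend_Build", "state": "queued"}
  ]
}
""")

depth = queue_depth_by_build_type(sample)
for build_type, count in depth.most_common():
    print(f"{build_type}: {count} queued")
```

Run on a schedule, a summary like this shows which configurations dominate the queue, which is usually the first clue to a missing capability or an overactive trigger.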

Step-by-Step Resolution

1. Segment Agent Pools

Go to Administration > Agent Pools and:

Create pools based on platform, language, or tools
Assign projects to pools explicitly (pools are associated with projects, not with individual build configurations)

2. Define Build Requirements

In each build configuration:

Agent Requirements > Add new requirement
Requirement: "env.DOCKER_HOST" exists
Requirement: "teamcity.agent.jvm.os.name" equals "Linux"

3. Optimize Triggers

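Trigger Settings:
- Branch filter: +:main (branch filters match logical branch names, so the exact value depends on the VCS root's branch specification)
- Trigger rules, one per line: +:src/** and -:docs/**
- Quiet period: 30 seconds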
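The intent of the path rules can be approximated in a few lines of Python. This is a deliberately simplified model: it uses `fnmatch`-style globbing rather than TeamCity's own wildcard matching, ignores the user, root, and comment qualifiers that trigger rules also support, and treats any exclude as overriding any include. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_rules(text):
    """Split '+:pattern' / '-:pattern' tokens into include and exclude lists."""
    includes, excludes = [], []
    for token in text.split():
        sign, _, pattern = token.partition(":")
        if sign == "+":
            includes.append(pattern)
        elif sign == "-":
            excludes.append(pattern)
        else:
            raise ValueError(f"unrecognized rule: {token}")
    return includes, excludes


def should_trigger(changed_files, rules_text):
    """True if at least one changed file is relevant under the rules."""
    includes, excludes = parse_rules(rules_text)

    def relevant(path):
        # With no include rules, every path counts as included.
        included = not includes or any(fnmatchcase(path, p) for p in includes)
        excluded = any(fnmatchcase(path, p) for p in excludes)
        return included and not excluded

    return any(relevant(f) for f in changed_files)


rules = "+:src/** -:docs/**"
print(should_trigger(["docs/guide.md"], rules))                     # False
print(should_trigger(["docs/guide.md", "src/app/main.py"], rules))  # True
```

A documentation-only commit queues nothing, while a commit that touches source files still triggers a build. Replaying recent commits through a model like this gives a rough estimate of how many builds a rule change would save before it is applied.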

4. Increase Agent Capacity

Provision on-demand cloud agents with auto-scaling
Ensure custom images have latest build tools preinstalled
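A rough way to size agent capacity is Little's law: the average number of busy agents equals the build arrival rate multiplied by the average build duration. The Python sketch below applies it with a target utilization to leave headroom. The function name and the input figures are hypothetical, and the estimate describes steady-state averages only; it does not account for bursts or for builds that can run only on a subset of agents.

```python
import math


def agents_needed(builds_per_hour, avg_build_minutes, target_utilization=0.7):
    """Estimate agent count from arrival rate and average build duration."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average busy agents = arrival rate * average duration.
    busy_agents = builds_per_hour * avg_build_minutes / 60.0
    # Divide by the target utilization to leave headroom for bursts.
    return math.ceil(busy_agents / target_utilization)


# Hypothetical load: 90 builds per hour, 12 minutes each on average.
print(agents_needed(90, 12))       # 26
print(agents_needed(90, 12, 1.0))  # 18
```

Running the estimate per agent pool rather than for the whole installation keeps it honest, because spare generic agents do not help a queue of Docker or Android builds.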

5. Implement Build Prioritization

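Build Queue > Configure Build Priorities
Create a priority class with a priority above the default of 0 (the allowed range is -100 to 100)
Add the build configurations that build or deploy main to that class; priority classes group build configurations, not branches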

Best Practices

Use templates for consistent build config across projects
Monitor agent utilization via the built-in statistics or the server's Prometheus-format metrics endpoint
Avoid recursive trigger chains unless absolutely necessary
Run load tests periodically using synthetic VCS events
Leverage the Kotlin DSL to version and audit build config changes
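One way to approximate a periodic load test is to queue builds directly through the REST API instead of generating real commits. The Python sketch below only prepares the requests and does not send them. It assumes token-based authentication with a Bearer header and a JSON body naming the build configuration; the server URL, token, and build configuration IDs are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_trigger_request(base_url, token, build_type_id, branch=None):
    """Prepare (but do not send) a POST that queues one build."""
    body = {
        "buildType": {"id": build_type_id},
        "comment": {"text": "synthetic load test"},
    }
    if branch:
        body["branchName"] = branch
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/app/rest/buildQueue",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def plan_load(base_url, token, build_type_ids, builds_each):
    """Prepare `builds_each` trigger requests per build configuration."""
    return [build_trigger_request(base_url, token, bt)
            for bt in build_type_ids
            for _ in range(builds_each)]


# Placeholder server, token, and configuration IDs.
planned = plan_load("https://teamcity.example.com", "TOKEN",
                    ["Backend_Build", "Mobile_AndroidBuild"], builds_each=3)
print(len(planned), planned[0].get_method(), planned[0].full_url)
# Sending is left out on purpose; against a test server each request
# could be passed to urllib.request.urlopen().
```

Keeping request construction separate from sending makes it possible to review the planned load first and to point the run at a staging server rather than production.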

Conclusion

Build queue congestion and agent starvation are silent killers of CI/CD velocity in TeamCity environments. While seemingly benign, they signal deeper architectural or process-level inefficiencies. By segmenting agent pools, defining explicit requirements, optimizing triggers, and leveraging infrastructure elasticity, teams can restore throughput and reduce MTTR. Investing in observability and configuration as code ensures resilience as scale and complexity grow.

FAQs

1. How many builds can TeamCity queue safely?

There is no hard limit, but performance may degrade with thousands of queued builds unless the server and database are scaled appropriately.

2. Can agents be auto-scaled in TeamCity?

Yes. TeamCity supports cloud profiles for AWS, Azure, GCP, and custom Docker-based agents. Cloud agents are started when queued builds need them and stopped after a configurable idle time.

3. Why do builds say "Waiting for Compatible Agent" even when agents are idle?

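Most likely, the build's agent requirements do not match the parameters of any idle agent. Other common causes are agents that are disabled or unauthorized, or that belong to a pool not assigned to the build's project. Check the configuration's agent requirements against the available agents.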

4. How do I reduce noisy triggers from monorepos?

Use file path rules to trigger builds only when relevant directories change, and apply branch filters to exclude dev or experimental branches.

5. Is it safe to delete builds from the queue manually?

Yes, but it should be done with caution to avoid discarding critical deployments. Automate cleanup using REST API if needed.
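A cautious cleanup script selects candidates first and removes them in a separate step. The Python sketch below picks queued builds older than a threshold and skips protected configurations. It assumes the queue was requested with a `fields` parameter that includes `queuedDate`, in the compact timestamp format shown in the sample. The payload, the configuration IDs, and the `stale_build_ids` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Compact timestamp format assumed for queuedDate, e.g. 20240731T060000+0000.
TC_DATE_FORMAT = "%Y%m%dT%H%M%S%z"


def stale_build_ids(queue_payload, now, max_age, protected_build_types=frozenset()):
    """IDs of queued builds older than max_age, skipping protected configurations."""
    stale = []
    for build in queue_payload.get("build", []):
        if build.get("buildTypeId") in protected_build_types:
            continue  # never auto-remove critical deployments
        queued = build.get("queuedDate")
        if not queued:
            continue  # without a timestamp, leave the build alone
        if now - datetime.strptime(queued, TC_DATE_FORMAT) > max_age:
            stale.append(build["id"])
    return stale


# Hypothetical queue snapshot.
queue = {"build": [
    {"id": 201, "buildTypeId": "Backend_Build", "queuedDate": "20240731T060000+0000"},
    {"id": 202, "buildTypeId": "Prod_Deploy", "queuedDate": "20240731T050000+0000"},
    {"id": 203, "buildTypeId": "Backend_Build", "queuedDate": "20240731T114500+0000"},
]}
now = datetime(2024, 7, 31, 12, 0, tzinfo=timezone.utc)

print(stale_build_ids(queue, now, timedelta(hours=2),
                      protected_build_types={"Prod_Deploy"}))  # [201]
```

The removal call itself is left out. Logging the selected IDs for a few days before enabling deletion is a cheap way to confirm that the threshold and the protected list behave as intended.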
