Troubleshooting Equinix Metal: Fixing Provisioning Failures, Network Bonding Errors, Terraform Issues, Metadata Problems, and Cluster API Stalls

Details: Category: Cloud Platforms and Services; By Mindful Chase; 20.Apr; Hits: 163

Equinix Metal is a bare-metal cloud service offering automated provisioning of physical servers in global data centers. Built for performance-sensitive workloads, it supports custom OS installs, Kubernetes (via Cluster API), and direct integration with the Equinix Fabric for low-latency networking. Despite its raw power and flexibility, engineers working with Equinix Metal often face challenges involving PXE boot errors, provision stuck states, network bonding misconfigurations, API automation bugs, and integration gaps with tools like Terraform and Cluster API. This article presents a thorough troubleshooting guide tailored for managing and scaling infrastructure on Equinix Metal.

Mindful Chase

Writing Code, Writing Stories

tbd

Experience

tbd

More to Explore

Understanding Equinix Metal Architecture

Device Lifecycle and Provisioning

Equinix Metal devices transition through several states—queued, provisioning, active, deprovisioning. Provisioning involves PXE boot, OS installation, and metadata injection.

Networking and Bonding Modes

Each server includes multiple NICs configurable via Layer 2, Layer 3, or Hybrid bonding. Users can select from bonded or unbonded setups depending on traffic segmentation needs.

Common Equinix Metal Issues

1. Device Stuck in Provisioning

This occurs due to PXE boot failures, custom OS image errors, or capacity delays. The API shows the device as provisioning indefinitely without progressing.

2. Incorrect or Broken Network Bonding

Failure to apply correct bonding mode results in dropped packets, unreachable devices, or inconsistent MTU behavior—especially in Layer 2 configurations.

3. Failed Terraform Apply or Destroy

Terraform providers for Equinix Metal sometimes fail on state refresh or deletion with errors like device not found or cannot destroy allocated resource, usually due to lingering leases or deprovisioning lag.

4. Metadata or User Data Not Applied

Cloud-init or ignition configs fail if the metadata server is unreachable, improperly formatted, or blocked by firewall settings. Devices boot without config or SSH keys.

5. Cluster API Equinix Metal Provider Errors

CAPM (Cluster API Provider Metal) may stall due to incomplete MachineDeployment rollout, missing bootstrap templates, or unmanaged device tagging in Equinix Metal.

Diagnostics and Debugging Techniques

Use Packet CLI and Metal API

Query device state via metal device get [UUID] or curl the REST API to inspect provisioning logs and timestamps. Compare against expected transition times.

Inspect iPXE Boot Logs

Attach to the device’s out-of-band console via metal console or web portal. Review iPXE output for errors loading boot scripts or kernel images.

Review Network Mode and Port States

Run ethtool and ip a from within the device. Validate bonding settings, MTU, and interface activation. Use metal ports list to confirm backend link state.

Terraform Debug Output

Set TF_LOG=DEBUG and TF_LOG_PATH to capture verbose logs. Cross-check resource IDs in state files with those returned by metal device list.

Inspect Metadata Endpoint from Device

From the device, curl http://metadata.packet.net/metadata to ensure metadata is reachable. Validate user_data or custom_data sections for YAML or JSON syntax errors.

Step-by-Step Resolution Guide

1. Fix Stuck Provisioning

Check if PXE boot failed via console. Reboot the device or redeploy with a different OS. If using custom OS, ensure the image is UEFI-compatible and reachable by HTTP.

2. Resolve Network Bonding Failures

Switch to Layer 3 bonding if Layer 2 dependencies (like VLAN tagging) are not needed. Reboot with updated bonding mode via the Metal portal or API.

3. Correct Terraform Apply/Destroy Issues

Manually delete resources that are stuck via API or CLI. Ensure Terraform state is refreshed. Set timeouts for long device transitions.

4. Ensure Metadata and SSH Key Injection

Ensure cloud-init or ignition is installed and properly triggered. Validate metadata URLs and confirm user_data syntax via linters before provisioning.

5. Fix Cluster API Provisioning Stalls

Validate MachineDeployment and bootstrap references. Use CAPM logs and metal API tagging to confirm all cluster resources are created and tracked correctly.

Best Practices for Equinix Metal

Use reserved hardware plans for deterministic provisioning times.
Automate bonding mode selection per workload type (e.g., Layer 3 for Kubernetes).
Keep user data templates under version control and test with metadata emulators.
Pin Terraform provider versions and manage state in remote backends.
Monitor provisioning latency with custom Prometheus exporters or API hooks.

Conclusion

Equinix Metal provides powerful bare-metal automation but demands a deeper understanding of provisioning mechanics, network bonding, and orchestration tooling. By using CLI/API inspection, verifying metadata reachability, and debugging infrastructure-as-code state, engineers can overcome provisioning stalls, misconfigurations, and integration failures. With disciplined use of tooling and observability, Metal-based infrastructure can be both performant and production-hardened.

FAQs

1. Why is my server stuck in provisioning for over 10 minutes?

PXE boot or custom OS image failures are likely. Check console logs and verify the OS image URL and format.

2. How can I switch bonding mode after provisioning?

You must re-provision the device with the desired bonding mode. It cannot be changed live on Metal.

3. Why aren't my SSH keys or cloud-init scripts applied?

Check metadata reachability and syntax in user_data. Use curl from the device to verify metadata access.

4. What causes Terraform to fail destroying devices?

The device may still be deprovisioning or has stale leases. Confirm via Metal CLI and use timeouts in your configuration.

5. Can I use custom PXE scripts with Equinix Metal?

Yes, through custom OS provisioning. Provide a script-hosted kernel/initrd and validate boot via the console logs.

Contact Us