Understanding Equinix Metal Architecture
Device Lifecycle and Provisioning
Equinix Metal devices transition through several states: queued, provisioning, active, and deprovisioning. Provisioning involves PXE boot, OS installation, and metadata injection.
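The lifecycle above can be modeled as a small state machine. This is useful for flagging devices whose observed history skips or reverses a step. The sketch below is a minimal illustration that covers only the four states named here. The real API reports additional states, and this table deliberately ignores them.

```python
from __future__ import annotations

# Forward transitions between the four lifecycle states named in the text.
# Assumption: other API states (not listed here) are treated as invalid.
ALLOWED_TRANSITIONS = {
    "queued": {"provisioning"},
    "provisioning": {"active"},
    "active": {"deprovisioning"},
    "deprovisioning": set(),
}


def is_valid_transition(current: str, nxt: str) -> bool:
    """Return True if moving from `current` to `nxt` follows the lifecycle."""
    return nxt in ALLOWED_TRANSITIONS.get(current, set())


def first_invalid_step(history: list[str]) -> tuple[str, str] | None:
    """Return the first (from, to) pair that breaks the lifecycle, or None."""
    for current, nxt in zip(history, history[1:]):
        # Repeated polls of the same state are not transitions.
        if current != nxt and not is_valid_transition(current, nxt):
            return (current, nxt)
    return None
```

For example, a polled history of queued, provisioning, provisioning, active is accepted. A jump straight from queued to active is reported as the offending pair.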
Networking and Bonding Modes
Each server includes multiple NICs that can be configured for Layer 2, Layer 3, or Hybrid networking. Users can choose bonded or unbonded setups depending on their traffic segmentation needs.
Common Equinix Metal Issues
1. Device Stuck in Provisioning
This occurs because of PXE boot failures, custom OS image errors, or capacity delays. The API keeps reporting the device as provisioning indefinitely without progressing.
2. Incorrect or Broken Network Bonding
Failing to apply the correct bonding mode results in dropped packets, unreachable devices, or inconsistent MTU behavior, especially in Layer 2 configurations.
3. Failed Terraform Apply or Destroy
The Terraform provider for Equinix Metal sometimes fails during state refresh or deletion with errors such as "device not found" or "cannot destroy allocated resource". The cause is usually lingering leases or deprovisioning lag.
4. Metadata or User Data Not Applied
Cloud-init or Ignition configs fail if the metadata server is unreachable, the config is improperly formatted, or firewall settings block access. The device then boots without its config or SSH keys.
5. Cluster API Equinix Metal Provider Errors
CAPM (Cluster API Provider Metal) may stall because of an incomplete MachineDeployment rollout, missing bootstrap templates, or devices that lack the expected tags in Equinix Metal.
Diagnostics and Debugging Techniques
Use the Metal CLI (formerly Packet CLI) and Metal API
Query device state with metal device get -i <UUID>, or curl the REST API to inspect provisioning events and timestamps. Compare them against expected transition times.
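The comparison against expected transition times can be scripted. Below is a minimal sketch. It assumes the device JSON (from the API or the CLI) exposes a `state` field and an ISO 8601 `created_at` field. The 30-minute default threshold is a hypothetical placeholder, not a documented limit.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_stuck_provisioning(device: dict, now: datetime,
                          threshold: timedelta = timedelta(minutes=30)) -> bool:
    """True if the device has been 'provisioning' for longer than `threshold`.

    Assumptions: 'state' and 'created_at' are the field names, and the
    default threshold is an arbitrary example value.
    """
    if device.get("state") != "provisioning":
        return False
    return now - parse_timestamp(device["created_at"]) > threshold
```

Pass a timezone-aware `now`, for example `datetime.now(timezone.utc)`. Subtracting a naive datetime from an aware one raises a `TypeError`.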
Inspect iPXE Boot Logs
Attach to the device's out-of-band console via metal console or the web portal. Review the iPXE output for errors loading boot scripts or kernel images.
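Once console output has been captured to text, scanning it for iPXE failures is straightforward. This sketch assumes iPXE reports errors in its usual "message (http://ipxe.org/<code>)" form. It is a heuristic filter, not a complete parser of console output.

```python
from __future__ import annotations

import re

# Assumption: iPXE errors end with "(http://ipxe.org/<8 hex digits>)".
IPXE_ERROR = re.compile(
    r"^(?P<message>.*?)\s*\(https?://ipxe\.org/(?P<code>[0-9a-f]{8})\)"
)


def find_ipxe_errors(console_text: str) -> list[tuple[str, str]]:
    """Return (message, error_code) pairs found in captured console output."""
    errors = []
    for line in console_text.splitlines():
        match = IPXE_ERROR.search(line.strip())
        if match:
            errors.append((match.group("message"), match.group("code")))
    return errors
```

The error code can then be looked up on the iPXE site to identify the failing step, such as a timed-out kernel download or an image in the wrong format.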
Review Network Mode and Port States
Run ethtool and ip a from within the device to validate bonding settings, MTU, and interface state. Use metal ports list to confirm the backend link state.
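Two on-host checks can be automated in the same spirit. The first reads the Linux bonding driver's status file (/proc/net/bonding/bond0) to get the bond mode and per-member link state. The second uses the JSON form of ip link (ip -j link show) to find members whose MTU differs from the bond's. The sketch assumes the bond is named bond0 and that enslaved interfaces carry a "master" key in the JSON output.

```python
from __future__ import annotations

import json


def parse_bonding_status(text: str) -> dict:
    """Parse /proc/net/bonding/<bond> text into mode, bond state and member states."""
    result = {"mode": None, "mii_status": None, "slaves": {}}
    current_slave = None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank lines and lines without a key
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            result["mode"] = value
        elif key == "Slave Interface":
            current_slave = value
            result["slaves"][current_slave] = None
        elif key == "MII Status":
            if current_slave is None:
                result["mii_status"] = value  # link state of the bond itself
            else:
                result["slaves"][current_slave] = value


    return result


def mtu_mismatches(ip_link_json: str, bond: str = "bond0") -> dict[str, int]:
    """Return member interfaces whose MTU differs from the bond's MTU.

    Assumes `ip -j link show` style JSON in which enslaved interfaces
    have "master": "<bond>". Raises StopIteration if the bond is absent.
    """
    links = json.loads(ip_link_json)
    bond_mtu = next(link["mtu"] for link in links if link["ifname"] == bond)
    return {
        link["ifname"]: link["mtu"]
        for link in links
        if link.get("master") == bond and link["mtu"] != bond_mtu
    }
```

A member reported as "down", or an MTU mismatch such as 1500 on one port and 9000 on the bond, points at the dropped-packet and MTU symptoms described earlier.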
Terraform Debug Output
Set TF_LOG=DEBUG and TF_LOG_PATH to capture verbose logs. Cross-check the resource IDs in the state file against those returned by metal device list.
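The ID cross-check can be done with a short script. It reads the raw state (for example, the output of terraform state pull) and a set of device IDs taken from the CLI or API. This is a minimal sketch. It assumes the version 4 state layout, and the list of device resource types across provider generations is also an assumption.

```python
from __future__ import annotations

import json

# Assumed device resource types across provider generations.
DEVICE_TYPES = {"equinix_metal_device", "metal_device", "packet_device"}


def state_device_ids(state_json: str) -> set[str]:
    """Collect device IDs from managed resources in a Terraform v4 state file."""
    state = json.loads(state_json)
    ids = set()
    for resource in state.get("resources", []):
        if resource.get("mode") == "managed" and resource.get("type") in DEVICE_TYPES:
            for instance in resource.get("instances", []):
                ids.add(instance["attributes"]["id"])
    return ids


def compare_ids(state_ids: set[str], api_ids: set[str]) -> dict[str, set[str]]:
    """Split IDs into those only in state (likely stale) and only in the API (untracked)."""
    return {
        "only_in_state": state_ids - api_ids,
        "only_in_api": api_ids - state_ids,
    }
```

IDs found only in state usually explain "device not found" during refresh. IDs found only in the API are devices that Terraform no longer tracks.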
Inspect Metadata Endpoint from Device
From the device, curl http://metadata.packet.net/metadata to confirm that the metadata service is reachable. Validate the user_data or custom_data sections for YAML or JSON syntax errors.
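Some syntax problems can be caught before a device is ever provisioned. The sketch below uses only the standard library, so its YAML check is deliberately shallow: it flags tab indentation, which YAML forbids, and does not parse YAML in full. Ignition JSON is fully parsed and checked for an ignition.version field. A real validator, such as cloud-init's own schema check or an Ignition validator, should still be preferred.

```python
from __future__ import annotations

import json


def check_user_data(user_data: str) -> list[str]:
    """Return problems found by cheap, stdlib-only heuristics (not a full parser)."""
    problems = []
    if user_data.startswith("#cloud-config"):
        for number, line in enumerate(user_data.splitlines(), start=1):
            indent = line[: len(line) - len(line.lstrip())]
            if "\t" in indent:
                problems.append(f"line {number}: tab used for indentation")
    elif user_data.lstrip().startswith("{"):
        try:
            config = json.loads(user_data)
        except json.JSONDecodeError as exc:
            return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
        ignition = config.get("ignition")
        if not isinstance(ignition, dict) or "version" not in ignition:
            problems.append("JSON is missing ignition.version")
    elif not user_data.startswith("#!"):
        problems.append(
            "unrecognized format: expected #cloud-config, a #! script, or Ignition JSON"
        )
    return problems
```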
Step-by-Step Resolution Guide
1. Fix Stuck Provisioning
Check the console for PXE boot failures. Reboot the device, or redeploy it with a different OS. If you use a custom OS, ensure the image is UEFI-compatible and reachable over HTTP.
2. Resolve Network Bonding Failures
Switch to Layer 3 bonding if Layer 2 dependencies (like VLAN tagging) are not needed. Reboot with updated bonding mode via the Metal portal or API.
3. Correct Terraform Apply/Destroy Issues
Manually delete stuck resources via the API or CLI, then refresh the Terraform state. Set timeouts for long device transitions.
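Terraform's timeouts setting covers waiting inside Terraform itself. Cleanup scripts that delete devices manually need an equivalent wait before re-running terraform destroy or apply. Below is a minimal polling sketch. `get_state` is a hypothetical callable you supply, for example a thin wrapper around the Metal API that returns None once the device is gone. The clock and sleep functions are injectable so the loop can be tested without real waiting.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


def wait_for_state(get_state: Callable[[], Optional[str]],
                   target: Optional[str],
                   timeout: float = 1800.0,
                   interval: float = 15.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns `target` or `timeout` seconds pass.

    target=None means "the device no longer exists". Returns True on
    success and False on timeout. The default timings are placeholders.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Waiting for the device to disappear, rather than retrying the delete immediately, avoids the "cannot destroy allocated resource" race while deprovisioning is still in progress.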
4. Ensure Metadata and SSH Key Injection
Ensure cloud-init or Ignition is installed and runs at boot. Validate metadata URLs, and check user_data syntax with a linter before provisioning.
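To confirm key injection on a running device, compare the keys listed in the metadata document with the contents of the user's authorized_keys file. This sketch assumes the metadata JSON lists keys under an ssh_keys field. It compares only the key type and base64 body, ignoring options and comments, and its whitespace split does not handle options that contain quoted spaces.

```python
from __future__ import annotations

import json

KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def key_fields(line: str) -> tuple[str, str] | None:
    """Return (key type, base64 body) from an OpenSSH public key line, or None."""
    tokens = line.split()
    for i, token in enumerate(tokens[:-1]):
        if token.startswith(KEY_TYPE_PREFIXES):
            return token, tokens[i + 1]
    return None


def missing_keys(metadata_json: str, authorized_keys: str) -> list[str]:
    """Return metadata keys (assumed 'ssh_keys' field) absent from authorized_keys."""
    expected = json.loads(metadata_json).get("ssh_keys", [])
    present = {
        fields
        for fields in map(key_fields, authorized_keys.splitlines())
        if fields is not None
    }
    return [key for key in expected if key_fields(key) not in present]
```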
5. Fix Cluster API Provisioning Stalls
Validate MachineDeployment and bootstrap references. Use CAPM logs and Metal API tags to confirm that all cluster resources are created and tracked correctly.
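The tag check can be reduced to set arithmetic between the hostnames Cluster API expects and the devices Metal reports. This is a sketch under stated assumptions. Devices are dicts with hostname and tags keys. `cluster_tag` is a hypothetical tag applied per cluster, since the provider's real tagging scheme is not specified here. Expected hostnames are assumed to match Machine names.

```python
from __future__ import annotations


def reconcile_machines(expected_hostnames: set[str], devices: list[dict],
                       cluster_tag: str) -> dict[str, set[str]]:
    """Classify expected Machines against devices reported by the Metal API."""
    all_hosts = {d["hostname"] for d in devices}
    tagged = {d["hostname"] for d in devices if cluster_tag in d.get("tags", [])}
    return {
        # Expected by Cluster API but never created in Metal.
        "missing": expected_hostnames - all_hosts,
        # Created, but without the cluster tag, so the provider may not track it.
        "untagged": (expected_hostnames & all_hosts) - tagged,
        # Tagged for the cluster but no longer expected (possible leftovers).
        "orphaned": tagged - expected_hostnames,
    }
```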
Best Practices for Equinix Metal
- Use reserved hardware plans for more predictable provisioning times.
- Automate bonding mode selection per workload type (e.g., Layer 3 for Kubernetes).
- Keep user data templates under version control and test with metadata emulators.
- Pin Terraform provider versions and manage state in remote backends.
- Monitor provisioning latency with custom Prometheus exporters or API hooks.
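Automating the network-mode choice can start as a small, reviewable mapping from workload needs to a network type. In the sketch below, the network-type strings are assumed to follow Equinix Metal's naming (layer3, layer2-bonded, layer2-individual, hybrid, hybrid-bonded). The decision rules are illustrative assumptions, apart from Layer 3 for Kubernetes, which comes from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadNeeds:
    needs_vlans: bool           # private Layer 2 segments (VLAN tagging)
    needs_public_ip: bool       # routed public IPs on the host
    needs_bonding: bool = True  # NIC redundancy across bonded ports


def choose_network_type(needs: WorkloadNeeds) -> str:
    """Map workload needs to a network-type string (values and rules assumed)."""
    if not needs.needs_vlans:
        return "layer3"  # e.g. Kubernetes nodes routing everything at Layer 3
    if needs.needs_public_ip:
        return "hybrid-bonded" if needs.needs_bonding else "hybrid"
    return "layer2-bonded" if needs.needs_bonding else "layer2-individual"
```

Keeping the rules in one function makes the choice easy to review and to reuse from whatever provisioning tooling is in place.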
Conclusion
Equinix Metal provides powerful bare-metal automation but demands a deeper understanding of provisioning mechanics, network bonding, and orchestration tooling. By using CLI/API inspection, verifying metadata reachability, and debugging infrastructure-as-code state, engineers can overcome provisioning stalls, misconfigurations, and integration failures. With disciplined use of tooling and observability, Metal-based infrastructure can be both performant and production-hardened.
FAQs
1. Why is my server stuck in provisioning for over 10 minutes?
PXE boot or custom OS image failures are likely. Check console logs and verify the OS image URL and format.
2. How can I switch bonding mode after provisioning?
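Convert the network type through the Metal portal or API, then update the operating system's interface configuration to match and reboot. If the conversion is not supported for your configuration, re-provision the device with the desired mode.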
3. Why aren't my SSH keys or cloud-init scripts applied?
Check metadata reachability and the syntax of user_data. Use curl from the device to verify metadata access.
4. What causes Terraform to fail destroying devices?
The device may still be deprovisioning, or it may have stale leases. Confirm its state with the Metal CLI, and set timeouts in your configuration.
5. Can I use custom PXE scripts with Equinix Metal?
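Yes, through custom iPXE provisioning. Point the device at a hosted iPXE script (the custom_ipxe operating system with an ipxe_script_url) that loads your kernel and initrd, then validate the boot through the console logs.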