Understanding Huawei Cloud Architecture

Core Components

  • ECS (Elastic Cloud Server): Core compute unit for virtual machines
  • VPC (Virtual Private Cloud): Network isolation and routing
  • OBS (Object Storage Service): S3-compatible storage layer
  • CCE (Cloud Container Engine): Kubernetes-based managed container platform
  • IAM: Access control via custom policies and fine-grained permissions

Multi-Region Considerations

Huawei Cloud regions are geographically isolated, and not every service or feature is available in every region. Cross-region network latency, VPC peering limitations, and region-specific service endpoints can introduce unexpected behavior.

Common Troubleshooting Scenarios

1. VPC Routing and Subnet Isolation Failures

Custom route tables may silently block traffic between ECS instances or subnets. Conflicting network ACL and security group rules often produce confusing access behavior, because traffic must be permitted by both.

// Pseudocode: list the routes in effect for a VPC
// (vpcClient stands in for whichever SDK client you use)
vpcClient.listRoutes("vpc-id")

Fix: Audit route tables and ensure that peered VPCs have no overlapping CIDR blocks. Avoid defining a 0.0.0.0/0 route in both the default and custom route tables at the same time.
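The CIDR audit can be scripted. The following is a minimal sketch using Python's standard ipaddress module; the VPC names and address ranges are illustrative assumptions, and in practice the inventory would come from your own VPC listing.

```python
import ipaddress
from itertools import combinations


def find_overlapping_cidrs(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap.

    vpc_cidrs maps a VPC name to its CIDR string.
    """
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    overlaps = []
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            overlaps.append((name_a, name_b))
    return overlaps


# Illustrative inventory; names and ranges are assumptions.
peered_vpcs = {
    "vpc-app": "10.0.0.0/16",
    "vpc-data": "10.0.128.0/17",
    "vpc-ci": "172.16.0.0/16",
}
print(find_overlapping_cidrs(peered_vpcs))  # [('vpc-app', 'vpc-data')]
```

Running a check like this before creating a peering connection catches conflicts that would otherwise surface only as dropped traffic.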

2. OBS Object Consistency and Replication Delays

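Do not assume that every OBS read reflects the latest write. Applications that expect strong consistency may read stale data after an object overwrite or delete, especially from a cross-region replication target, where changes arrive asynchronously.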

Fix: Enable versioning so that each overwrite gets a distinct version ID, and validate object metadata (such as the ETag) on read. For critical reads, retry with exponential backoff until the checksum matches the expected value.
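A confirmation read of this kind might look like the sketch below. It assumes the writer recorded a SHA-256 digest at upload time; fetch is a hypothetical zero-argument callable standing in for an object download, not an OBS SDK function.

```python
import hashlib
import time


def read_until_checksum_matches(fetch, expected_sha256, max_attempts=5,
                                base_delay=0.2, sleep=time.sleep):
    """Call fetch() until the returned bytes hash to expected_sha256.

    fetch is any zero-argument callable returning bytes (a stand-in for an
    object download). The delay doubles after each mismatch.
    """
    for attempt in range(max_attempts):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    raise TimeoutError("object did not reach the expected checksum")


# Simulated download that returns stale content twice, then the new object.
responses = iter([b"old", b"old", b"new"])
expected = hashlib.sha256(b"new").hexdigest()
data = read_until_checksum_matches(lambda: next(responses), expected,
                                   sleep=lambda seconds: None)
print(data)  # b'new'
```

The sleep parameter is injectable only so the example runs instantly; production code would keep the default.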

3. IAM Policy Conflicts and Least Privilege Gaps

Custom policies often conflict with default roles, causing access denial errors without clear logs. Service-level trust policies may override user-assigned rights.

{
  "Version": "1.1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["obs:object:GetObject"],
      "Resource": ["obs:*:*:object:bucket-name/*"]
    }
  ]
}

Fix: Use the IAM simulator in the console to test effective permissions. Avoid redundant allow/deny overlaps across group and project scopes; an explicit deny takes precedence over any allow.
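To reason about overlapping statements offline, a simplified evaluator can help. The sketch below models only deny-overrides-allow on action patterns; it ignores resources and conditions, assumes statements use "Effect" and "Action" keys, and is not a reimplementation of IAM's actual evaluation logic.

```python
from fnmatch import fnmatchcase


def effective_decision(statements, action):
    """Evaluate an action against statements with deny-overrides-allow.

    Each statement is a dict with "Effect" ("Allow" or "Deny") and "Action"
    (a list of patterns that may contain "*"). Resources and conditions
    are ignored in this simplified model.
    """
    decision = "ImplicitDeny"
    for stmt in statements:
        if any(fnmatchcase(action, pattern) for pattern in stmt["Action"]):
            if stmt["Effect"] == "Deny":
                return "Deny"
            decision = "Allow"
    return decision


# Hypothetical statements attached at two different scopes.
group_policy = {"Effect": "Allow", "Action": ["obs:object:*"]}
project_policy = {"Effect": "Deny", "Action": ["obs:object:DeleteObject"]}
statements = [group_policy, project_policy]

print(effective_decision(statements, "obs:object:GetObject"))     # Allow
print(effective_decision(statements, "obs:object:DeleteObject"))  # Deny
print(effective_decision(statements, "ecs:servers:list"))         # ImplicitDeny
```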

4. API Rate Limiting in Multi-Tenant Workloads

High-frequency access to services like CCE or SMN may hit undocumented rate limits. Quotas are not always visible via standard APIs.

Fix: Implement retries with exponential backoff. Request quota increases proactively for batch workloads or CI/CD pipelines.
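A retry wrapper along these lines is one option. This sketch adds random jitter so that many tenants or pipeline jobs do not retry in lockstep; ThrottledError is a hypothetical exception that your client code would raise on a rate-limit response (for example HTTP 429).

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when a call is rate limited."""


def call_with_backoff(func, max_attempts=6, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Retry func() on ThrottledError using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))


# Simulated API call that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("rate limited")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, len(delays))  # ok 2
```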

5. CCE Pod Scheduling Delays

CCE may delay pod scheduling due to exhausted ENIs, misconfigured node pools, or failed CSI drivers (e.g., EVS volumes stuck in attaching state).

Fix: Monitor kubelet and csi-attacher logs. Ensure node pool scaling policies have enough buffer and check for quota exhaustion.
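Because CCE is Kubernetes-based, the scheduler's own explanation is usually the quickest starting point. The sketch below scans parsed `kubectl get pods -A -o json` output for Pending pods whose PodScheduled condition is False; the sample data is invented for illustration.

```python
def unschedulable_pods(pod_list):
    """Return (namespace, name, reason, message) for unschedulable Pending pods.

    pod_list is the parsed output of `kubectl get pods -A -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                meta = pod["metadata"]
                findings.append((meta["namespace"], meta["name"],
                                 cond.get("reason", ""), cond.get("message", "")))
    return findings


# Invented sample mirroring the shape of the kubectl JSON output.
sample = {"items": [
    {"metadata": {"namespace": "prod", "name": "api-1"},
     "status": {"phase": "Running", "conditions": []}},
    {"metadata": {"namespace": "prod", "name": "worker-7"},
     "status": {"phase": "Pending", "conditions": [
         {"type": "PodScheduled", "status": "False",
          "reason": "Unschedulable", "message": "0/3 nodes are available"}]}},
]}
print(unschedulable_pods(sample))
```

Pods that are waiting on a volume attachment are typically already scheduled, so they will not appear here; for those, the pod's events (via `kubectl describe pod`) are the place to look.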

Diagnostics and Debugging Techniques

1. Use Cloud Eye for Deep Observability

Enable metrics across ECS, OBS, ELB, and custom services. Set alarms on ECS bandwidth, OBS latency, and failed API requests.

2. Log Tank Service (LTS)

Aggregate logs from VPC flow, ELB, CCE, and application layers. Use structured filters to trace request paths across regions or services.
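Tracing a request path is much easier when every service emits a shared request identifier. The sketch below assumes JSON log lines with hypothetical "request_id", "timestamp", and "service" fields; these names are illustrative and not an LTS schema.

```python
import json


def trace_request(log_lines, request_id):
    """Return log records for one request ID, ordered by timestamp."""
    records = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        if isinstance(record, dict) and record.get("request_id") == request_id:
            records.append(record)
    # ISO 8601 timestamps in one timezone sort correctly as strings.
    return sorted(records, key=lambda r: r.get("timestamp", ""))


sample_lines = [
    '{"timestamp": "2024-05-01T10:00:02Z", "service": "cce-app", "request_id": "req-1", "msg": "handled"}',
    'plain text line without structure',
    '{"timestamp": "2024-05-01T10:00:01Z", "service": "elb", "request_id": "req-1", "msg": "forwarded"}',
    '{"timestamp": "2024-05-01T10:00:03Z", "service": "elb", "request_id": "req-2", "msg": "forwarded"}',
]
path = [record["service"] for record in trace_request(sample_lines, "req-1")]
print(path)  # ['elb', 'cce-app']
```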

3. CLI-Based Diagnostics

# Check ECS instance status and details with KooCLI (hcloud).
# Operation names mirror the ECS API; confirm them with `hcloud ECS --help`.
hcloud ECS ListServersDetails
hcloud ECS ShowServer --server_id="xxx"

Compare CLI results with Terraform state to detect drift and misalignment across environments.
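One way to automate that comparison is sketched below. It reads the structure produced by `terraform show -json` (values.root_module.resources) and compares selected attributes against live values. How the live values are collected is left open, and the resource addresses and flavors shown are illustrative assumptions.

```python
def find_drift(state, live, tracked_attrs):
    """Compare attributes recorded in Terraform state with live values.

    state: parsed `terraform show -json` output (root module only).
    live: dict mapping resource address to a dict of live attributes.
    Returns {address: {attr: (recorded, actual)}} for mismatches.
    """
    drift = {}
    resources = state.get("values", {}).get("root_module", {}).get("resources", [])
    for res in resources:
        address = res["address"]
        recorded = res.get("values", {})
        actual = live.get(address)
        if actual is None:
            drift[address] = {"_missing": True}  # resource no longer exists
            continue
        diffs = {attr: (recorded.get(attr), actual.get(attr))
                 for attr in tracked_attrs
                 if recorded.get(attr) != actual.get(attr)}
        if diffs:
            drift[address] = diffs
    return drift


# Illustrative data; addresses and flavors are assumptions.
state = {"values": {"root_module": {"resources": [
    {"address": "huaweicloud_compute_instance.web",
     "values": {"name": "web-1", "flavor_id": "s6.large.2"}},
    {"address": "huaweicloud_compute_instance.batch",
     "values": {"name": "batch-1", "flavor_id": "s6.small.1"}},
]}}}
live = {"huaweicloud_compute_instance.web": {"name": "web-1", "flavor_id": "s6.xlarge.2"}}

drift = find_drift(state, live, ["name", "flavor_id"])
print(drift)
```

This sketch does not descend into child modules; `terraform plan` remains the authoritative drift check.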

Fixes and Long-Term Solutions

1. Harden IAM with Policy Layers

  • Use scoped tokens instead of permanent AK/SK pairs
  • Apply separation of duties at project and domain levels
  • Version policies and enforce via CI gatekeeping
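A CI gate for versioned policies can start as a small lint step. The sketch below flags Allow statements that grant every action in a service or in the whole account; the threshold for "too broad" is an assumption to adapt, and the policy shown is an invented example.

```python
def is_too_broad(action):
    """True for "*" or for service-wide patterns such as "ecs:*:*"."""
    parts = action.split(":")
    if len(parts) == 1:
        return action == "*"
    return all(part == "*" for part in parts[1:])


def lint_policy(policy):
    """Return findings for Allow statements that grant overly broad actions."""
    findings = []
    for index, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in stmt.get("Action", []):
            if is_too_broad(action):
                findings.append(f"Statement[{index}]: overly broad action '{action}'")
    return findings


# Invented policy document for illustration.
policy = {"Version": "1.1", "Statement": [
    {"Effect": "Allow", "Action": ["obs:object:GetObject"]},
    {"Effect": "Allow", "Action": ["ecs:*:*"]},
    {"Effect": "Deny", "Action": ["*"]},
]}
findings = lint_policy(policy)
print(findings)  # ["Statement[1]: overly broad action 'ecs:*:*'"]
```

In a pipeline, a non-empty findings list would fail the job before the policy is applied.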

2. Implement Observability-First Design

Standardize log formats and metrics emission across all services. Use Cloud Trace Service (CTS) to correlate API activity with service health metrics.

3. Optimize Multi-Region Architecture

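Use a dedicated private connection (for example, Direct Connect or Cloud Connect) with reserved bandwidth for latency-sensitive services; a standard VPC peering connection links VPCs within a single region only. Always validate endpoint availability per region before provisioning.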
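A rough pre-provisioning check is to confirm that each regional endpoint resolves in DNS. The sketch below assumes the common `<service>.<region>.myhuaweicloud.com` hostname pattern, which not every service follows, so treat the pattern, the placeholder region name, and the service list as assumptions. The resolver is injectable so the example runs offline.

```python
import socket


def endpoint_host(service, region):
    """Build a regional endpoint hostname (assumed naming pattern)."""
    return f"{service}.{region}.myhuaweicloud.com"


def unresolved_endpoints(services, region, resolve=socket.gethostbyname):
    """Return endpoint hostnames that fail DNS resolution."""
    missing = []
    for service in services:
        host = endpoint_host(service, region)
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing


# Offline demonstration with a fake resolver and a placeholder region name.
def fake_resolve(host):
    if host.startswith("cce."):
        raise socket.gaierror("name not known")
    return "192.0.2.10"


missing = unresolved_endpoints(["ecs", "obs", "cce"], "xx-region-1", resolve=fake_resolve)
print(missing)  # ['cce.xx-region-1.myhuaweicloud.com']
```

A resolving hostname does not prove the service is enabled for your account, so this complements rather than replaces the regional service list.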

4. Automate Resource Cleanup and Drift Detection

Use Resource Formation Service (RFS) or Terraform, and review plan output to detect and revert drifted state. Tag unused volumes, ENIs, and floating IPs, and expire them on a schedule.
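The expiry rule itself is simple to express. The sketch below selects unattached resources older than a cutoff unless they carry a keep tag; the record shape, the "keep" tag name, and the 14-day default are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources, now, max_idle_days=14):
    """Return IDs of unattached, untagged-for-keep resources older than the cutoff."""
    cutoff = now - timedelta(days=max_idle_days)
    return [r["id"] for r in resources
            if not r["attached"]
            and r["tags"].get("keep") != "true"
            and r["created_at"] < cutoff]


# Hypothetical inventory records.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [
    {"id": "vol-1", "attached": False, "tags": {},
     "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "vol-2", "attached": True, "tags": {},
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "eip-1", "attached": False, "tags": {},
     "created_at": datetime(2024, 5, 25, tzinfo=timezone.utc)},
    {"id": "vol-3", "attached": False, "tags": {"keep": "true"},
     "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
expired = select_expired(inventory, now)
print(expired)  # ['vol-1']
```

Keeping selection separate from deletion makes it easy to run the job in report-only mode first.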

Best Practices

  • Use strong consistency settings for OBS when data integrity is critical
  • Benchmark across regions before deploying globally
  • Rotate IAM credentials automatically with a secrets manager such as Cloud Secret Management Service (CSMS)
  • Apply autoscaling for ECS/CCE with predictive CPU metrics
  • Isolate CI/CD environments using separate VPCs and IAM roles

Conclusion

Huawei Cloud provides enterprise-grade capabilities but requires deep platform awareness to ensure reliability, security, and scalability. Misconfigured IAM policies, unnoticed service limits, and multi-region gaps can silently impact uptime and developer productivity. By applying structured diagnostics, enforcing observability, and architecting with resilience in mind, teams can unlock the full power of Huawei Cloud in modern cloud-native deployments.

FAQs

1. Why do ECS instances lose connectivity after restart?

Check whether the elastic IP (EIP) or security groups were unbound during the restart. Ensure that the restart does not trigger IP reallocation through DHCP.

2. How can I ensure OBS objects are updated instantly?

Use versioning and metadata checks. For transactional systems, implement confirmation reads with checksum or timestamp validation.

3. What causes CCE pods to remain in 'Pending' state?

Common causes include resource exhaustion, node taints, or unavailable persistent volumes. Review scheduling events and node conditions.

4. Can I track IAM policy effectiveness before applying?

Yes, use Huawei's IAM policy simulator. It helps detect implicit deny or overlapping rules before live deployment.

5. Why does API rate limiting happen even under quota?

Some services apply soft throttling during peak periods or region-wide load. Implement retries and request formal quota reviews.