Understanding I/O Blocking and Process Hangs

What Happens?

  • Processes hang in uninterruptible sleep (state: D)
  • Commands like ls, df, or vi freeze on certain mount points
  • Even kill -9 fails to terminate affected processes
  • Disks or LUNs show I/O latency spikes without consistent patterns

Why It Matters

These hangs degrade system responsiveness, delay automation scripts and backups, and can escalate into outright service downtime. In virtualized environments or with multipath disk subsystems, misconfigured paths exacerbate the issue.

Solaris I/O Subsystem Overview

Blocking System Calls

Operations like read(), open(), or stat() block until the kernel can complete them. If the underlying device is slow, unresponsive, or unreachable, these calls can hang indefinitely in kernel space.
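This blocking behavior is easy to observe with a FIFO: a reader parks inside the kernel until a writer shows up, which is the same mechanism behind hangs on dead devices. A portable sketch (the paths here are throwaway names, nothing Solaris-specific):

```shell
#!/bin/sh
# A reader on a FIFO blocks inside the kernel (in open()/read()) until a
# writer appears: the same kind of wait a process enters on a dead device.
fifo=/tmp/blocking_demo_$$
mkfifo "$fifo"

cat "$fifo" >/dev/null &      # parks in the kernel: no writer exists yet
reader=$!
sleep 1                       # give it time to block

# The reader is still alive, waiting inside a blocking system call:
kill -0 "$reader" 2>/dev/null && echo "reader is blocked in the kernel"

echo "data" > "$fifo"         # a writer arrives; the blocked call returns
wait "$reader"                # reader drains the FIFO and exits normally
rm -f "$fifo"
```

Unlike a dead disk, here we can unblock the reader on demand; with real hardware the process stays stuck until the device answers or the driver times out.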

Device Tree and VFS Interaction

Solaris builds a device tree at boot and relies on the Virtual File System (VFS) layer to abstract physical media. When a device path is stale or fails mid-operation, VFS calls block until timeout or kernel intervention.

Impact of ZFS and Disk Pools

While ZFS is fault-tolerant, it is sensitive to slow or degraded disks. In degraded pools, even metadata reads can cause global delays across unrelated processes.

Root Causes

1. Stale or Failing Multipath I/O Paths

When SAN paths are misconfigured or degraded, Solaris may continue attempting access on a failing path, causing long delays before switching to a healthy path.

2. Timeout Misconfiguration

Defaults for disk timeouts or driver-level retries may be too generous, causing operations to block for minutes or more before aborting.

3. Unmounted or Hung NFS Mounts

Stale NFS mounts or remote servers that are unreachable can freeze any process accessing those paths, even indirectly.

4. ZFS Pool Degradation

When a ZFS pool becomes degraded (e.g., due to disk timeout), metadata operations slow down, impacting all processes reading or writing to the pool.

5. Hardware or Virtualization Layer Bugs

In some SPARC and x86 virtualized environments, legacy drivers or hypervisor bugs may block I/O queues, requiring firmware or BIOS updates.

Diagnostics and Tools

1. Identify Hung Processes

ps -eo pid,ppid,state,comm | grep ' D '

Look for processes in uninterruptible sleep (D state), which usually signals I/O blocked at the kernel level.

2. Trace System Calls with DTrace

dtrace -n 'syscall::read:entry /execname == "myapp"/ { @[ustack()] = count(); }'

Trace which system calls are hanging and correlate with offending binaries or files.

3. Monitor Disk Latency

iostat -xn 5

Look for high active service times (asvc_t) or wait-queue times (wsvc_t) on devices; both are common indicators of blocked or slow I/O paths.

4. Use zpool status for ZFS Pools

zpool status -x

Identify degraded or faulted pools that may be delaying read/write access globally; the -x flag prints only pools with problems.

5. Check for Failing Paths with mpathadm

mpathadm show lu /dev/rdsk/cXtYdZs2

Shows active and standby paths. Errors here often reveal problematic multipath setups.

Step-by-Step Fix Strategy

1. Tune Timeout Values for Storage

Adjust kernel tunables such as sd_io_time and scsi_watchdog_tick to shorten how long a failing path is retried before I/O fails over to a healthy one.
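On Solaris these tunables live in /etc/system. A sketch of what such an entry looks like (the values below are purely illustrative; validate against your storage vendor's guidance before applying, and note that a reboot is required for /etc/system changes to take effect):

```
* /etc/system fragment -- illustrative values only.

* Per-command timeout for the sd driver, in seconds (commonly 60 by default)
set sd:sd_io_time=30

* Interval, in seconds, of the SCSI watchdog that scans for timed-out commands
set scsi_watchdog_tick=10
```

Shorter timeouts surface path failures faster, at the cost of giving genuinely slow-but-healthy devices less slack, so tune conservatively.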

2. Replace hard NFS Mounts with soft,intr

Use soft and intr mount options to allow NFS operations to fail rather than hang indefinitely.
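In /etc/vfstab this looks roughly as follows (the server name and paths are hypothetical; option values are a starting point, not a recommendation):

```
# /etc/vfstab entry -- server and paths are hypothetical.
# soft   : return an error instead of retrying forever
# intr   : allow signals to interrupt a pending NFS call
# timeo  : initial retransmit timeout, in TENTHS of a second
# retrans: retries before a soft mount gives up
#
# device to mount     fsck  mount point  type  pass  boot  options
nfssrv:/export/data   -     /mnt/data    nfs   -     yes   soft,intr,timeo=30,retrans=3
```

See the FAQ below before using soft mounts for write-heavy workloads; failing fast is not free.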

3. Automate Health Checks in SMF Services

Incorporate proactive health checks in Solaris SMF service manifests to detect and restart services encountering hangs.
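At minimum, give every method a finite timeout in the manifest so a hang is converted into a fault SMF can act on. A fragment of what that looks like (the service method path here is hypothetical; timeout_seconds is a standard SMF exec_method attribute):

```xml
<!-- SMF manifest fragment (method path is hypothetical). A finite
     timeout_seconds lets svc.startd fault a service whose start method
     hangs on I/O, instead of waiting on it indefinitely. -->
<exec_method type="method" name="start"
             exec="/lib/svc/method/myapp-start"
             timeout_seconds="60" />
```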

4. Rebuild Device Paths After SAN Changes

devfsadm -Cv

Use this to clean up and re-enumerate device paths after a failure or after LUN reconfiguration.

5. Patch and Firmware Update

Ensure HBA firmware, virtualization drivers, and kernel patches are up to date, especially in hybrid SPARC/x86 deployments with old hardware.

Best Practices

  • Use ZFS pools with redundancy (RAIDZ, mirrored) and monitor health daily
  • Configure multipath failover correctly with priority and weight settings
  • Limit dependency on NFS for critical runtime services
  • Include timeout handling in all scripts using df, stat, or find
  • Run periodic DTrace audits on long-running processes
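The timeout-handling bullet can be implemented with a small portable wrapper; a sketch in plain sh (run_with_timeout is a name chosen here, not a Solaris utility, and the 124 exit code simply mirrors GNU timeout's convention):

```shell
#!/bin/sh
# run_with_timeout SECONDS CMD [ARGS...]
# Runs CMD, kills it if it exceeds SECONDS, returns 124 on timeout.
run_with_timeout() {
    secs=$1; shift
    "$@" &                      # launch the possibly-hanging command
    cmd=$!
    ( sleep "$secs"; kill "$cmd" 2>/dev/null ) &   # watchdog
    watchdog=$!
    wait "$cmd"; rc=$?
    kill "$watchdog" 2>/dev/null   # cancel watchdog if CMD finished first
    [ "$rc" -ge 128 ] && rc=124    # killed by signal: treat as timeout
    return "$rc"
}

# Usage: fail fast instead of hanging a script on a dead mount point
run_with_timeout 5 df -k /tmp && echo "mount responsive"
```

Note the caveat from the FAQ: if the command is already stuck in D state, the kill will not reap it; the wrapper still lets the calling script move on and report the hang.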

Conclusion

Intermittent I/O-related hangs in Solaris systems are rarely due to a single root cause. More often, they stem from subtle interactions between device timeouts, filesystem state, and external dependencies like NFS or SAN multipathing. By understanding how Solaris handles blocking I/O and tuning the stack for quick failover and observability, administrators can ensure responsive, fault-tolerant systems that live up to Solaris’s enterprise reputation—even in modern virtualized environments.

FAQs

1. Why does kill -9 not terminate a hung process?

If the process is blocked in kernel space (D state), signals, including SIGKILL, are not delivered until the I/O operation completes or times out. Often only a reboot or a device reset will clear it.

2. Can ZFS cause system-wide hangs?

Yes. If one disk in a pool becomes unresponsive, metadata operations may block, affecting all datasets in the pool.

3. How do I detect which file or device a process is stuck on?

Use pfiles <pid> to inspect open file descriptors or DTrace to trace syscall activity and blocking paths.

4. What’s the risk of using soft NFS mounts?

soft mounts can cause data corruption in write-heavy applications if the remote server becomes unreachable mid-write, but they are safer for read-only or infrequent access.

5. Is Solaris still viable for critical workloads?

Yes—particularly with ZFS and DTrace. But modern tooling, patching discipline, and architecture planning are required for long-term support.