Understanding I/O Blocking and Process Hangs
What Happens?
- Processes hang in uninterruptible sleep (state: D)
- Commands like ls, df, or vi freeze on certain mount points
- Even kill -9 fails to terminate affected processes
- Disks or LUNs show I/O latency spikes without consistent patterns
Why It Matters
These hangs degrade system responsiveness, delay automation scripts and backups, and can even lead to service downtime. In virtualized environments or with multipath disk subsystems, misconfigured paths exacerbate the issue.
Solaris I/O Subsystem Overview
Blocking System Calls
Operations like read(), open(), or stat() block until data is returned. If the underlying device is slow, unresponsive, or unreachable, these calls can hang indefinitely in kernel space.
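To see this directly, attach truss to an affected process (the PID below is a placeholder):
truss -p 1234
If the last call truss prints is an unfinished read(), open(), or stat(), the process is parked inside the kernel waiting on the device.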
Device Tree and VFS Interaction
Solaris builds a device tree at boot and relies on the Virtual File System (VFS) layer to abstract physical media. When a device path is stale or fails mid-operation, VFS calls block until timeout or kernel intervention.
Impact of ZFS and Disk Pools
While ZFS is fault-tolerant, it is sensitive to slow or degraded disks. In degraded pools, even metadata reads can cause global delays across unrelated processes.
Root Causes
1. Stale or Failing Multipath I/O Paths
When SAN paths are misconfigured or degraded, Solaris may continue attempting access on a failing path, causing long delays before switching to a healthy path.
2. Timeout Misconfiguration
Defaults for disk timeouts or driver-level retries may be too generous, causing operations to block for minutes or more before aborting.
3. Unmounted or Hung NFS Mounts
Stale NFS mounts or remote servers that are unreachable can freeze any process accessing those paths, even indirectly.
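To check how existing NFS filesystems are mounted, list them with their in-effect options:
nfsstat -m
Hard mounts without the intr option are the ones most likely to wedge processes when the server disappears.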
4. ZFS Pool Degradation
When a ZFS pool becomes degraded (e.g., due to disk timeout), metadata operations slow down, impacting all processes reading or writing to the pool.
5. Hardware or Virtualization Layer Bugs
In some SPARC and x86 virtualized environments, legacy drivers or hypervisor bugs may block I/O queues, requiring firmware or BIOS updates.
Diagnostics and Tools
1. Identify Hung Processes
ps -eo pid,ppid,s,comm | grep ' D '
Look for processes in uninterruptible sleep (D state), which usually signal blocked kernel-level I/O.
2. Trace System Calls with DTrace
dtrace -n 'syscall::read:entry /execname == "myapp"/ { @[ustack()] = count(); }'
Trace which system calls are hanging and correlate with offending binaries or files.
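Counting stacks shows who is issuing the reads; to measure how long each call actually takes, the same probe pair can be timed. This is a sketch: "myapp" is a placeholder, and read can be swapped for any other syscall of interest.
dtrace -n '
syscall::read:entry /execname == "myapp"/ { self->ts = timestamp; }
syscall::read:return /self->ts/ { @lat = quantize(timestamp - self->ts); self->ts = 0; }'
Long tails in the resulting nanosecond distribution (tens of seconds and up) point at the device rather than the application.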
3. Monitor Disk Latency
iostat -xn 5
Look for high wait-queue and active service times (wsvc_t and asvc_t in the -xn output) on devices; both are common indicators of blocked or slow I/O paths.
4. Use zpool status for ZFS Pools
Identify degraded or faulted pools that may be delaying read/write access globally.
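As a quick check, -x limits output to pools that are not healthy and -v adds per-device error detail:
zpool status -xv
A pool reported as DEGRADED, or any device listed as FAULTED or UNAVAIL, is a likely source of the stalls.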
5. Check for Failing Paths with mpathadm
mpathadm show lu /dev/rdsk/cXtYdZs2
Shows active and standby paths. Errors here often reveal problematic multipath setups.
Step-by-Step Fix Strategy
1. Tune Timeout Values for Storage
Adjust values like sd_io_time and scsi_watchdog_tick to reduce how long a failed path is retried before failing over.
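A minimal /etc/system sketch, assuming the sd driver; the value below is illustrative only, should be validated against the storage vendor's guidance, and requires a reboot to take effect:
* Illustrative only: lower the per-command I/O timeout from the
* 60-second default (value in seconds; 0x14 = 20).
set sd:sd_io_time=0x14
Equivalent set lines apply to the other tunables named above.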
2. Replace hard NFS Mounts with soft,intr
Use the soft and intr mount options to allow NFS operations to fail rather than hang indefinitely.
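A minimal sketch of such a mount; the server, export, and mount point are placeholders, and timeo is expressed in tenths of a second:
mount -F nfs -o soft,intr,timeo=30,retrans=3 nfsserver:/export/data /mnt/data
Add the same option string to the matching /etc/vfstab entry so it persists across reboots.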
3. Automate Health Checks in SMF Services
Incorporate proactive health checks in Solaris SMF service manifests to detect and restart services encountering hangs.
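One way to do this is a small ksh method script wired into a site-specific manifest. The sketch below is illustrative only; the probe, threshold, and exit handling are assumptions to adapt.
#!/bin/ksh
# Hypothetical SMF health-check method (sketch): probe the filesystem with df
# and signal a fatal error to SMF if the probe has not returned in 10 seconds.
. /lib/svc/share/smf_include.sh
df -k >/dev/null 2>&1 &
probe=$!
waited=0
while kill -0 $probe 2>/dev/null; do
    if [ $waited -ge 10 ]; then
        kill $probe 2>/dev/null      # may not die if stuck in uninterruptible I/O
        exit $SMF_EXIT_ERR_FATAL     # report a fatal error to the restarter
    fi
    sleep 1
    waited=$((waited + 1))
done
exit $SMF_EXIT_OK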
4. Rebuild Device Paths After SAN Changes
devfsadm -Cv
Use this to clean up and re-enumerate device paths post-failure or after LUN reconfiguration.
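After re-enumeration it is worth confirming that the fabric and multipath views agree; a typical, transport-dependent check:
cfgadm -al
mpathadm list lu
Each LUN should show its expected number of operational paths before it is returned to service.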
5. Patch and Firmware Update
Ensure HBA firmware, virtualization drivers, and kernel patches are up to date, especially in hybrid SPARC/x86 deployments with old hardware.
Best Practices
- Use ZFS pools with redundancy (RAIDZ, mirrored) and monitor health daily
- Configure multipath failover correctly with priority and weight settings
- Limit dependency on NFS for critical runtime services
- Include timeout handling in all scripts using df, stat, or find
- Run periodic DTrace audits on long-running processes
Conclusion
Intermittent I/O-related hangs in Solaris systems are rarely due to a single root cause. More often, they stem from subtle interactions between device timeouts, filesystem state, and external dependencies like NFS or SAN multipathing. By understanding how Solaris handles blocking I/O and tuning the stack for quick failover and observability, administrators can ensure responsive, fault-tolerant systems that live up to Solaris’s enterprise reputation—even in modern virtualized environments.
FAQs
1. Why does kill -9 not terminate a hung process?
If the process is in kernel space (state D), it cannot be killed until the I/O operation completes or times out. Only a reboot or device reset will clear it.
2. Can ZFS cause system-wide hangs?
Yes. If one disk in a pool becomes unresponsive, metadata operations may block, affecting all datasets in the pool.
3. How do I detect which file or device a process is stuck on?
Use pfiles <pid> to inspect open file descriptors, or DTrace to trace syscall activity and blocking paths.
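For example (the PID is a placeholder):
pfiles 1234
pstack 1234
pfiles maps each open descriptor to a path or device, and pstack shows the call stack the process is parked in. Note that pfiles itself may stall if the target is wedged in the kernel, in which case DTrace is the safer route.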
4. What’s the risk of using soft NFS mounts?
soft mounts can cause data corruption in write-heavy apps if the remote server drops, but they are safer for read-only or infrequent access.
5. Is Solaris still viable for critical workloads?
Yes—particularly with ZFS and DTrace. But modern tooling, patching discipline, and architecture planning are required for long-term support.