Background: What Makes Exasol Different
MPP, In-Memory, and Columnar Fundamentals
Exasol distributes data across nodes using a shared-nothing MPP architecture. Each node keeps compressed columnar segments in RAM (with a smart page cache) and coordinates via a high-speed interconnect. Query fragments are pushed down to the nodes holding the relevant partitions; intermediate results are exchanged through a distributed execution layer. This design delivers low-latency analytics under parallelism but also means that data skew, network congestion, or node-local resource contention translate directly into query variability.
Where Day-to-Day Troubleshooting Becomes Tricky
- Elastic changes: Node additions, removals, or replacements can leave the system partially rebalanced if maintenance windows end early.
- UDF isolation: Scripting UDFs (Python/R/Java) execute in containerized sandboxes; misaligned OS libraries, memory caps, or file descriptor limits create opaque errors.
- Authentication at scale: Kerberos/LDAP/AD nuances—clock drift, referral chasing, nested groups—cause intermittent login failures that are hard to reproduce.
- Storage pressure: A few wide joins or accidental Cartesian products trigger aggressive spilling, changing the performance profile overnight.
Architecture: Components That Influence Troubleshooting
Cluster Topology and Data Distribution
Tables are either distributed by a hash of the distribution key or replicated (for small dimension tables). Hash distribution drives local join efficiency; picking a poor key yields skew and cross-node traffic. System metadata records partition ownership per node and influences the planner's fragment placement.
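As a concrete illustration, here is a minimal DDL sketch with hypothetical schema, table, and column names; the DISTRIBUTE BY clause assigns the hash distribution key at creation time (Fix B below shows changing it later):

-- Hypothetical fact table distributed by its main join key
CREATE TABLE MY_SCHEMA.BIG_FACT (
    sale_id     DECIMAL(18,0),
    customer_id DECIMAL(18,0),
    sale_date   DATE,
    amount      DECIMAL(12,2),
    DISTRIBUTE BY customer_id
);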
Buffer Cache, Temp Space, and Spill Paths
Exasol's smart page cache keeps hot columnar chunks in RAM. When operators exceed memory grants, the engine spills to temp volumes. Incorrect tempfs sizing, shared underlying storage with other services, or noisy neighbors on virtualized disks can turn a rare spill into a systemic slowdown.
Execution, Exchange, and Network MTU
Distributed plans rely on high-throughput node-to-node exchange. Path MTU mismatches, jumbo frame asymmetry, or misconfigured NIC offloads increase fragmentation and CPU per byte transferred. These issues arise especially after switch replacements or in mixed 25/40/100 GbE fabrics.
UDF Runtime Sandboxes
UDFs run in separate containers or processes with configurable memory/CPU quotas. The orchestration layer decides placement on a node, mounts temporary directories, and exposes connector libraries. ABI or Python wheel incompatibilities surface only under specific data shapes or larger batches.
Diagnostics: A Senior Engineer's Playbook
1) Confirm Whether the Problem Is Local, Skewed, or Systemic
-- Identify long-running or recently failed statements
SELECT session_id, user_name, statement_id, sql_text, start_time, run_time
FROM exa_dba_profile_last
ORDER BY run_time DESC
FETCH FIRST 20 ROWS ONLY;

-- Check per-node contribution for a single statement
SELECT node_id, cpu, temp_db_reads, temp_db_writes, net_bytes_sent, net_bytes_rcvd
FROM exa_dba_profile_last_detail
WHERE statement_id = <ID>
ORDER BY node_id;
If one or two nodes do disproportionate work, suspect data skew, a straggler node (IO/network), or UDF placement issues. If all nodes slow down together, suspect global spill or authentication delays.
2) Spot Data Skew and Distribution Key Mismatches
-- Distribution inspection for a large table
SELECT node_id, COUNT(*) AS rows_on_node
FROM (SELECT exa_node_id AS node_id FROM MY_SCHEMA.BIG_FACT)
GROUP BY node_id
ORDER BY rows_on_node DESC;

-- Check candidate key distinctness for a proposed distribution column
SELECT COUNT(DISTINCT candidate_col) / COUNT(*) AS distinct_ratio
FROM MY_SCHEMA.BIG_FACT;
Row count variance >= 20% across nodes for a big fact table commonly indicates a poor distribution key. A very low distinct_ratio signals hot partitions.
3) Determine If You Are Spilling and Why
-- Temp space usage snapshots
SELECT node_id, temp_db_current, temp_db_max, temp_db_peak_time
FROM exa_dba_nodes;

-- Operator-level indicators for the last statement
SELECT operator_id, operator_name, temp_db_read, temp_db_written, mem_peak
FROM exa_dba_operators
WHERE statement_id = <ID>
ORDER BY temp_db_written DESC;
High temp_db_written on join/sort operators reveals memory pressure. If the spike lines up with a nightly batch, suspect a new query plan or changed statistics.
4) Validate Execution Plan Stability
-- Capture and compare plans
EXPLAIN SELECT ...;

-- Plan hash for regression checks
SELECT statement_id, plan_id, plan_hash, start_time
FROM exa_dba_profile_last
WHERE sql_text LIKE 'SELECT ...'
ORDER BY start_time DESC;
Plan hash churn across identical SQL usually points to stale or missing statistics, a change in table cardinality, or a newly added node changing estimated exchange costs.
5) Network Health, MTU, and Jumbo Frames
-- Network indicators per node
SELECT node_id, net_bytes_sent, net_bytes_rcvd, net_errors
FROM exa_dba_nodes
ORDER BY net_errors DESC;

-- During a problem window, compare exchange-heavy queries
SELECT statement_id, SUM(net_bytes_sent) AS sent, SUM(net_bytes_rcvd) AS rcvd
FROM exa_dba_profile_last_detail
GROUP BY statement_id
ORDER BY sent DESC;
Rising net_errors after switch maintenance, combined with inflated CPU per exchanged GB, often implies an MTU mismatch. Validate NIC and switch MTU end-to-end.
6) UDF Runtime Failures and Memory Caps
-- Most recent UDF errors
SELECT start_time, user_name, schema_name, udf_name, error_message
FROM exa_dba_audit_udf
ORDER BY start_time DESC
FETCH FIRST 50 ROWS ONLY;

-- Node placement for a failing UDF call
SELECT node_id, session_id, statement_id
FROM exa_dba_profile_last_detail
WHERE statement_id = <ID>;
Look for "out of memory" or "cannot import" messages. Failures confined to a single node suggest a container runtime or library mismatch on that host.
7) Authentication and Directory Integration
-- Authentication events (if auditing enabled)
SELECT event_time, user_name, success, auth_method, client_host, message
FROM exa_dba_audit_sessions
ORDER BY event_time DESC;

-- Sessions waiting unusually long to start
SELECT session_id, user_name, connect_time, state, wait_event
FROM exa_dba_sessions
WHERE state <> 'ACTIVE'
ORDER BY connect_time DESC;
Intermittent failures correlating with KDC timeouts or LDAP referral loops point to the identity infrastructure, not the database, as the root cause.
Pitfalls and Anti-Patterns
Choosing a Non-Selective Distribution Key
It seems intuitive to distribute by a business key, but low-cardinality keys (e.g., country_code) create hot partitions. Prefer high-cardinality, join-friendly keys shared across major fact tables.
Excessive Replication of Medium Tables
Replicating "small" tables that slowly grow beyond RAM budgets forces silent spills during joins. Monitor replicated table sizes and switch them to distributed once they cross a safe threshold.
Under-Provisioned Temp Volumes
Storing temp on shared disks with other analytics engines or backups multiplies jitter under batch contention. Temp should be isolated, fast, and sized for peak worst-case joins.
Unpinned UDF Environments
Installing Python wheels without pinning versions leads to random import errors across nodes after OS patching. Freeze per-node environments; rebuild deterministically.
Ignoring MTU Consistency
Mixed MTU across L2/L3 boundaries causes silent fragmentation and CPU burn. Always validate end-to-end MTU symmetry after any network change.
Step-by-Step Fixes
Fix A: Rebalance After Topology Changes
After adding or replacing nodes, confirm that data redistributes evenly. If maintenance ends early, you may have residual skew.
-- Check table-level distribution skew
SELECT table_schema, table_name, node_id, rows_on_node
FROM exa_dba_table_distribution
ORDER BY table_schema, table_name, rows_on_node DESC;

-- Trigger or resume rebalancing for specific tables (example approach)
REORGANIZE TABLE MY_SCHEMA.BIG_FACT;
-- Or for the entire schema, iterate via admin scripting
Schedule REORGANIZE during low-traffic windows and track progress by sampling exa_dba_table_distribution. Afterwards, verify plan times for representative joins.
Fix B: Repair Data Skew via Distribution Key Redesign
Identify hot joins and select a distribution key that aligns with the largest, most frequent join. Favor a high-cardinality column or a composite hash.
-- Evaluate candidate join columns for key selection (distinctness per table)
SELECT COUNT(DISTINCT join_key) AS d_a, COUNT(*) AS n_a FROM FACT_A;
SELECT COUNT(DISTINCT join_key) AS d_b, COUNT(*) AS n_b FROM FACT_B;

-- Implement a new distribution key
ALTER TABLE MY_SCHEMA.BIG_FACT DISTRIBUTE BY join_key;
REORGANIZE TABLE MY_SCHEMA.BIG_FACT;
Test critical queries before and after in a canary schema. Diff plan_hash and network bytes exchanged.
Fix C: Reduce Spill Through Memory-Aware Joins and Sorts
Rewrite queries to reduce intermediate cardinalities and leverage replicated dimensions where appropriate.
-- Example: early filter to reduce join size
WITH f AS (
  SELECT *
  FROM FACT_SALES
  WHERE sale_date BETWEEN DATE '2025-06-01' AND DATE '2025-06-30'
)
SELECT /*+ USE_REPL_TABLE(dim_customers) */ f.*, d.segment
FROM f
JOIN dim_customers d ON f.customer_id = d.customer_id;

-- Monitor operator temp writes
SELECT operator_id, operator_name, temp_db_written
FROM exa_dba_operators
WHERE statement_id = <ID> AND temp_db_written > 0;
Consider pre-aggregating on the fact side before wide joins. If large sorts dominate, add ORDER BY only where it is genuinely needed.
Fix D: Standardize MTU and Validate Network Offloads
Align MTU across NICs, switches, and VLANs; disable problematic offloads if packet fragmentation persists.
# On Linux nodes (example)
ip link show dev eth0
ip link set dev eth0 mtu 9000

# Validate end-to-end with a don't-fragment probe
ping -M do -s 8972 <peer_node_ip>

# Check NIC offloads
ethtool -k eth0 | grep -E '(tx|rx).*offload'
After changes, rerun heavy redistribution or exchange-heavy analytic queries and compare net_bytes_sent vs CPU usage per node.
Fix E: Harden UDF Environments
Pin interpreter versions and libraries; enforce deterministic builds per node.
# Inside your UDF build pipeline
pip install --no-cache-dir -r requirements.txt
pip freeze | sort > requirements.lock

-- In Exasol, set UDF memory to a safe ceiling
ALTER SESSION SET SCRIPT_LANGUAGES = 'PYTHON3 JAVA R';
ALTER SESSION SET SCRIPT_MEMORY_LIMIT = 2048; -- MB
Catch resource leaks by streaming in bounded batches and releasing buffers promptly. Log key metrics from within the UDF to a diagnostics table.
Fix F: Stabilize Authentication and Directory Queries
Ensure time sync, tune LDAP referrals, and cache group memberships for hot accounts.
# Check and enforce NTP/chrony sync
chronyc sources -v

-- Investigate slow logins
SELECT event_time, user_name, message
FROM exa_dba_audit_sessions
WHERE message LIKE '%LDAP%' OR message LIKE '%Kerberos%'
ORDER BY event_time DESC;

-- Create DB roles mirroring nested AD groups to avoid deep group expansion per login
CREATE ROLE ROLE_ANALYSTS;
GRANT ROLE ROLE_ANALYSTS TO USER JANE_D;
-- Map enterprise groups to roles via a periodic sync job
If the organization uses cross-forest trusts, prefer explicit UPN suffixes and document the supported identity paths to reduce ambiguity.
Fix G: Temp and Persistence Storage Provisioning
Isolate temp volumes on fast SSD/NVMe, size for peak plus headroom, and track trends.
-- Capacity and trend monitoring
SELECT node_id, temp_db_current, temp_db_max, temp_db_peak_time
FROM exa_dba_nodes;

-- Alert when headroom falls below 20%
SELECT node_id,
       CASE WHEN (temp_db_max - temp_db_current) / temp_db_max < 0.2
            THEN 'ALERT' ELSE 'OK' END AS status
FROM exa_dba_nodes;
Co-locating temp with backup targets is a common anti-pattern. Separate the paths and throttle concurrent batch workloads if you cannot isolate physically.
Fix H: Statistics Governance
Automate statistics refreshes for high-churn tables and after major data loads; lock or seed statistics for stable dimensions.
-- Identify tables with stale stats (example heuristic)
SELECT table_schema, table_name, last_statistics_update, row_count
FROM exa_dba_tables
WHERE CURRENT_DATE - last_statistics_update > 3
ORDER BY last_statistics_update;

-- Refresh
ALTER TABLE MY_SCHEMA.BIG_FACT COMPUTE STATISTICS;
Combine with a deployment guard: compare plan_hash of canary queries pre- and post-refresh; if a hash changes unexpectedly, run the query in a safe sandbox before promoting.
Performance Optimization Patterns
Design Joins to Exploit Locality
Distribute large fact tables by the same high-cardinality join key; replicate tiny dimensions. This minimizes shuffles and reduces net_bytes_sent.
Predicate Pushdown and Late Materialization
Use selective filters as early as possible. Exasol's vectorized execution benefits when fewer columns are touched; avoid SELECT * in wide facts, especially before joins.
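A before/after sketch with hypothetical table and column names; the rewritten form projects only the needed columns and filters the fact before the join:

-- Anti-pattern: wide scan, late filter
SELECT *
FROM FACT_SALES f
JOIN DIM_CUSTOMERS d ON f.customer_id = d.customer_id
WHERE d.segment = 'ENTERPRISE' AND f.sale_date >= DATE '2025-06-01';

-- Preferred: project only the needed columns and filter the fact early
SELECT f.sale_id, f.amount, d.segment
FROM (
    SELECT sale_id, customer_id, amount
    FROM FACT_SALES
    WHERE sale_date >= DATE '2025-06-01'
) f
JOIN DIM_CUSTOMERS d ON f.customer_id = d.customer_id
WHERE d.segment = 'ENTERPRISE';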
Batching and Window Functions
Compute aggregates in stages. For complex window specs, stage intermediate aggregates into temporary tables to cap memory and simplify plans.
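One way to stage the work, sketched with hypothetical names: collapse the raw rows into a daily aggregate first, then run the window function over the much smaller intermediate table.

-- Stage 1: collapse raw rows to one row per customer and day
CREATE TABLE STG.DAILY_SALES AS
SELECT customer_id, sale_date, SUM(amount) AS day_amount
FROM FACT_SALES
GROUP BY customer_id, sale_date;

-- Stage 2: run the window over the compact staging table
SELECT customer_id, sale_date,
       SUM(day_amount) OVER (PARTITION BY customer_id ORDER BY sale_date) AS cumulative_amount
FROM STG.DAILY_SALES;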
UDF Memory Hygiene
Prefer streaming iterators over full materialization. In Python, release references to large NumPy arrays, call GC between batches for long-lived workers, and cap per-row payload sizes.
Concurrency Governors
Throttle background maintenance during critical BI windows. Set resource caps per service user and separate ETL from interactive BI users with roles and queue priorities.
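If your deployment supports consumer groups (Exasol 7.0 and later; earlier versions use priority groups), a sketch along these lines separates ETL from interactive BI. The group names, weights, and user mappings are assumptions to tune; check the exact clauses against your version's documentation:

-- Hypothetical groups: favor interactive BI over batch ETL
CREATE CONSUMER GROUP BI_INTERACTIVE WITH CPU_WEIGHT = 300;
CREATE CONSUMER GROUP ETL_BATCH WITH CPU_WEIGHT = 100;

-- Route the service users into their groups
ALTER USER BI_SERVICE SET CONSUMER_GROUP = BI_INTERACTIVE;
ALTER USER ETL_SERVICE SET CONSUMER_GROUP = ETL_BATCH;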
Operational Checklists
After Any Node or Network Maintenance
- Verify MTU symmetry and NIC offloads.
- Run a synthetic exchange-heavy benchmark to compare throughput with baseline.
- Check exa_dba_table_distribution for drift; schedule REORGANIZE if needed.
Before Rolling Out a New BI Model
- Confirm distribution keys for new big facts align with existing stars.
- Compute initial statistics and capture canary plan hashes.
- Load-test with expected concurrency; inspect spill behavior.
When Introducing or Updating UDFs
- Pin interpreter and dependency versions; bake immutable images for all nodes.
- Set an explicit SCRIPT_MEMORY_LIMIT; exercise worst-case batch sizes.
- Write UDF logs to a diagnostics table with node_id, batch_id, rss_bytes.
Edge Case Scenarios and How to Handle Them
1) Sudden Performance Collapse After a Small Data Load
Cause: statistics changed cardinality estimates, leading the planner to choose a broadcast or repartition that spills. Fix: recompute statistics for all join partners, compare plan hashes, and force a distribution-preserving join if necessary while you stabilize data models.
2) Only One Node Shows Elevated Temp Writes
Cause: that node hosts the hot partitions due to incomplete rebalancing or a skewed key. Fix: REORGANIZE the table and validate evenness; if workload-specific, adopt a composite distribution key.
3) UDF Works in Dev but Fails in Prod
Cause: prod nodes have a slightly different libc, Python minor version, or kernel flag limiting file descriptors. Fix: standardize base images, validate ulimit -n, and rebuild wheels against the prod ABI; add startup self-tests that import key modules and perform a small vector operation.
4) Intermittent Logins Time Out During Peak
Cause: LDAP/Kerberos query amplification from nested groups plus GC pauses on the directory server. Fix: cache groups in roles, reduce nested depth, and ensure chrony sync; if needed, add a read-only replica KDC/LDAP near the cluster.
5) Queries Slower Only After Backup Windows
Cause: backup jobs saturate the same storage path as temp space. Fix: isolate paths, re-schedule overlap, or throttle backup throughput; validate by correlating temp_db_written spikes with backup logs.
Long-Term Architectural Solutions
Data Distribution Governance
Codify distribution key selection in the modeling standard. Provide a linter that checks candidate tables for key cardinality and alignment with major joins before DDL is accepted into mainline.
Immutable Runtime Images for UDFs
Build OCI images per language with pinned OS and library versions; sign them and roll out atomically to all nodes. Embed a health probe that runs on container start and writes a pass/fail record.
Network SLOs and Continuous Validation
Publish SLOs for node-to-node throughput, latency, and MTU validity. Nightly jobs should transfer fixed-size buffers across all pairs and store metrics to a time series table for anomaly detection.
Workload Segmentation
Separate ETL, ad hoc, and dashboard workloads by roles and resource caps. Provide predictable performance for BI by limiting heavy ELT during business hours and using change-data-capture micro-batches instead of monolithic loads.
Observability as a First-Class Citizen
Continuously export exa_dba_* metrics to your observability stack. Alert on plan hash churn for top N queries, skew thresholds, and rising temp writes per node; add runbooks linking alerts to the scripts shown in this article.
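As a starting point, here is a skew alert sketch built on the same profiling views used earlier; the 2x threshold is an assumption, and the result set can be shipped to the observability stack on a schedule:

-- Statements whose busiest node exchanged more than twice the per-node average
SELECT statement_id,
       MAX(net_bytes_sent) AS max_node_bytes,
       AVG(net_bytes_sent) AS avg_node_bytes
FROM exa_dba_profile_last_detail
GROUP BY statement_id
HAVING MAX(net_bytes_sent) > 2 * AVG(net_bytes_sent)
ORDER BY max_node_bytes DESC;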
Best Practices (Quick Reference)
- Pick high-cardinality, join-aligned distribution keys; avoid low-selectivity business codes.
- Replicate only genuinely small dimensions; switch to distributed when growth warrants.
- Isolate and right-size temp volumes; monitor spill trends.
- Pin and validate UDF runtimes; log resource usage per batch.
- Enforce end-to-end MTU symmetry; verify after any network change.
- Automate statistics refresh; gate releases on plan hash stability.
- Segment workloads and set resource caps to protect BI latencies.
- Continuously export and alert on exa_dba_* metrics with actionable thresholds.
Conclusion
Exasol's MPP, in-memory, columnar engine delivers exceptional analytics performance—but its very strengths amplify architectural missteps. Intermittent slowdowns typically trace back to distribution skew, incomplete rebalancing, spills from unexpected plan shifts, network MTU asymmetry, or brittle UDF environments. Senior practitioners can tame these issues by instrumenting the right system views, governing distribution keys, standardizing UDF runtimes, isolating temp storage, and treating the interconnect as part of the database. With these practices institutionalized, performance becomes consistent, maintenance predictable, and SLAs defensible even as data volumes and user counts grow.
FAQs
1. How do I distinguish planner errors from environmental issues?
Compare plan_hash and operator metrics across runs. Stable plans with rising net_errors or temp writes point to the environment; changing plan hashes suggest statistics or data drift.
2. What's the fastest way to validate node-to-node health after a network change?
Run a synthetic exchange test and inspect net_bytes_sent per node alongside CPU per GB. Also verify MTU end-to-end with a DF ping to catch fragmentation.
3. When should a replicated dimension become distributed?
When its size or access frequency triggers measurable spills or pushes RAM pressure during major joins. Monitor replicated table size and switch to distributed once it crosses your replication budget.
4. How can I make UDFs resilient to OS patching?
Ship UDFs in immutable, pinned images; run a startup self-test; and store a requirements.lock to guarantee deterministic rebuilds across nodes.
5. Why did a small incremental load cause a big regression?
It likely tipped a cardinality threshold that changed the join strategy. Recompute statistics on all join partners and compare plan hashes; consider constraining the planner temporarily while you refactor.