Background: What Makes Exasol Different
MPP, In-Memory, and Columnar Fundamentals
Exasol distributes data across nodes using a shared-nothing MPP architecture. Each node keeps compressed columnar segments in RAM (with a smart page cache) and coordinates via a high-speed interconnect. Query fragments are pushed down to the nodes holding the relevant partitions; intermediate results are exchanged through a distributed execution layer. This design delivers low-latency analytics under parallelism but also means that data skew, network congestion, or node-local resource contention translate directly into query variability.
Where Day-to-Day Troubleshooting Becomes Tricky
- Elastic changes: Node additions, removals, or replacements can leave the system partially rebalanced if maintenance windows end early.
- UDF isolation: Scripting UDFs (Python/R/Java) execute in containerized sandboxes; misaligned OS libraries, memory caps, or file descriptor limits create opaque errors.
- Authentication at scale: Kerberos/LDAP/AD nuances—clock drift, referral chasing, nested groups—cause intermittent login failures that are hard to reproduce.
- Storage pressure: A few wide joins or accidental Cartesian products trigger aggressive spilling, changing the performance profile overnight.
Architecture: Components That Influence Troubleshooting
Cluster Topology and Data Distribution
Tables are either distributed by a hash of the distribution key or replicated (for small dimension tables). Hash distribution drives local join efficiency; picking a poor key yields skew and cross-node traffic. System metadata records partition ownership per node and influences the planner's fragment placement.
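As a concrete illustration, here is a minimal DDL sketch with hypothetical schema, table, and column names; the DISTRIBUTE BY clause assigns the hash distribution key at creation time (Fix B below shows changing it later):

-- Hypothetical fact table distributed by its main join key
CREATE TABLE MY_SCHEMA.BIG_FACT (
    sale_id     DECIMAL(18,0),
    customer_id DECIMAL(18,0),
    sale_date   DATE,
    amount      DECIMAL(12,2),
    DISTRIBUTE BY customer_id
);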
Buffer Cache, Temp Space, and Spill Paths
Exasol's smart page cache keeps hot columnar chunks in RAM. When operators exceed memory grants, the engine spills to temp volumes. Incorrect tempfs sizing, shared underlying storage with other services, or noisy neighbors on virtualized disks can turn a rare spill into a systemic slowdown.
Execution, Exchange, and Network MTU
Distributed plans rely on high-throughput node-to-node exchange. Path MTU mismatches, jumbo frame asymmetry, or misconfigured NIC offloads increase fragmentation and CPU per byte transferred. These issues arise especially after switch replacements or in mixed 25/40/100 GbE fabrics.
UDF Runtime Sandboxes
UDFs run in separate containers or processes with configurable memory/CPU quotas. The orchestration layer decides placement on a node, mounts temporary directories, and exposes connector libraries. ABI or Python wheel incompatibilities surface only under specific data shapes or larger batches.
Diagnostics: A Senior Engineer's Playbook
1) Confirm Whether the Problem Is Local, Skewed, or Systemic
-- Identify long-running or recently failed statements
SELECT session_id, user_name, statement_id, sql_text, start_time, run_time
FROM exa_dba_profile_last
ORDER BY run_time DESC
FETCH FIRST 20 ROWS ONLY;

-- Check per-node contribution for a single statement
SELECT node_id, cpu, temp_db_reads, temp_db_writes, net_bytes_sent, net_bytes_rcvd
FROM exa_dba_profile_last_detail
WHERE statement_id = <ID>
ORDER BY node_id;
If one or two nodes do disproportionate work, suspect data skew, a straggler node (IO/network), or UDF placement issues. If all nodes slow down together, suspect global spill or authentication delays.
2) Spot Data Skew and Distribution Key Mismatches
-- Distribution inspection for a large table
SELECT node_id, COUNT(*) AS rows_on_node
FROM (SELECT exa_node_id AS node_id FROM MY_SCHEMA.BIG_FACT)
GROUP BY node_id
ORDER BY rows_on_node DESC;

-- Check candidate key distinctness for a proposed distribution column
SELECT COUNT(DISTINCT candidate_col) / COUNT(*) AS distinct_ratio
FROM MY_SCHEMA.BIG_FACT;
Row count variance >= 20% across nodes for a big fact table commonly indicates a poor distribution key. A very low distinct_ratio signals hot partitions.
3) Determine If You Are Spilling and Why
-- Temp space usage snapshots
SELECT node_id, temp_db_current, temp_db_max, temp_db_peak_time
FROM exa_dba_nodes;

-- Operator-level indicators for the last statement
SELECT operator_id, operator_name, temp_db_read, temp_db_written, mem_peak
FROM exa_dba_operators
WHERE statement_id = <ID>
ORDER BY temp_db_written DESC;
High temp_db_written on join/sort operators reveals memory pressure. If the spike lines up with a nightly batch, suspect a new query plan or changed statistics.
4) Validate Execution Plan Stability
-- Capture and compare plans
EXPLAIN SELECT ...;

-- Plan hash for regression checks
SELECT statement_id, plan_id, plan_hash, start_time
FROM exa_dba_profile_last
WHERE sql_text LIKE 'SELECT ...'
ORDER BY start_time DESC;
Plan hash churn across identical SQL usually points to stale or missing statistics, a change in table cardinality, or a newly added node changing estimated exchange costs.
5) Network Health, MTU, and Jumbo Frames
-- Network indicators per node
SELECT node_id, net_bytes_sent, net_bytes_rcvd, net_errors
FROM exa_dba_nodes
ORDER BY net_errors DESC;

-- During a problem window, compare exchange-heavy queries
SELECT statement_id, SUM(net_bytes_sent) AS sent, SUM(net_bytes_rcvd) AS rcvd
FROM exa_dba_profile_last_detail
GROUP BY statement_id
ORDER BY sent DESC;
Rising net_errors after switch maintenance, combined with inflated CPU per exchanged GB, often implies an MTU mismatch. Validate NIC and switch MTU end-to-end.
6) UDF Runtime Failures and Memory Caps
-- Most recent UDF errors
SELECT start_time, user_name, schema_name, udf_name, error_message
FROM exa_dba_audit_udf
ORDER BY start_time DESC
FETCH FIRST 50 ROWS ONLY;

-- Node placement for a failing UDF call
SELECT node_id, session_id, statement_id
FROM exa_dba_profile_last_detail
WHERE statement_id = <ID>;
Look for "out of memory" or "cannot import" messages. Failures confined to a single node suggest a container runtime or library mismatch on that host.
7) Authentication and Directory Integration
-- Authentication events (if auditing enabled)
SELECT event_time, user_name, success, auth_method, client_host, message
FROM exa_dba_audit_sessions
ORDER BY event_time DESC;

-- Sessions waiting unusually long to start
SELECT session_id, user_name, connect_time, state, wait_event
FROM exa_dba_sessions
WHERE state <> 'ACTIVE'
ORDER BY connect_time DESC;
Intermittent failures correlating with KDC timeouts or LDAP referral loops point to the identity infrastructure, not the database, as the root cause.
Pitfalls and Anti-Patterns
Choosing a Non-Selective Distribution Key
It seems intuitive to distribute by a business key, but low-cardinality keys (e.g., country_code) create hot partitions. Prefer high-cardinality, join-friendly keys shared across major fact tables.
Excessive Replication of Medium Tables
Replicating "small" tables that slowly grow beyond RAM budgets forces silent spills during joins. Monitor replicated table sizes and switch them to distributed once they cross a safe threshold.
Under-Provisioned Temp Volumes
Storing temp on shared disks with other analytics engines or backups multiplies jitter under batch contention. Temp should be isolated, fast, and sized for peak worst-case joins.
Unpinned UDF Environments
Installing Python wheels without pinning versions leads to random import errors across nodes after OS patching. Freeze per-node environments; rebuild deterministically.
Ignoring MTU Consistency
Mixed MTU across L2/L3 boundaries causes silent fragmentation and CPU burn. Always validate end-to-end MTU symmetry after any network change.
Step-by-Step Fixes
Fix A: Rebalance After Topology Changes
After adding or replacing nodes, confirm that data redistributes evenly. If maintenance ends early, you may have residual skew.
-- Check table-level distribution skew
SELECT table_schema, table_name, node_id, rows_on_node
FROM exa_dba_table_distribution
ORDER BY table_schema, table_name, rows_on_node DESC;

-- Trigger or resume rebalancing for specific tables (example approach)
REORGANIZE TABLE MY_SCHEMA.BIG_FACT;
-- Or for the entire schema, iterate via admin scripting
Schedule REORGANIZE during low-traffic windows and track progress by sampling exa_dba_table_distribution. Afterwards, verify plan times for representative joins.
Fix B: Repair Data Skew via Distribution Key Redesign
Identify hot joins and select a distribution key that aligns with the largest, most frequent join. Favor a high-cardinality column or a composite hash.
-- Evaluate candidate join columns for key selection (distinctness per table)
SELECT COUNT(DISTINCT join_key) AS d_a, COUNT(*) AS n_a FROM FACT_A;
SELECT COUNT(DISTINCT join_key) AS d_b, COUNT(*) AS n_b FROM FACT_B;

-- Implement a new distribution key
ALTER TABLE MY_SCHEMA.BIG_FACT DISTRIBUTE BY join_key;
REORGANIZE TABLE MY_SCHEMA.BIG_FACT;
Test critical queries before and after in a canary schema. Diff plan_hash and network bytes exchanged.
Fix C: Reduce Spill Through Memory-Aware Joins and Sorts
Rewrite queries to reduce intermediate cardinalities and leverage replicated dimensions where appropriate.
-- Example: early filter to reduce join size
WITH f AS (
  SELECT *
  FROM FACT_SALES
  WHERE sale_date BETWEEN DATE '2025-06-01' AND DATE '2025-06-30'
)
SELECT /*+ USE_REPL_TABLE(dim_customers) */ f.*, d.segment
FROM f
JOIN dim_customers d ON f.customer_id = d.customer_id;

-- Monitor operator temp writes
SELECT operator_id, operator_name, temp_db_written
FROM exa_dba_operators
WHERE statement_id = <ID> AND temp_db_written > 0;
Consider pre-aggregating on the fact side before wide joins. If large sorts dominate, add ORDER BY only where it is genuinely needed.
Fix D: Standardize MTU and Validate Network Offloads
Align MTU across NICs, switches, and VLANs; disable problematic offloads if packet fragmentation persists.
# On Linux nodes (example)
ip link show dev eth0
ip link set dev eth0 mtu 9000

# Validate end-to-end with a don't-fragment probe
ping -M do -s 8972 <peer_node_ip>

# Check NIC offloads
ethtool -k eth0 | grep -E '(tx|rx).*offload'
After changes, rerun heavy redistribution or exchange-heavy analytic queries and compare net_bytes_sent vs CPU usage per node.
Fix E: Harden UDF Environments
Pin interpreter versions and libraries; enforce deterministic builds per node.
# Inside your UDF build pipeline
pip install --no-cache-dir -r requirements.txt
pip freeze | sort > requirements.lock

-- In Exasol, set UDF memory to a safe ceiling
ALTER SESSION SET SCRIPT_LANGUAGES = 'PYTHON3 JAVA R';
ALTER SESSION SET SCRIPT_MEMORY_LIMIT = 2048; -- MB
Catch resource leaks by streaming in bounded batches and releasing buffers promptly. Log key metrics from within the UDF to a diagnostics table.
Fix F: Stabilize Authentication and Directory Queries
Ensure time sync, tune LDAP referrals, and cache group memberships for hot accounts.
# Check and enforce NTP/chrony sync
chronyc sources -v

-- Investigate slow logins
SELECT event_time, user_name, message
FROM exa_dba_audit_sessions
WHERE message LIKE '%LDAP%' OR message LIKE '%Kerberos%'
ORDER BY event_time DESC;

-- Create DB roles mirroring nested AD groups to avoid deep group expansion per login
CREATE ROLE ROLE_ANALYSTS;
GRANT ROLE ROLE_ANALYSTS TO USER JANE_D;
-- Map enterprise groups to roles via a periodic sync job
If the organization uses cross-forest trusts, prefer explicit UPN suffixes and document the supported identity paths to reduce ambiguity.
Fix G: Temp and Persistence Storage Provisioning
Isolate temp volumes on fast SSD/NVMe, size for peak plus headroom, and track trends.
-- Capacity and trend monitoring
SELECT node_id, temp_db_current, temp_db_max, temp_db_peak_time
FROM exa_dba_nodes;

-- Alert when headroom falls below 20%
SELECT node_id,
       CASE WHEN (temp_db_max - temp_db_current) / temp_db_max < 0.2
            THEN 'ALERT' ELSE 'OK' END AS status
FROM exa_dba_nodes;
Co-locating temp with backup targets is a common anti-pattern. Separate the paths and throttle concurrent batch workloads if you cannot isolate physically.
Fix H: Statistics Governance
Automate statistics refreshes for high-churn tables and after major data loads; lock or seed statistics for stable dimensions.
-- Identify tables with stale stats (example heuristic)
SELECT table_schema, table_name, last_statistics_update, row_count
FROM exa_dba_tables
WHERE CURRENT_DATE - last_statistics_update > 3
ORDER BY last_statistics_update;

-- Refresh
ALTER TABLE MY_SCHEMA.BIG_FACT COMPUTE STATISTICS;
Combine with a deployment guard: compare plan_hash of canary queries pre- and post-refresh; if a hash changes unexpectedly, run the query in a safe sandbox before promoting.
Performance Optimization Patterns
Design Joins to Exploit Locality
Distribute large fact tables by the same high-cardinality join key; replicate tiny dimensions. This minimizes shuffles and reduces net_bytes_sent.
Predicate Pushdown and Late Materialization
Use selective filters as early as possible. Exasol's vectorized execution benefits when fewer columns are touched; avoid SELECT * in wide facts, especially before joins.
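A before/after sketch with hypothetical table and column names; the rewritten form projects only the needed columns and filters the fact before the join:

-- Anti-pattern: wide scan, late filter
SELECT *
FROM FACT_SALES f
JOIN DIM_CUSTOMERS d ON f.customer_id = d.customer_id
WHERE d.segment = 'ENTERPRISE' AND f.sale_date >= DATE '2025-06-01';

-- Preferred: project only the needed columns and filter the fact early
SELECT f.sale_id, f.amount, d.segment
FROM (
    SELECT sale_id, customer_id, amount
    FROM FACT_SALES
    WHERE sale_date >= DATE '2025-06-01'
) f
JOIN DIM_CUSTOMERS d ON f.customer_id = d.customer_id
WHERE d.segment = 'ENTERPRISE';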
Batching and Window Functions
Compute aggregates in stages. For complex window specs, stage intermediate aggregates into temporary tables to cap memory and simplify plans.
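One way to stage the work, sketched with hypothetical names: collapse the raw rows into a daily aggregate first, then run the window function over the much smaller intermediate table.

-- Stage 1: collapse raw rows to one row per customer and day
CREATE TABLE STG.DAILY_SALES AS
SELECT customer_id, sale_date, SUM(amount) AS day_amount
FROM FACT_SALES
GROUP BY customer_id, sale_date;

-- Stage 2: run the window over the compact staging table
SELECT customer_id, sale_date,
       SUM(day_amount) OVER (PARTITION BY customer_id ORDER BY sale_date) AS cumulative_amount
FROM STG.DAILY_SALES;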
UDF Memory Hygiene
Prefer streaming iterators over full materialization. In Python, release references to large NumPy arrays, call GC between batches for long-lived workers, and cap per-row payload sizes.
Concurrency Governors
Throttle background maintenance during critical BI windows. Set resource caps per service user and separate ETL from interactive BI users with roles and queue priorities.
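If your deployment supports consumer groups (Exasol 7.0 and later; earlier versions use priority groups), a sketch along these lines separates ETL from interactive BI. The group names, weights, and user mappings are assumptions to tune; check the exact clauses against your version's documentation:

-- Hypothetical groups: favor interactive BI over batch ETL
CREATE CONSUMER GROUP BI_INTERACTIVE WITH CPU_WEIGHT = 300;
CREATE CONSUMER GROUP ETL_BATCH WITH CPU_WEIGHT = 100;

-- Route the service users into their groups
ALTER USER BI_SERVICE SET CONSUMER_GROUP = BI_INTERACTIVE;
ALTER USER ETL_SERVICE SET CONSUMER_GROUP = ETL_BATCH;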
Operational Checklists
After Any Node or Network Maintenance
- Verify MTU symmetry and NIC offloads.
- Run a synthetic exchange-heavy benchmark to compare throughput with baseline.
- Check exa_dba_table_distribution for drift; schedule REORGANIZE if needed.
Before Rolling Out a New BI Model
- Confirm distribution keys for new big facts align with existing stars.
- Compute initial statistics and capture canary plan hashes.
- Load-test with expected concurrency; inspect spill behavior.
When Introducing or Updating UDFs
- Pin interpreter and dependency versions; bake immutable images for all nodes.
- Set an explicit SCRIPT_MEMORY_LIMIT; exercise worst-case batch sizes.
- Write UDF logs to a diagnostics table with node_id, batch_id, rss_bytes.
Edge Case Scenarios and How to Handle Them
1) Sudden Performance Collapse After a Small Data Load
Cause: statistics changed cardinality estimates, leading the planner to choose a broadcast or repartition that spills. Fix: recompute statistics for all join partners, compare plan hashes, and force a distribution-preserving join if necessary while you stabilize data models.
2) Only One Node Shows Elevated Temp Writes
Cause: that node hosts the hot partitions due to incomplete rebalancing or a skewed key. Fix: REORGANIZE the table and validate evenness; if workload-specific, adopt a composite distribution key.
3) UDF Works in Dev but Fails in Prod
Cause: prod nodes have a slightly different libc, Python minor version, or kernel flag limiting file descriptors. Fix: standardize base images, validate ulimit -n, and rebuild wheels against the prod ABI; add startup self-tests that import key modules and perform a small vector operation.
4) Intermittent Logins Time Out During Peak
Cause: LDAP/Kerberos query amplification from nested groups plus GC pauses on the directory server. Fix: cache groups in roles, reduce nested depth, and ensure chrony sync; if needed, add a read-only replica KDC/LDAP near the cluster.
5) Queries Slower Only After Backup Windows
Cause: backup jobs saturate the same storage path as temp space. Fix: isolate paths, re-schedule overlap, or throttle backup throughput; validate by correlating temp_db_written spikes with backup logs.
Long-Term Architectural Solutions
Data Distribution Governance
Codify distribution key selection in the modeling standard. Provide a linter that checks candidate tables for key cardinality and alignment with major joins before DDL is accepted into mainline.
Immutable Runtime Images for UDFs
Build OCI images per language with pinned OS and library versions; sign them and roll out atomically to all nodes. Embed a health probe that runs on container start and writes a pass/fail record.
Network SLOs and Continuous Validation
Publish SLOs for node-to-node throughput, latency, and MTU validity. Nightly jobs should transfer fixed-size buffers across all pairs and store metrics to a time series table for anomaly detection.
Workload Segmentation
Separate ETL, ad hoc, and dashboard workloads by roles and resource caps. Provide predictable performance for BI by limiting heavy ELT during business hours and using change-data-capture micro-batches instead of monolithic loads.
Observability as a First-Class Citizen
Continuously export exa_dba_* metrics to your observability stack. Alert on plan hash churn for top N queries, skew thresholds, and rising temp writes per node; add runbooks linking alerts to the scripts shown in this article.
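As a starting point, here is a skew alert sketch built on the same profiling views used earlier; the 2x threshold is an assumption, and the result set can be shipped to the observability stack on a schedule:

-- Statements whose busiest node exchanged more than twice the per-node average
SELECT statement_id,
       MAX(net_bytes_sent) AS max_node_bytes,
       AVG(net_bytes_sent) AS avg_node_bytes
FROM exa_dba_profile_last_detail
GROUP BY statement_id
HAVING MAX(net_bytes_sent) > 2 * AVG(net_bytes_sent)
ORDER BY max_node_bytes DESC;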
Best Practices (Quick Reference)
- Pick high-cardinality, join-aligned distribution keys; avoid low-selectivity business codes.
- Replicate only genuinely small dimensions; switch to distributed when growth warrants.
- Isolate and right-size temp volumes; monitor spill trends.
- Pin and validate UDF runtimes; log resource usage per batch.
- Enforce end-to-end MTU symmetry; verify after any network change.
- Automate statistics refresh; gate releases on plan hash stability.
- Segment workloads and set resource caps to protect BI latencies.
- Continuously export and alert on exa_dba_* metrics with actionable thresholds.
Conclusion
Exasol's MPP, in-memory, columnar engine delivers exceptional analytics performance—but its very strengths amplify architectural missteps. Intermittent slowdowns typically trace back to distribution skew, incomplete rebalancing, spills from unexpected plan shifts, network MTU asymmetry, or brittle UDF environments. Senior practitioners can tame these issues by instrumenting the right system views, governing distribution keys, standardizing UDF runtimes, isolating temp storage, and treating the interconnect as part of the database. With these practices institutionalized, performance becomes consistent, maintenance predictable, and SLAs defensible even as data volumes and user counts grow.
FAQs
1. How do I distinguish planner errors from environmental issues?
Compare plan_hash and operator metrics across runs. Stable plans with rising net_errors or temp writes point to the environment; changing plan hashes suggest statistics or data drift.
2. What's the fastest way to validate node-to-node health after a network change?
Run a synthetic exchange test and inspect net_bytes_sent per node alongside CPU per GB. Also verify MTU end-to-end with a DF ping to catch fragmentation.
3. When should a replicated dimension become distributed?
When its size or access frequency triggers measurable spills or pushes RAM pressure during major joins. Monitor replicated table size and switch to distributed once it crosses your replication budget.
4. How can I make UDFs resilient to OS patching?
Ship UDFs in immutable, pinned images; run a startup self-test; and store a requirements.lock to guarantee deterministic rebuilds across nodes.
5. Why did a small incremental load cause a big regression?
It likely tipped a cardinality threshold that changed the join strategy. Recompute statistics on all join partners and compare plan hashes; consider constraining the planner temporarily while you refactor.