Understanding Redshift Internals
Columnar Storage and MPP Architecture
Redshift is built on a columnar storage model with massively parallel processing (MPP) across compute nodes. Query performance depends heavily on minimizing data movement and maximizing local operations across slices within nodes.
Role of Sort and Distribution Keys
Sort keys define how data is ordered on disk, affecting query predicate filtering. Distribution keys determine how data is split across nodes—crucial for join efficiency and aggregation performance.
CREATE TABLE sales (
    sale_id BIGINT,
    customer_id INT,
    sale_date DATE,
    amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
Common Symptoms of Key Misconfiguration
- Query runtime increases significantly over time
- Frequent disk-based queries due to sort key misalignment
- High skew in slice-level data distribution
- Join-heavy queries spill to disk or require massive reshuffling
Diagnostic Strategy
1. Analyze Table Metadata
Query SVV_TABLE_INFO to identify distribution skew and unsorted data:
SELECT "table", diststyle, skew_rows, unsorted
FROM svv_table_info
WHERE skew_rows > 1.5 OR unsorted > 20;
High unsorted percentages suggest ineffective sort keys.
2. Investigate Query Plans
Run EXPLAIN on slow queries. Look for signs of:
- DS_BCAST_INNER (inner table broadcast to every node)
- DS_DIST_BOTH (both sides of the join redistributed)
- Intermediate spill to disk
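As an illustration, consider joining the sales table from earlier against an assumed customers dimension table (not defined above):

```sql
-- If customers is not distributed on customer_id, the plan will typically
-- show DS_BCAST_INNER or DS_DIST_BOTH; with matching DISTKEYs on both
-- tables it shows DS_DIST_NONE (a fully local join).
EXPLAIN
SELECT c.region, SUM(s.amount)
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region;
```

Reading the join operator in the plan tells you immediately whether the distribution keys are cooperating or fighting the workload.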
3. Check Distribution Skew
Use SVL_QUERY_SUMMARY to find disk-based steps, and SVL_QUERY_REPORT for per-slice detail:
SELECT query, step, rows, workmem, is_diskbased
FROM svl_query_summary
WHERE is_diskbased = 't';
If a few slices in SVL_QUERY_REPORT process significantly more rows or bytes than the rest, re-evaluate the distribution strategy.
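A minimal per-slice check, assuming you already have the query id from the system tables (12345 below is a placeholder):

```sql
-- Per-slice work for a single query; large differences in rows or bytes
-- between slices indicate distribution skew for that query's tables.
SELECT slice, segment, step, rows, bytes
FROM svl_query_report
WHERE query = 12345
ORDER BY bytes DESC;
```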
Architectural Implications
Bad Joins Due to Distribution Mismatch
When tables with mismatched distribution keys are joined, Redshift often broadcasts one side of the join to all nodes—causing significant overhead.
Unsorted Data Hurts Predicate Pushdown
Redshift keeps zone maps (per-block min/max values) that let it skip blocks during scans. If predicates do not match the leading column of the sort key, zone maps prune few blocks and Redshift scans far more data. Over time, without VACUUM or a proper sort key, tables accumulate large unsorted regions and this pruning degrades further.
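Using the sales table from the earlier example (SORTKEY on sale_date), the contrast looks like this:

```sql
-- Prunes most blocks via sale_date zone maps (leading sort key column).
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Scans far more blocks: amount is not part of the sort key,
-- so zone maps provide little pruning for this predicate.
SELECT COUNT(*)
FROM sales
WHERE amount > 1000;
```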
Step-by-Step Remediation
Step 1: Identify Critical Query Patterns
Use Redshift system views, AWS CloudWatch, or third-party monitoring tools (e.g., Periscope) to discover:
- Most frequent join paths
- High-cost WHERE conditions
- Large aggregations by group
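One rough way to surface frequent patterns directly from system views (SVL_QLOG stores only a truncated snippet of each statement, so this grouping is approximate):

```sql
-- Group recent queries by their leading text to spot frequent patterns.
-- elapsed is in microseconds; substring is a truncated statement snippet.
SELECT substring, COUNT(*) AS runs, AVG(elapsed) / 1000000.0 AS avg_seconds
FROM svl_qlog
WHERE starttime > DATEADD(day, -7, GETDATE())
GROUP BY substring
ORDER BY runs DESC
LIMIT 20;
```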
Step 2: Redesign Sort Keys
Align sort keys with query filter columns—especially columns used in range scans (e.g., dates). Use compound keys for temporal workloads and interleaved keys for multi-dimensional filtering.
CREATE TABLE orders_sorted (
    order_id INT,
    customer_id INT,
    order_date DATE
)
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date);
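For comparison, a sketch of an interleaved key on a hypothetical events table, for workloads that filter on several columns with no dominant leading column:

```sql
-- Interleaved keys give equal weight to each listed column, at the cost
-- of more expensive maintenance (VACUUM REINDEX) as data grows.
CREATE TABLE events (
    event_id    BIGINT,
    customer_id INT,
    event_type  VARCHAR(32),
    event_date  DATE
)
DISTKEY (customer_id)
INTERLEAVED SORTKEY (event_date, event_type, customer_id);
```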
Step 3: Adjust Distribution Strategy
- Use DISTSTYLE KEY for frequent joins on shared keys
- Use DISTSTYLE ALL for small dimension tables
- Avoid DISTSTYLE EVEN for join-heavy workloads
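For instance, a small dimension table replicated to every node (dim_region here is a hypothetical example) joins locally with any fact table, regardless of the fact table's DISTKEY:

```sql
-- DISTSTYLE ALL copies the full table to each node, so joins against it
-- never require redistribution. Suitable only for small, slowly changing
-- tables, since every node pays the storage and load cost.
CREATE TABLE dim_region (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;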
Step 4: Vacuum and Analyze
After schema changes, perform:
VACUUM FULL orders_sorted;
ANALYZE orders_sorted;
This reclaims space and updates planner statistics for query optimization.
Long-Term Best Practices
- Continuously monitor svv_table_info for skew and unsorted metrics
- Document distribution and sort key rationale per table
- Reassess keys quarterly based on query pattern evolution
- Use workload management queues (WLM) to isolate heavy queries
- Automate vacuum and analyze on large inserts or ETL loads
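A simple sketch for driving such automation from the same metadata (the thresholds are illustrative, not prescriptive):

```sql
-- Candidate tables for maintenance: heavily unsorted data or stale
-- planner statistics, both tracked in svv_table_info.
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE unsorted > 10 OR stats_off > 10
ORDER BY unsorted DESC;
```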
Conclusion
Suboptimal distribution and sort key design in Amazon Redshift is a silent performance killer in enterprise environments. While Redshift offers fast parallel query execution, its efficiency depends on how well your data model matches workload access patterns. By proactively analyzing metadata, tuning keys, and enforcing vacuum hygiene, teams can reclaim performance and ensure Redshift remains responsive under load.
FAQs
1. How do I choose between compound and interleaved sort keys?
Use compound for consistent filtering on the leading column; interleaved is better for multi-dimensional filtering but comes with vacuuming overhead.
2. What is a good skew ratio threshold?
Skew ratios above 1.5 suggest imbalance. Aim for skew_rows below 1.2 for uniform distribution across slices.
3. Can I change distribution or sort key without recreating a table?
Historically this required creating a new table; newer Redshift releases support ALTER TABLE ... ALTER DISTSTYLE/DISTKEY and ALTER TABLE ... ALTER SORTKEY. The CTAS (Create Table As Select) pattern remains the most flexible way to migrate data when changing several properties at once.
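A CTAS migration sketch, using a hypothetical orders table and the key choices from the earlier section:

```sql
-- Recreate the table with the new keys, then swap names atomically.
CREATE TABLE orders_new
DISTKEY (customer_id)
SORTKEY (order_date)
AS SELECT * FROM orders;

BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
```

Keep orders_old around until the new table is validated, then drop it to reclaim space.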
4. How often should I run VACUUM?
After large DML operations or ETL batches. Automate via scheduled jobs or event triggers on INSERT frequency.
5. How does Redshift Spectrum impact sort/distribution strategy?
Spectrum queries external data, so distribution keys are irrelevant. However, sort keys still help if you COPY data into Redshift for performance-critical queries.