Saurabh Agrawal

Reputation: 1365

Redshift: Aggregating data on a large number of dimensions is slow

I have an Amazon Redshift table with about 400M records and 100 columns: 80 dimensions and 20 metrics.

The table is distributed on one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.

A simple aggregate query:

Select dim1, dim2, ..., dim60, sum(met1), ..., sum(met15)
From my_table
Group by dim1, ..., dim60

is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?

Upvotes: 1

Views: 2334

Answers (2)

AlexYes

Reputation: 4208

1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each), it is faster to group by the dimension keys only. If you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:

with 
groups as (
    select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
    from my_table
    group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)

If you don't want to normalize your table and you like having all the information in a single row, that's fine to keep as is: in a columnar database, unused columns don't slow queries down. But grouping by 80 columns is definitely inefficient, so they have to be "pseudo-normalized" in the query as shown above.

2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by city ID and then join the city attributes, country region and country tables to the city ID of each group.
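A minimal sketch of that idea, assuming hypothetical city_table, region_table and country_table dimension tables keyed by city_id, region_id and country_id (my_table would then only carry city_id):

with 
groups as (
    select city_id, sum(met1) as met1, sum(met2) as met2
    from my_table
    group by 1
)
select *
from groups
join city_table        -- city attributes + region_id
using (city_id)
join region_table      -- region attributes + country_id
using (region_id)
join country_table     -- country attributes
using (country_id)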

3) You can store the combination of dimension IDs, joined with some delimiter like "-", in a separate varchar column and use that as the sort key.
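For illustration only (all names here are made up, and the concatenated key has to be built when the data is loaded, because the sort key is fixed at table creation):

create table my_table_sorted (
    dim_key  varchar(64),   -- concatenated dimension ids, e.g. '12-345-6789'
    dim1_id  int,
    dim2_id  int,
    dim3_id  int,
    met1     decimal(18,2)
)
sortkey (dim_key);

insert into my_table_sorted
select cast(dim1_id as varchar) || '-' || cast(dim2_id as varchar) || '-' || cast(dim3_id as varchar),
       dim1_id, dim2_id, dim3_id, met1
from my_table;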

Upvotes: 1

John Rotenstein

Reputation: 269370

Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.

Normally, optimization is done via the following (see the DDL sketch after the list):

  • DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
  • SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
  • Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
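As a rough illustration only (all names here are hypothetical, not from the question), a table designed along these lines might be declared like this:

create table my_fact_table (
    customer_id  int           encode az64,   -- the most-joined column -> DISTKEY
    event_date   date          encode az64,   -- the most common WHERE filter -> SORTKEY
    dim1_id      int           encode az64,
    met1         decimal(18,2) encode az64
)
distkey (customer_id)
sortkey (event_date);

ANALYZE COMPRESSION my_fact_table; can then be run from time to time to check whether the chosen encodings still fit the data.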

Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).

I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the query nightly and store the results in a table for later querying.
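If you go the nightly route, a sketch (names are illustrative) could be as simple as rebuilding a summary table on a schedule:

-- rebuild the pre-aggregated summary, e.g. from a nightly scheduled job
drop table if exists my_table_summary;

create table my_table_summary as
select dim1, dim2, dim3,              -- only the dimensions actually needed downstream
       sum(met1) as met1,
       sum(met2) as met2
from my_table
group by 1, 2, 3;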

Upvotes: 1
