Ana Isael Morales
Ana Isael Morales

Reputation: 120

Optimize a Delta table used to make queries that use GROUP BY

Working in a Delta table. When it comes to optimising my delta table, I have learned a few things.

Partition by: Is beneficial when the column in which the partition is made is used in a "where" condition. i.e In a previous step. Let's say this is table A

df.write.partitionBy("column_1").format("delta").mode("overwrite").save("path")

I will use the Delta table like this in the future.

%sql
select 
    colum1
    column2
    ...
from TableA
where column1 = "XX"

Bloom Filters: Are beneficial for joins between tables. Imagine column_1 is our id column ,so you will be use it in that way

%sql

CREATE BLOOMFILTER INDEX ON TABLE TableA FOR COLUMNS(column_1)
OPTIMIZE TableA  ZORDER BY (column_1)


select 
    colum1
    column2
    ...
from TableA a
INNER JOIN  TableB b
ON  a.column_1 = b.column_1

But how can I optimise a table where Group by queries are to be used and no optimisation techniques such as Partition By or Bloom Filter are applied ? i.e. the table will be used as follows

%sql
select 
   colum1
   column2
   ...
from TableA
group by (column1, column2)

Thanks in advance

Upvotes: 3

Views: 1372

Answers (2)

Kashyap
Kashyap

Reputation: 17411

In addition to ZORDER as suggested.

Few more things:

  • You can also fine tune the file sizes.
  • Apply compaction (OPTIMIZE) periodically, especially if you're streaming or dealing with small batch sizes. I guess this is implicit if you're using ZORDER as other post suggested.
  • Same page also has many other suggestions including specific ones if you're trying to improve interactive query performance.
  • For interactive queries, also note that the "SQL Warehouse" has it's Spark configuration set up for optimal interactive performance, which will be better than a general purpose cluster.

Upvotes: 2

Alex Ott
Alex Ott

Reputation: 87119

OPTIMIZE ... ZORDER BY (column1, column2) may help colocating related values in smaller number of files, and make querying of data more efficient. (see doc)

Upvotes: 2

Related Questions