Optimize a Delta table used to make queries that use GROUP BY

Question

Working in a Delta table. When it comes to optimising my delta table, I have learned a few things.

Partition by: Is beneficial when the column in which the partition is made is used in a "where" condition. i.e In a previous step. Let's say this is table A

df.write.partitionBy("column_1").format("delta").mode("overwrite").save("path")

I will use the Delta table like this in the future.

%sql
select 
    colum1
    column2
    ...
from TableA
where column1 = "XX"

Bloom Filters: Are beneficial for joins between tables. Imagine column_1 is our id column ,so you will be use it in that way

%sql

CREATE BLOOMFILTER INDEX ON TABLE TableA FOR COLUMNS(column_1)
OPTIMIZE TableA  ZORDER BY (column_1)


select 
    colum1
    column2
    ...
from TableA a
INNER JOIN  TableB b
ON  a.column_1 = b.column_1

But how can I optimise a table where Group by queries are to be used and no optimisation techniques such as Partition By or Bloom Filter are applied ? i.e. the table will be used as follows

%sql
select 
   colum1
   column2
   ...
from TableA
group by (column1, column2)

Thanks in advance

Alex Ott · Accepted Answer

OPTIMIZE ... ZORDER BY (column1, column2) may help colocating related values in smaller number of files, and make querying of data more efficient. (see doc)

Optimize a Delta table used to make queries that use GROUP BY

Answers (2)

Related Questions