Reputation: 120
Working in a Delta table. When it comes to optimising my delta table, I have learned a few things.
Partition by:
Is beneficial when the column in which the partition is made is used in a "where" condition.
i.e In a previous step. Let's say this is table A
df.write.partitionBy("column_1").format("delta").mode("overwrite").save("path")
I will use the Delta table like this in the future.
%sql
select
colum1
column2
...
from TableA
where column1 = "XX"
Bloom Filters:
Are beneficial for joins between tables. Imagine column_1 is our id column ,so you will be use it in that way
%sql
CREATE BLOOMFILTER INDEX ON TABLE TableA FOR COLUMNS(column_1)
OPTIMIZE TableA ZORDER BY (column_1)
select
colum1
column2
...
from TableA a
INNER JOIN TableB b
ON a.column_1 = b.column_1
But how can I optimise a table where Group by queries are to be used and no optimisation techniques such as Partition By or Bloom Filter are applied ? i.e. the table will be used as follows
%sql
select
colum1
column2
...
from TableA
group by (column1, column2)
Thanks in advance
Upvotes: 3
Views: 1372
Reputation: 17411
In addition to ZORDER
as suggested.
Few more things:
OPTIMIZE
) periodically, especially if you're streaming or dealing with small batch sizes. I guess this is implicit if you're using ZORDER
as other post suggested.Upvotes: 2