Jamalan

How to partition a table in Databricks by data-size/row count not by column

I've seen Databricks examples that use the partitionBy method, but partitions are recommended to be around 128 MB. I'd think there was a way to achieve that as closely as possible: take the total size, divide it by 128 MB, and partition into that number of partitions rather than by a dimension.
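Roughly what I have in mind is something like this sketch (assuming a Delta table, where DESCRIBE DETAIL reports sizeInBytes; the table name is just a placeholder):

```python
# Target partition size: 128 MB
target_bytes = 128 * 1024 * 1024

# "my_table" is a placeholder name
df = spark.read.table("my_table")

# Total on-disk size from the Delta table's metadata
size_in_bytes = spark.sql("DESCRIBE DETAIL my_table").collect()[0]["sizeInBytes"]

# Number of partitions = total size / 128 MB, at least 1
num_partitions = max(1, int(size_in_bytes / target_bytes))

# Repartition by count rather than by a column
df = df.repartition(num_partitions)
```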

Any suggestions for how this is achieved would be appreciated.

Upvotes: 2

Views: 761

Answers (1)

The setting spark.sql.files.maxPartitionBytes does indeed affect the maximum size of the partitions when Spark reads data on the cluster. By using this configuration, you can control partitioning based on the size of the data.
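For example, a minimal sketch (the path is a placeholder; 134217728 bytes is 128 MB, which is also the default value):

```python
# Cap the size of each input partition at 128 MB when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")

# "/path/to/data" is a placeholder
df = spark.read.parquet("/path/to/data")

# The number of partitions now follows from the data size
print(df.rdd.getNumPartitions())
```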

Upvotes: 2
