Reputation: 848
In Spark, when we read files that were written using partitionBy or bucketBy, how does Spark identify that they were written that way (partitionBy/bucketBy) so that the read operation becomes more efficient? Can someone please explain? Thanks in advance!
Upvotes: 0
Views: 1481
Reputation: 445
Partitions are identified using the directory structure, e.g. year=2020/month=1/date=2. If a table was created when saving the DataFrame, the partition information can also be retrieved from the table metadata.
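A minimal sketch of both sides, assuming a made-up path (/tmp/events) and made-up column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-discovery").getOrCreate()
import spark.implicits._

val events = Seq((2020, 1, 2, "click"), (2020, 2, 9, "view"))
  .toDF("year", "month", "date", "event")

// Writes /tmp/events/year=2020/month=1/date=2/part-*.parquet and so on.
events.write.partitionBy("year", "month", "date").parquet("/tmp/events")

// On read, Spark infers the partition columns from the directory names,
// so a filter on them prunes whole directories instead of scanning files.
val jan = spark.read.parquet("/tmp/events").filter($"month" === 1)
```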
For bucketing, Spark will not identify on its own that plain files are bucketed (Reference). If tables were created while writing, the table metadata will be used to know whether the data is bucketed.
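A sketch of the bucketed case, reusing the events DataFrame from above (the table name is made up). bucketBy only takes effect together with saveAsTable, because the bucketing spec is stored in the table metadata, not in the files themselves:

```scala
// Writing bucketed data: the bucket spec (8 buckets on "event") goes
// into the catalog entry for events_bucketed.
events.write
  .bucketBy(8, "event")
  .sortBy("event")
  .saveAsTable("events_bucketed")

// Reading through the table name lets Spark see the bucketing metadata;
// reading the underlying files directly with spark.read.parquet would not.
val bucketed = spark.table("events_bucketed")
```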
Upvotes: 0
Reputation: 18098
Two different things, in reality. https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/ has an excellent excerpt from poor little MapR (let's hope HPE makes something of it); reading it will give you the whole context. Excellent read, BTW.
In short:
When partition filters are present, the Catalyst optimizer pushes them down into the scan, which then reads only the directories that match the partition filters, reducing disk I/O. How much query time this saves depends on how selective the partition filters are.
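A small sketch of what that looks like in practice, assuming a hypothetical Parquet dataset at /tmp/sales partitioned by year and month:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning").getOrCreate()
import spark.implicits._

val sales = spark.read.parquet("/tmp/sales")

// Catalyst pushes the filter down as a PartitionFilter: only the
// year=2020/month=1 directory is scanned; the rest are skipped entirely.
// Look for "PartitionFilters: [...]" in the FileScan node of the plan.
sales.filter($"year" === 2020 && $"month" === 1).explain()
```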
Bucketing is another data organization technique that groups data with the same bucket value across a fixed number of “buckets.” This can improve performance in wide transformations and joins by avoiding “shuffles.”
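A sketch of the join case, with hypothetical table names; it assumes both tables were written with the same bucket column and the same bucket count:

```scala
// Both tables assumed written with bucketBy(16, "customer_id").saveAsTable(...)
val orders    = spark.table("orders_bucketed")
val customers = spark.table("customers_bucketed")

// Because both sides are already hash-distributed by customer_id into the
// same number of buckets, the SortMergeJoin plan should show no Exchange
// (shuffle) node on either side.
orders.join(customers, "customer_id").explain()
```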
Upvotes: 3