user7298979

Reputation: 549

repartitioning by multiple columns for Pyspark dataframe

EDIT: adding more context to the question now that I have reread the post:

Let's say I have a PySpark DataFrame and I currently repartition it like this:

dataframe.repartition(200, col_name)

And I write that partitioned dataframe out to a parquet file. When reading the directory, I see that the directory in the warehouse is partitioned the way I want:

/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2

I want to understand how I can partition this in multiple layers, meaning I partition on one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write method?

dataframe.write.mode("overwrite").partitionBy("col_name1","col_name2","col_name3")

Thus creating the directories as such?

/apps/hive/warehouse/db/DATE/col_name1=1
|--------------------------------------->/col_name2=1
|--------------------------------------------------->/col_name3=1

If so, can I use a partitionBy() to write out a max number of files per partition?

Upvotes: 4

Views: 10915

Answers (1)

Ramdev Sharma

Reputation: 1014

Repartition

The repartition function controls how the data is partitioned in memory. If you call repartition(200), the DataFrame will have 200 in-memory partitions.

Physical Partition on file system

The partitionBy function, given a list of columns, controls the directory structure. Physical partitions (directories) are created from each column name and column value. Each directory can contain up to as many files as there are in-memory partitions (the number passed to repartition; 200 is the default for shuffles), provided you have enough data to write.
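To make the directory naming concrete, here is a minimal plain-Python sketch of the Hive-style layout described above. It does not call Spark; `partition_path` is a hypothetical helper written only for illustration, and the base directory is the one from the question.

```python
def partition_path(base_dir, row, partition_cols):
    """Build the Hive-style directory for one row: base/col1=v1/col2=v2/..."""
    # One "name=value" path segment per partition column, in the order given.
    segments = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base_dir.rstrip("/")] + segments)


# A sample row; "value" is a regular data column and does not appear in the path.
row = {"col_name1": 1, "col_name2": 1, "col_name3": 1, "value": 42}

path = partition_path(
    "/apps/hive/warehouse/db/DATE",
    row,
    ["col_name1", "col_name2", "col_name3"],
)
print(path)
# /apps/hive/warehouse/db/DATE/col_name1=1/col_name2=1/col_name3=1
```

Changing the order of the columns changes the nesting order of the directories, which is why the order passed to partitionBy matters.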

Here is an example based on your question.

(dataframe
    .repartition(200)
    .write.mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))  # output path taken from the question

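This gives up to 200 files in each partition directory (one for every in-memory partition that holds rows for that directory), and the directories are nested in the order the columns are given.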
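The file count per directory follows from how the in-memory partitions overlap with the partitionBy values. The sketch below is a toy model in plain Python, not Spark itself: `count_files` is a hypothetical helper, the round-robin and hash assignments only approximate what repartition(200) and repartition(200, cols) do, and the sample data is made up.

```python
import random
from collections import defaultdict


def count_files(rows, num_partitions, partition_cols, assign):
    """Toy model: each in-memory partition writes one file into every
    directory (distinct partitionBy key) for which it holds at least one row."""
    writers = defaultdict(set)  # directory key -> in-memory partitions writing to it
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in partition_cols)
        writers[key].add(assign(index, key) % num_partitions)
    return {key: len(parts) for key, parts in writers.items()}


rng = random.Random(0)  # fixed seed so the made-up data is reproducible
cols = ["col_name1", "col_name2"]
rows = [{col: rng.randint(1, 2) for col in cols} for _ in range(10_000)]

# Approximates repartition(200): rows are spread round-robin,
# with no regard to the partitionBy columns.
spread = count_files(rows, 200, cols, lambda index, key: index)

# Approximates repartition(200, *cols): all rows sharing a key
# land in the same in-memory partition.
by_key = count_files(rows, 200, cols, lambda index, key: hash(key))

print(max(spread.values()))  # never more than 200
print(max(by_key.values()))  # 1
```

Under this model, repartitioning on the same columns that are passed to partitionBy yields a single file per directory, while a plain repartition(200) can yield up to 200.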

Upvotes: 5
