Reputation: 2160
I've been using partitionBy, but I don't quite get why we should use it.
I have a csv record like this :
--------------------------------------
name | age | entranceDate | dropDate |
--------------------------------------
Tom  | 12  | 2019-10-01   | null     |
--------------------------------------
Mary | 15  | 2019-10-01   | null     |
--------------------------------------
What would happen if I use :
String[] partitions = new String[] { "name", "entranceDate" };

df.write()
  .partitionBy(partitions)
  .mode(SaveMode.Append)
  .parquet(parquetPath);
And what if I partition on a column that is null for every row:
String[] partitions = new String[] { "name", "dropDate" };

df.write()
  .partitionBy(partitions)
  .mode(SaveMode.Append)
  .parquet(parquetPath);
Could anyone explain how it works? Thanks.
Upvotes: 0
Views: 1207
Reputation: 1339
DataFrameWriter's partitionBy processes each of the current DataFrame's partitions independently and writes each one split by the unique values of the columns passed.
Let's take your example and assume that we already have two DataFrame partitions and we want to partitionBy() with only one column: name.
Partition 1
--------------------------------------
name | age | entranceDate | dropDate |
--------------------------------------
Tom  | 12  | 2019-10-01   | null     |
--------------------------------------
Mary | 15  | 2019-10-01   | null     |
--------------------------------------
Tom  | 15  | 2019-10-01   | null     |
--------------------------------------
Partition 2
--------------------------------------
name | age | entranceDate | dropDate |
--------------------------------------
Tom  | 12  | 2019-10-01   | null     |
--------------------------------------
Tom  | 15  | 2019-10-01   | null     |
--------------------------------------
Tom  | 15  | 2019-10-01   | null     |
--------------------------------------
In this case three files would be created: two files for the first partition (one for Tom, one for Mary) and one file for the second partition, because it contains only Tom's data.
With several columns, partitionBy() splits by the combination of their values.
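To make that concrete, here is a minimal sketch of the single-column write (reusing df and parquetPath from the question; the part-file names are illustrative):

df.write()
  .partitionBy("name")            // one output directory per distinct name
  .mode(SaveMode.Append)
  .parquet(parquetPath);

// Resulting layout, matching the three files described above:
// parquetPath/
//   name=Tom/
//     part-00000-...parquet     <- Tom's rows from partition 1
//     part-00001-...parquet     <- Tom's rows from partition 2
//   name=Mary/
//     part-00000-...parquet     <- Mary's row from partition 1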
Upvotes: 1
Reputation: 1265
df.write().partitionBy() behaves as follows:
- For every partition of the DataFrame, get the unique values of the columns in the partitionBy argument
- Write the data for every unique combination to a different file
In your example above, let us say your DataFrame has 10 partitions. Assume that partitions 1-5 each have 5 unique combinations of name and entranceDate, and partitions 6-10 each have 10 unique combinations. Each combination of name and entranceDate will be written to a different file, so partitions 1-5 will each be written to 5 files and partitions 6-10 will each be split into 10 files. The total number of files generated by the write operation will be 5*5 + 5*10 = 75. partitionBy looks at the unique values of the combination of columns. From the documentation of the API:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
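As a hedged illustration of that pruning on the read side (assuming a SparkSession named spark and the question's parquetPath), a filter on a partition column lets Spark skip entire directories:

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> toms = spark.read()
    .parquet(parquetPath)
    .filter(col("name").equalTo("Tom"));  // only files under name=Tom/ are read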
If one of the columns in the partitionBy clause has the same value for all rows (as dropDate does in your example, where it is always null), the data will be split based on the values of the other columns in the partitionBy argument.
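For the question's second example, where dropDate is null on every row, Spark still creates a dropDate directory level: null partition values are written under Hive's default partition name. A sketch, again reusing the question's df and parquetPath:

df.write()
  .partitionBy("name", "dropDate")
  .mode(SaveMode.Append)
  .parquet(parquetPath);

// Null partition values land in a default directory, e.g.:
// parquetPath/
//   name=Tom/dropDate=__HIVE_DEFAULT_PARTITION__/part-...parquet
//   name=Mary/dropDate=__HIVE_DEFAULT_PARTITION__/part-...parquet
// so partitioning on an all-null column adds a directory level but no useful split.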
Upvotes: 2