Reputation: 1182
We would like to write our data from a Spark program to S3, in directories that represent our partitions.
Example: the VisitFact DataFrame should be written to S3, partitioned by date, hour and site. Let's say this particular DataFrame has only one day (dt=2017-07-01), one hour (hh=02) and two sites (10, 11), so the directories would be:
.../dt=2017-07-01/hh=02/site=10/
.../dt=2017-07-01/hh=02/site=11/
We need to go over the DataFrame and split it into multiple DataFrames (two in this case).
I would like this to be generic, so the list of fields that defines the partition can change and can have N elements.
Does Spark support this natively? What would be an efficient way to accomplish this? Thanks, Nir
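To illustrate what I mean by splitting, here is a rough sketch (assuming partition columns dt, hh and site, and that df is the VisitFact DataFrame):

import org.apache.spark.sql.functions.col

val partitionCols = Seq("dt", "hh", "site")   // generic: any N columns

// One DataFrame per distinct combination of the partition column values.
val combos = df.select(partitionCols.map(col): _*).distinct().collect()

val splits = combos.map { row =>
  val predicate = partitionCols.zipWithIndex
    .map { case (c, i) => col(c) === row.get(i) }
    .reduce(_ && _)
  (row, df.filter(predicate))
}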
Upvotes: 1
Views: 2138
Reputation: 11479
Agreed with Nir: go with partitions, and choose between hash partitioning or range partitioning.
https://spark.apache.org/docs/latest/sql-programming-guide.html#bucketing-sorting-and-partitioning
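A minimal sketch of the two options on a DataFrame, assuming the df and the site/dt columns from the question, and Spark 2.3+ for repartitionByRange:

import org.apache.spark.sql.functions.col

// Hash partitioning: rows with the same value of site hash to the same partition.
val hashed = df.repartition(8, col("site"))

// Range partitioning: rows are split into sorted, non-overlapping ranges of dt.
val ranged = df.repartitionByRange(8, col("dt"))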
Upvotes: 1
Reputation: 6994
Yes, Spark supports partitioning natively.
You can use something like this:
df.write.partitionBy("columns for partitioning").parquet("path to the top dir")
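For the columns in the question, a minimal sketch would look like this (the bucket names and source path are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VisitFactWriter").getOrCreate()

// Hypothetical source; any DataFrame with dt, hh and site columns works.
val visitFactDf = spark.read.parquet("s3a://my-bucket/staging/visit_fact")

visitFactDf.write
  .mode("overwrite")
  .partitionBy("dt", "hh", "site")   // creates dt=.../hh=.../site=... sub-directories
  .parquet("s3a://my-bucket/warehouse/visit_fact")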
Upvotes: 4