Reputation: 1182
We would like to write our data from a Spark program to S3, in directories that represent our partitions.
Example: the VisitFact DataFrame should be written to S3, partitioned by date, hour and site. Let's say this particular DataFrame has only one day (dt=2017-07-01), one hour (hh=02) and two sites (10, 11), so the directories would be:
.../dt=2017-07-01/hh=02/site=10/
.../dt=2017-07-01/hh=02/site=11/
We need to go over the DataFrame and split it into multiple DataFrames (two in this case).
I would like this to be generic, so the list of fields that defines the partition can change and can have N elements.
Does Spark support this natively? What would be an efficient way to accomplish this? Thanks, Nir
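To illustrate what I mean by splitting, here is a rough sketch (assuming partition columns dt, hh and site, and that df is the VisitFact DataFrame):

import org.apache.spark.sql.functions.col

val partitionCols = Seq("dt", "hh", "site")   // generic: any N columns

// One DataFrame per distinct combination of the partition column values.
val combos = df.select(partitionCols.map(col): _*).distinct().collect()

val splits = combos.map { row =>
  val predicate = partitionCols.zipWithIndex
    .map { case (c, i) => col(c) === row.get(i) }
    .reduce(_ && _)
  (row, df.filter(predicate))
}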
Upvotes: 1
Views: 2138
Reputation: 11479
Agreed with Nir: go with partitions, and choose between hash partitioning or range partitioning.
https://spark.apache.org/docs/latest/sql-programming-guide.html#bucketing-sorting-and-partitioning
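A minimal sketch of the two options on a DataFrame, assuming the df and the site/dt columns from the question, and Spark 2.3+ for repartitionByRange:

import org.apache.spark.sql.functions.col

// Hash partitioning: rows with the same value of site hash to the same partition.
val hashed = df.repartition(8, col("site"))

// Range partitioning: rows are split into sorted, non-overlapping ranges of dt.
val ranged = df.repartitionByRange(8, col("dt"))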
Upvotes: 1
Reputation: 6994
Yes, Spark supports partitioning natively.
You can use something like this:
df.write.partitionBy("columns for partitioning").parquet("path to the top dir")
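For the columns in the question, a minimal sketch would look like this (the bucket names and source path are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VisitFactWriter").getOrCreate()

// Hypothetical source; any DataFrame with dt, hh and site columns works.
val visitFactDf = spark.read.parquet("s3a://my-bucket/staging/visit_fact")

visitFactDf.write
  .mode("overwrite")
  .partitionBy("dt", "hh", "site")   // creates dt=.../hh=.../site=... sub-directories
  .parquet("s3a://my-bucket/warehouse/visit_fact")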
Upvotes: 4