Nir Ben Yaacov

Reputation: 1182

Dividing a DataFrame in Spark into multiple DataFrames and writing to directories

We would like to write our data in a Spark program to S3, in directories that represent our partitions.

Example: a VisitFact DataFrame should be written to S3, partitioned by date, hour, and site. Say this specific DataFrame has only one day (dt=2017-07-01), one hour (hh=02), and two sites (10 and 11); the directories would then be:
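dt=2017-07-01/hh=02/site=10/
dt=2017-07-01/hh=02/site=11/

(relative to whatever top-level output path is chosen; the question does not specify the bucket or prefix)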

We need to go over the DataFrame and split it into multiple DataFrames (two in this case).

I would like this to be generic, so the list of fields that define the partitioning can change and may contain N elements.

Does Spark support this natively? What would be an efficient way to accomplish this?

Thanks, Nir

Upvotes: 1

Views: 2138

Answers (2)

vaquar khan

Reputation: 11479

Agreed with Nir: go with partitioning, and choose between hash partitioning and range partitioning.

https://spark.apache.org/docs/latest/sql-programming-guide.html#bucketing-sorting-and-partitioning
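A minimal sketch of the two strategies, assuming a DataFrame named visitFact with the columns from the question (visitFact is a stand-in name; repartitionByRange requires Spark 2.3+):

import org.apache.spark.sql.functions.col

// Hash partitioning: rows with the same (dt, hh, site) values are hashed
// into the same in-memory partition
val hashed = visitFact.repartition(col("dt"), col("hh"), col("site"))

// Range partitioning (Spark 2.3+): each partition holds a contiguous
// range of the ordering key
val ranged = visitFact.repartitionByRange(col("dt"), col("hh"), col("site"))

Note that these control how rows are distributed across in-memory partitions; the directory layout on S3 comes from partitionBy at write time, as shown in the other answer.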

Upvotes: 1

Avishek Bhattacharya

Reputation: 6994

Yes, Spark supports partitioning natively.

You can use something like this (column names from the question; the output path is a placeholder):

df.write.partitionBy("dt", "hh", "site").parquet("s3://<bucket>/<path-to-top-dir>")
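To keep this generic for N partition fields, as the question asks, note that partitionBy accepts varargs, so the column list can be a runtime value. A minimal sketch, again with column names taken from the question and a placeholder output path:

// The partition columns can be any Seq of N column names
val partitionCols = Seq("dt", "hh", "site")

df.write
  .partitionBy(partitionCols: _*)
  .parquet("s3://<bucket>/<path-to-top-dir>")

Spark then creates one dt=.../hh=.../site=... subdirectory per distinct combination of values, so there is no need to split the DataFrame into multiple DataFrames by hand.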

Upvotes: 4
