Jay

Reputation: 326

Creating a single Parquet file in S3 from a PySpark job

I have written a PySpark program that reads data from Cassandra and writes to AWS S3. Before writing to S3 I have to call repartition(1) or coalesce(1) so that a single file is created; otherwise multiple Parquet files end up in S3. Using repartition(1) or coalesce(1) causes a performance problem, and creating one big partition does not seem like a good option with huge data. What are the ways to create one single file in S3 without compromising on performance?
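A minimal sketch of the approach described above, assuming an existing SparkSession (spark), the spark-cassandra-connector, and placeholder table/bucket names:

    # Read from Cassandra via the spark-cassandra-connector (placeholder names).
    df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_keyspace")
        .load()
    )

    # coalesce(1) avoids a full shuffle but still funnels all data through one task.
    df.coalesce(1).write.mode("overwrite").parquet("s3a://my-bucket/output/")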

Upvotes: 0

Views: 1404

Answers (1)

Steven

Reputation: 15258

coalesce(1) or repartition(1) will put all your data on 1 partition (with a shuffle step when you use repartition, compared to coalesce). In that case, only 1 worker has to write all your data, which is why you have performance issues - you already figured that out.

That is the only way to write 1 file to S3 using Spark itself; currently there is no alternative within Spark alone.

Using Python (or Scala), you can do other things. For example, you write all your files with Spark without changing the number of partitions, and then:

  • you acquire the part files with Python
  • you concatenate them into one file
  • you upload that single file to S3.

This works well for CSV, but not for file types such as Parquet that cannot simply be concatenated byte by byte.
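As a rough sketch of that approach with boto3 (bucket, prefix, and key names are placeholders, and it assumes the CSV parts were written without headers):

    import boto3

    bucket = "my-bucket"              # placeholder bucket name
    prefix = "output/parts/"          # prefix where Spark wrote its part-* files
    merged_key = "output/merged.csv"  # final single object

    s3 = boto3.client("s3")

    # 1. Acquire the part files Spark produced (skip _SUCCESS and other markers).
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    part_keys = sorted(
        obj["Key"] for obj in resp.get("Contents", [])
        if obj["Key"].split("/")[-1].startswith("part-")
    )

    # 2. Concatenate their contents (fine for header-less CSV parts;
    #    Parquet files cannot be merged this way).
    merged = b"".join(
        s3.get_object(Bucket=bucket, Key=key)["Body"].read() for key in part_keys
    )

    # 3. Upload the single combined object back to S3.
    s3.put_object(Bucket=bucket, Key=merged_key, Body=merged)

Note that list_objects_v2 returns at most 1000 keys per call, so a very large output directory would need pagination.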

Upvotes: 2
