Reputation: 326
I have written a PySpark program that reads data from Cassandra and writes it to AWS S3. Before writing to S3 I have to call `repartition(1)` or `coalesce(1)`, as this produces one single file; otherwise multiple Parquet files are created in S3. Using `repartition(1)` or `coalesce(1)` causes performance problems, and I feel that creating one big partition is not a good option with huge data. What are the ways to create one single file in S3 without compromising performance?
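For reference, a minimal sketch of the pattern I am describing (keyspace, table, and bucket names are placeholders, and it assumes the spark-cassandra-connector is on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-to-s3").getOrCreate()

# Read from Cassandra via the spark-cassandra-connector
# (keyspace/table names are placeholders)
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

# Collapsing to a single partition forces one worker to write
# the whole dataset - this is the slow step
df.coalesce(1).write.mode("overwrite").parquet("s3a://my-bucket/output/")
```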
Upvotes: 0
Views: 1404
Reputation: 15258
`coalesce(1)` or `repartition(1)` will put all your data on 1 partition (with a shuffle step when you use `repartition`, compared to `coalesce`). In that case, only 1 worker has to write all your data, which is the reason why you have performance issues - you already figured that out.
That is the only way to get a single file on S3 using Spark alone; currently, Spark offers no other mechanism for it.
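To illustrate the difference between the two calls (a sketch; `df` is assumed to be the DataFrame being written):

```python
# coalesce(1): merges existing partitions without a full shuffle,
# but the entire write still runs on a single executor
df.coalesce(1).write.parquet("s3a://my-bucket/output/")

# repartition(1): performs a full shuffle first, then still
# funnels everything through one writer task
df.repartition(1).write.parquet("s3a://my-bucket/output/")
```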
Using Python (or Scala), you can do other things. For example, you write all your files with Spark without changing the number of partitions, and then merge the resulting part files into a single S3 object yourself (a sketch follows below).
This works well for CSV, but not for non-sequential file types such as Parquet, which cannot simply be concatenated.
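A minimal sketch of that post-write merge using boto3 (bucket and prefix are assumptions; it also assumes fewer than 1000 part files, so pagination is skipped, and that the CSV parts were written without headers):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # placeholder
prefix = "output/"     # prefix Spark wrote its part files into

# List the part files Spark produced
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
part_keys = [obj["Key"] for obj in resp.get("Contents", [])
             if obj["Key"].endswith(".csv")]

# Concatenate their contents in memory (fine for modest sizes;
# for very large data consider S3 multipart upload instead)
merged = b"".join(
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    for key in sorted(part_keys)
)

# Upload the single merged object
s3.put_object(Bucket=bucket, Key="output/merged.csv", Body=merged)
```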
Upvotes: 2