ArtemArapov

Reputation: 11

How to process data in parallel but write results in a single file in Spark

I have a Spark job that does the following:

Let's say I have 10GB of raw data (40 blocks = 40 input partitions), which results in 100MB of processed data. To avoid generating many small files in HDFS, I use coalesce(1) so that the results are written as a single file. But because coalesce(1) does not introduce a shuffle, I get only 1 task, which processes all 10GB in a single thread.
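For illustration, a minimal sketch of that pattern (the paths, schema and filter are hypothetical, not my real job). Since coalesce(1) adds no shuffle, the whole plan collapses into one partition and one task:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("single-file-output").getOrCreate()
    import spark.implicits._

    // ~10GB of raw data -> 40 input partitions
    val processed = spark.read.parquet("hdfs:///data/raw")
      .filter($"value" > 0)   // the expensive per-row processing

    // coalesce(1) adds no shuffle, so the whole lineage collapses to 1 partition
    // and a single task reads and processes all 10GB before writing
    processed.coalesce(1).write.parquet("hdfs:///data/out")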

Is there a way to do the actual intensive processing in 40 parallel tasks, reduce the number of partitions only right before writing to disk, and avoid a data shuffle?

One idea that might work: cache the DataFrame in memory after all the processing (run a count to force Spark to materialize the cache), then apply coalesce(1) and write the DataFrame to disk.
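A minimal sketch of that idea, reusing the hypothetical processed DataFrame from the sketch above. The count() runs the expensive work across all 40 tasks and caches the small result, so the coalesce(1) stage only has to read cached partitions:

    val cached = processed.cache()
    cached.count()   // forces the heavy processing to run in 40 parallel tasks

    // this stage only pulls the ~100MB of cached partitions into one task
    cached.coalesce(1).write.parquet("hdfs:///data/out")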

Upvotes: 1

Views: 1102

Answers (1)

user10586111

Reputation: 21

The documentation clearly warns about this behavior and provides the solution:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

So instead of

coalesce(1)

you can try

repartition(1)
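For example, assuming a hypothetical processed DataFrame that holds the expensive transformations:

    // repartition(1) adds a shuffle, so the upstream processing still runs
    // in 40 parallel tasks; only the write after the shuffle is single-threaded
    processed
      .repartition(1)
      .write.parquet("hdfs:///data/out")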

Upvotes: 2
