Reputation: 11
Context
I'm trying to write a dataframe using PySpark to .csv. In other posts, I've seen users question this, but I need a .csv for business requirements.
What I've Tried
Almost everything. I've tried .repartition(), and I've tried increasing driver memory to 1 TB. I also tried caching my data first and then writing to CSV (which is why the screenshots below show a cache step rather than a write to CSV). Nothing seems to work.
What Happens
The UI does not show that any tasks fail. The job, whether it's writing to CSV or caching first, gets close to completion and then just hangs.
Screenshots
Then, if I drill down into the job...
And if I drill down further...
Finally, here are my settings:
Upvotes: 1
Views: 3008
Reputation: 2011
Since you are using Databricks, can you try using the databricks spark-csv package and let us know?
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Read the source file with the spark-csv data source
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('file.csv')

# Write the processed dataframe back out as CSV
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
Upvotes: 0
Reputation: 5536
You don't need to cache the dataframe; caching only helps when multiple actions are performed on it. If the count isn't required, I would suggest removing it as well. Now, while saving the dataframe, make sure all the executors are being used.
If your dataframe is around 50 GB, make sure you are not creating lots of small files, as that will degrade performance.
You can repartition the data before saving: if your dataframe has a column that divides it evenly, repartition on that column; otherwise, find an optimal number of partitions to repartition to.
# repartition(numPartitions, *cols): the partition count comes first
df.repartition(10, 'col').write.csv('output_path')
Or
# you have 32 executors with 12 cores each, so repartition accordingly
df.repartition(300).write.csv('output_path')
Upvotes: 1