Reputation: 11
Context
I'm trying to write a dataframe using PySpark to .csv. In other posts, I've seen users question this, but I need a .csv for business requirements.
What I've Tried
Almost everything. I've tried .repartition(), and I've tried increasing driver memory to 1 TB. I also tried caching my data first and then writing to CSV (which is why the screenshots below show a cache step rather than a write to CSV). Nothing seems to work.
What Happens
The UI does not show that any tasks fail. The job, whether it's writing to CSV or caching first, gets close to completion and then just hangs.
Screenshots
Then, if I drill down into the job...
And if I drill down further...
Finally, here are my settings:
Upvotes: 1
Views: 3008
Reputation: 2011
Since you are using Databricks, can you try using the databricks spark-csv package and let us know?
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Read the source file with the spark-csv data source
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('file.csv')

# Write the processed dataframe back out as CSV
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
Upvotes: 0
Reputation: 5536
You don't need to cache the dataframe; caching only helps when multiple actions are performed on it. If the count isn't required, I would suggest removing it as well. Now, while saving the dataframe, make sure all the executors are being used.
If your dataframe is around 50 GB, make sure you are not creating lots of small files, as that will degrade performance.
You can repartition the data before saving: if your dataframe has a column that divides it evenly, repartition on that column; otherwise, find an optimal number of partitions to repartition to.
# repartition(numPartitions, *cols): the partition count comes first
df.repartition(10, 'col').write.csv('output_path')
Or
# you have 32 executors with 12 cores each, so repartition accordingly
df.repartition(300).write.csv('output_path')
Upvotes: 1