Reputation: 164
I am trying to write a Spark DataFrame to S3 using PySpark and spark-csv with the following code:
df1.filter(df1['y'] == 2)\
    .withColumnRenamed("x", 'a')\
    .select("a", "b", "c")\
    .write\
    .format('com.databricks.spark.csv')\
    .options(header="true")\
    .options(codec="org.apache.hadoop.io.compress.BZip2Codec")\
    .save('s3://bucket/abc/output')
but I am getting an error saying "output dir already exists". I am sure the output dir did not exist before the job started, and I tried running with a different output dir name, but the write still fails.
If I look at the S3 bucket after the job fails, I see that a few part files were written by Spark, but it fails when it tries to write more. The script runs fine locally; on the AWS cluster I am using 10 Spark executors. Does anyone have any idea what is wrong with this code?
Upvotes: 1
Views: 1805
Reputation: 36
Try the code below; it should fix the problem. Internally, Spark uses the Hadoop API to check whether the output path already exists, and setting the write mode to overwrite tells it to replace any existing output instead of failing. Also check the executor logs; you may find something useful there.
df1.filter(df1['y'] == 2)\
    .withColumnRenamed("x", 'a')\
    .select("a", "b", "c")\
    .write\
    .mode('overwrite')\
    .format('com.databricks.spark.csv')\
    .options(header="true")\
    .options(codec="org.apache.hadoop.io.compress.BZip2Codec")\
    .save('s3://bucket/abc/output')
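Note that in PySpark, mode() takes a string (there is no bare Overwrite identifier as in the Scala API). For reference, a minimal sketch of the save modes DataFrameWriter accepts, assuming the same df1 and output path as above:

# Save modes accepted by DataFrameWriter.mode() in PySpark:
#   'error'     - (default) fail if the output path already exists
#   'overwrite' - delete any existing output at the path, then write
#   'append'    - add new files alongside any existing output
#   'ignore'    - silently skip the write if the path already exists
df1.write\
    .mode('overwrite')\
    .format('com.databricks.spark.csv')\
    .options(header="true")\
    .save('s3://bucket/abc/output')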
Upvotes: 1