Leah210

Reputation: 109

How to save a spark dataframe to csv on HDFS?

Spark version: 1.6.1, I use pyspark API.

DataFrame: df, which has two columns.

I have tried:

1: df.write.format('csv').save("hdfs://path/bdt_sum_vol.csv")
2: df.write.save('hdfs://path/bdt_sum_vol.csv', format='csv', mode='append')
3: df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('hdfs://path/')
4: df.write.format('com.databricks.spark.csv').save('hdfs://path/df.csv')

(None of the above worked; each failed with "Failed to find data source".)

or:

def toCSVLine(data):
    return ','.join(str(d) for d in data)

lines = df.rdd.map(toCSVLine)
lines.saveAsTextFile('hdfs://path/df.csv')  

(Permission denied)
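As an aside, joining fields with a bare str() and ',' produces a malformed line whenever a field value itself contains a comma or a quote. A safer sketch (plain Python, using the stdlib csv module for proper quoting; the pyspark mapping shown in comments is not run here):

```python
import csv
import io

def to_csv_line(row):
    # Serialize one row (any iterable of values) as a single CSV line.
    # The csv module escapes commas and quotes inside fields, which a
    # plain ','.join(str(d) for d in row) does not.
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue().rstrip("\r\n")

# With pyspark (assumed usage, mirroring the snippet above):
# lines = df.rdd.map(to_csv_line)
# lines.saveAsTextFile('hdfs://path/df.csv')
```

The Permission denied error itself is an HDFS ownership issue: a directory created with sudo belongs to another user, so the user running the Spark job cannot write into it.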

Q:

1. How do I solve "Failed to find data source"?

2. I used sudo to make the directory "/path" on HDFS. If I convert the DataFrame to an RDD, how do I write the RDD to CSV on HDFS?

Thanks a lot!

Upvotes: 6

Views: 20753

Answers (2)

MD Rijwan

Reputation: 491

If hdfs://yourpath/ doesn't work, try this; in my case it worked:

df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save("/user/user_name/file_name")

Because of the coalesce(1), a single task writes the whole DataFrame even when it has multiple partitions, so you get one CSV file at the HDFS location instead of one part file per partition.
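Note that in Spark 1.x the com.databricks.spark.csv data source comes from the external spark-csv package, not from Spark itself, and a missing package is the usual cause of "Failed to find data source". A sketch of launching with the package on the classpath (the version and Scala suffix shown are assumptions; match them to your cluster):

```shell
# spark-csv is an external Databricks package in Spark 1.x.
# The 2.10 Scala suffix and 1.5.0 version are assumptions -- adjust to your build.
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```

The same --packages flag works for spark-submit.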

Upvotes: 1

seasee my

Reputation: 99

You could try changing ".save" to ".csv" (note that DataFrameWriter.csv was added in Spark 2.0, so this needs a newer Spark than the 1.6.1 in the question):

df.coalesce(1).write.mode('overwrite').option('header','true').csv('hdfs://path/df.csv')

Upvotes: 2
