Ajg

Reputation: 257

How to write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

I have a very big PySpark DataFrame. I want to perform preprocessing on subsets of it and store them to HDFS, then later read them all back and merge them together. Thanks.

Upvotes: 17

Views: 38813

Answers (1)

rogue-one

Reputation: 11587

  • Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append') ## df is an existing DataFrame object.
    

Some of the supported format options are csv, parquet, json, etc. (see the sketch below for the append workflow from the question).
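For the use case in the question, one approach is to append every preprocessed subset to the same HDFS directory so they all land in one place. A minimal sketch, assuming `df` is the big DataFrame and that `preprocess`, `subset_id`, and the paths are placeholders for your own code:

    # assumption: `subset_id` is a column you can split on and `preprocess`
    # is your own transformation; both are hypothetical placeholders
    num_subsets = 10  # hypothetical number of subsets
    for i in range(num_subsets):
        subset = df.filter(df['subset_id'] == i)    # pick one subset
        processed = preprocess(subset)              # your preprocessing step
        processed.write.save('/target/path/', format='parquet', mode='append')

Because each write uses mode='append' on the same path, the subsets accumulate under one directory.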

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)   # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')
    

The format method takes arguments such as parquet, csv, json, etc.
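If the subsets were appended to a single directory as above, one load already returns them merged. If they were instead written to separate directories, you can read each one and union them. A minimal sketch (the paths are placeholders; unionAll is the Spark 1.6 name, renamed union in Spark 2.x):

    from functools import reduce   # built in on Python 2, import needed on Python 3
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # case 1: everything was appended under one directory
    merged = sqlContext.read.format('parquet').load('/target/path/')

    # case 2: subsets live in separate directories (hypothetical paths)
    paths = ['/target/subset1/', '/target/subset2/']
    parts = [sqlContext.read.parquet(p) for p in paths]
    merged = reduce(lambda a, b: a.unionAll(b), parts)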

Upvotes: 14
