Ajg

Reputation: 257

How to write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

I have a very big PySpark DataFrame. I want to perform preprocessing on subsets of it and store them to HDFS, then later read them all back and merge them together. Thanks.

Upvotes: 17

Views: 38813

Answers (1)

rogue-one

Reputation: 11587

  • Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append') ## df is an existing DataFrame object.
    

Some of the supported format options are csv, parquet, json, etc. (see the sketch below for the append workflow from the question).
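For the use case in the question, one approach is to append every preprocessed subset to the same HDFS directory so they all land in one place. A minimal sketch, assuming `df` is the big DataFrame and that `preprocess`, `subset_id`, and the paths are placeholders for your own code:

    # assumption: `subset_id` is a column you can split on and `preprocess`
    # is your own transformation; both are hypothetical placeholders
    num_subsets = 10  # hypothetical number of subsets
    for i in range(num_subsets):
        subset = df.filter(df['subset_id'] == i)    # pick one subset
        processed = preprocess(subset)              # your preprocessing step
        processed.write.save('/target/path/', format='parquet', mode='append')

Because each write uses mode='append' on the same path, the subsets accumulate under one directory.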

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)   # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')
    

The format method takes arguments such as parquet, csv, json, etc.
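If the subsets were appended to a single directory as above, one load already returns them merged. If they were instead written to separate directories, you can read each one and union them. A minimal sketch (the paths are placeholders; unionAll is the Spark 1.6 name, renamed union in Spark 2.x):

    from functools import reduce   # built in on Python 2, import needed on Python 3
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # case 1: everything was appended under one directory
    merged = sqlContext.read.format('parquet').load('/target/path/')

    # case 2: subsets live in separate directories (hypothetical paths)
    paths = ['/target/subset1/', '/target/subset2/']
    parts = [sqlContext.read.parquet(p) for p in paths]
    merged = reduce(lambda a, b: a.unionAll(b), parts)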

Upvotes: 14
