Ram Narayanan

Reputation: 61

Can I convert a pandas DataFrame to a Spark RDD?

Problem:

a) Read a local file into a pandas DataFrame, say PD_DF
b) Manipulate/massage PD_DF and add columns to the DataFrame
c) Write PD_DF to HDFS using Spark

How do I do it?

Upvotes: 5

Views: 13544

Answers (3)

Erkan Şirin

Reputation: 2095

I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then the Spark DataFrame into a Spark RDD:

sparkDF = sqlContext.createDataFrame(pandasDF)  # pandas DataFrame -> Spark DataFrame
sparkRDD = sparkDF.rdd.map(list)                # Spark DataFrame -> RDD of row value lists
type(sparkRDD)
# pyspark.rdd.PipelinedRDD

Upvotes: 3

sam

Reputation: 1896

Let's say the dataframe is of type pandas.core.frame.DataFrame. Then, in Spark 2.1 (PySpark), I did this:

rdd_data = spark.createDataFrame(dataframe)\
                .rdd

If you want to rename any columns or select only a few columns, do that before calling .rdd, as in the sketch below.
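A minimal sketch of that, assuming a SparkSession named spark and a pandas DataFrame named dataframe already exist (the column names here are hypothetical):

spark_df = spark.createDataFrame(dataframe)

rdd_data = (spark_df
            .withColumnRenamed("old_name", "new_name")  # rename a column first
            .select("new_name", "other_column")         # keep only the columns you need
            .rdd)                                       # then convert to an RDD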

Hope it works for you also.

Upvotes: 4

caring-goat-913

Reputation: 4049

You can use the SQLContext object to invoke the createDataFrame method, which accepts input data that can optionally be a pandas DataFrame object.
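A minimal sketch of that call, assuming a local SparkContext (the variable names and sample data are just illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext()
sqlContext = SQLContext(sc)

# Build a small pandas DataFrame and pass it straight to createDataFrame
pd_df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
spark_df = sqlContext.createDataFrame(pd_df)
spark_rdd = spark_df.rdd  # the RDD is then available via .rdd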

Upvotes: 9
