Reputation: 61
Problem:
a) Read a local file into a pandas DataFrame, say PD_DF. b) Manipulate/massage PD_DF and add columns to the DataFrame. c) Write PD_DF to HDFS using Spark. How do I do it?
Upvotes: 5
Views: 13544
Reputation: 2095
I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then convert the Spark DataFrame into a Spark RDD:
sparkDF = sqlContext.createDataFrame(pandasDF)
sparkRDD = sparkDF.rdd.map(list)
type(sparkRDD)
pyspark.rdd.PipelinedRDD
Upvotes: 3
Reputation: 1896
Let's say the DataFrame is of type pandas.core.frame.DataFrame. Then in Spark 2.1 (PySpark) I did this:
rdd_data = spark.createDataFrame(dataframe)\
.rdd
If you want to rename any columns or select only a few columns, do that before the call to .rdd.
Hope it works for you also.
Upvotes: 4
Reputation: 4049
You can use the SQLContext object to invoke the createDataFrame method, which accepts input data that can optionally be a pandas DataFrame object.
Upvotes: 9