Reputation: 6854
I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].
I converted a DataFrame df to an RDD data:
data = df.rdd
type(data)
## pyspark.rdd.RDD
The new RDD data contains Row objects:
first = data.first()
type(first)
## pyspark.sql.types.Row
data.first()
## Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
I'd like to convert each Row to a list of strings, like the example below:
u'aaa',u'bbb',u'ccc',u'ddd'
Thanks
Upvotes: 8
Views: 38500
Reputation: 477
The accepted answer is out of date. As of Spark 2.0, you must explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this Spark 1.x statement:
data.map(list)
should now be:
data.rdd.map(list)
in Spark 2.x, where data is a DataFrame rather than the RDD from the question. This is related to the accepted answer in this post.
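A minimal end-to-end sketch of the Spark 2.x path (assuming an active SparkSession named spark; the column names and values are illustrative):
df = spark.createDataFrame([(u'aaa', u'bbb', u'ccc', u'ddd')], ['_c0', '_c1', '_c2', '_c3'])
rows = df.rdd.map(list)  # .rdd is needed before map in Spark 2.x
rows.first()
## [u'aaa', u'bbb', u'ccc', u'ddd']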
Upvotes: 0
Reputation: 330383
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap, if you want to flatten the rows as well) with list:
data.map(list)
or if you expect different types:
data.map(lambda row: [str(c) for c in row])
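For the RDD[String] asked about in the question (one string per cell rather than one list per row), flatMap does the flattening. A short sketch, assuming the data RDD from the question:
data.flatMap(list).take(4)
## [u'aaa', u'bbb', u'ccc', u'ddd']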
Upvotes: 14