Reputation: 3191
I'm looking for the most straightforward and idiomatic way to convert a data-frame column into an RDD. Say the column views
contains floats. The following is not what I am looking for:
views = df_filtered.select("views").rdd
for I end up with an RDD[Row]
instead of an RDD[Float]
and thus can't feed it to any of the stat methods in mllib.stat (if I properly understand what's going on):
corr = Statistics.corr(views, likes, method="pearson")
TypeError: float() argument must be a string or a number
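For illustration, a quick look at what that RDD actually contains (the sample values here are made up, not from the real data):
views.take(2)
# e.g. [Row(views=1.0), Row(views=2.0)], i.e. Row objects rather than floats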
In pandas, I would go for .values
to convert the pandas Series into the array of its values, but the RDD .values()
method does not work this way (it is meant for key-value pair RDDs). I finally came to the following solution:
views = df_filtered.select("views").rdd.map(lambda r: r["views"])
but I wonder whether there are more direct solutions.
Upvotes: 2
Views: 8441
Reputation: 4701
You need to use flatMap for this:
>>> newdf = df.select("emp_salary")
>>> newdf.show()
+----------+
|emp_salary|
+----------+
| 50000|
| 10000|
| 810000|
| 5500|
| 5500|
+----------+
>>> rdd = newdf.rdd.flatMap(lambda x: x)
>>> rdd.take(10)
[50000, 10000, 810000, 5500, 5500]
Were you looking for something like this?
If yes, then convert your statement to:
views = df_filtered.select("views").rdd.flatMap(lambda x:x)
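For completeness, a minimal end-to-end sketch of the original correlation, assuming df_filtered also has a likes column as in the question:
from pyspark.mllib.stat import Statistics

# flatten each single-column DataFrame into an RDD of plain floats
views = df_filtered.select("views").rdd.flatMap(lambda x: x)
likes = df_filtered.select("likes").rdd.flatMap(lambda x: x)

# Statistics.corr accepts two RDDs of numbers
corr = Statistics.corr(views, likes, method="pearson")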
Upvotes: 7
Reputation: 1645
Using the DataFrame, the next higher level of abstraction above RDDs, you can do this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pandas as pd
# data creation (for example)
dictionary = {'views': [1.902, 2.34334, 0.3434], 'some_other_column': [1, 2, 3]}
df = pd.DataFrame(data=dictionary)
# create the Spark context and SQL context
sc = SparkContext("local", "First App1")
sql = SQLContext(sc)
# create a single-column Spark DataFrame from the pandas Series
# (.tolist() converts the Series to a plain list of floats)
spark_df = sql.createDataFrame(df['views'].tolist(), FloatType())
spark_rdd = spark_df.rdd
There may be a less cumbersome way to do it, but this might give you some inspiration. Remember that RDDs are immutable.
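Note that spark_df.rdd here still yields an RDD of Row objects, so to actually end up with plain floats you would still flatten it, as a rough sketch (same flatMap idea as the other answer; a Row iterates over its values):
# each Row holds a single float; flattening yields the float itself
float_rdd = spark_rdd.flatMap(lambda row: row)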
Upvotes: -1