zar3bski
zar3bski

Reputation: 3191

PySpark column to RDD of its values

I'm looking for the most straightforward and idiomatic way to convert a data-frame column into a RDD. Say the columns views contains floats. The following is not what I am looking for

views = df_filtered.select("views").rdd

for I end up with a RDD[Row] instead of a RDD[Float] and I thus can't feed it to any stat methods from mllib.stat (if I properly understand what's going on):

corr = Statistics.corr(views, likes, method="pearson")
TypeError: float() argument must be a string or a number

In pandas, I would go for .values() to convert this pandas Series into the array of its values but RDD .values() method does not seem to work this way. I finally came to the following solution

views = df_filtered.select("views").rdd.map(lambda r: r["views"])

but I wonderer whether there are more direct solutions

Upvotes: 2

Views: 8441

Answers (2)

vikrant rana
vikrant rana

Reputation: 4701

you need to use flatMap for this.

>>> newdf=df.select("emp_salary")
>>> newdf.show();
+----------+
|emp_salary|
+----------+
|     50000|
|     10000|
|    810000|
|      5500|
|      5500|
+----------+

>>> rdd=newdf.rdd.flatMap(lambda x:x)
>>> rdd.take(10);
[50000, 10000, 810000, 5500, 5500]

were your looking something like this?

yes than convert your statement as:

views = df_filtered.select("views").rdd.flatMap(lambda x:x)

Upvotes: 7

DataBach
DataBach

Reputation: 1645

Using the next higher abstraction of RDDs 'Dataframe' you can do this.

from pyspark import SparkContext
from pyspark import SQLContext
from pyspark.sql.types import FloatType
import pandas as pd

#data creation (for example)
dictonary = {'views': [1.902, 2.34334, 0.3434], 'some_other_column':[1,2,3]}
df = pd.DataFrame(data=dictonary)

#create spark context
sc = SparkContext("local", "First App1")
sql = SQLContext(sc)

#create spark dataframe from pandas dataframe
spark_df = sql.createDataFrame(df['views'], FloatType())
spark_rdd = spark_df.rdd

There may be a less cumbersome way to do it, but this might give you some inspiration. Remember that RDDs are immutable.

Upvotes: -1

Related Questions