Reputation: 3191
I'm looking for the most straightforward and idiomatic way to convert a data-frame column into an RDD. Say the column views
contains floats. The following is not what I am looking for:
views = df_filtered.select("views").rdd
for I end up with an RDD[Row]
instead of an RDD[Float]
and thus can't feed it to any of the stat methods in mllib.stat (if I properly understand what's going on):
corr = Statistics.corr(views, likes, method="pearson")
TypeError: float() argument must be a string or a number
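For illustration, a quick look at what that RDD actually contains (the sample values here are made up, not from the real data):
views.take(2)
# e.g. [Row(views=1.0), Row(views=2.0)], i.e. Row objects rather than floats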
In pandas, I would go for .values
to convert the pandas Series into the array of its values, but the RDD .values()
method does not work this way (it is meant for key-value pair RDDs). I finally came to the following solution:
views = df_filtered.select("views").rdd.map(lambda r: r["views"])
but I wonder whether there are more direct solutions.
Upvotes: 2
Views: 8441
Reputation: 4701
You need to use flatMap for this:
>>> newdf = df.select("emp_salary")
>>> newdf.show()
+----------+
|emp_salary|
+----------+
| 50000|
| 10000|
| 810000|
| 5500|
| 5500|
+----------+
>>> rdd = newdf.rdd.flatMap(lambda x: x)
>>> rdd.take(10)
[50000, 10000, 810000, 5500, 5500]
Were you looking for something like this?
If yes, then convert your statement to:
views = df_filtered.select("views").rdd.flatMap(lambda x:x)
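For completeness, a minimal end-to-end sketch of the original correlation, assuming df_filtered also has a likes column as in the question:
from pyspark.mllib.stat import Statistics

# flatten each single-column DataFrame into an RDD of plain floats
views = df_filtered.select("views").rdd.flatMap(lambda x: x)
likes = df_filtered.select("likes").rdd.flatMap(lambda x: x)

# Statistics.corr accepts two RDDs of numbers
corr = Statistics.corr(views, likes, method="pearson")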
Upvotes: 7
Reputation: 1645
Using the DataFrame, the next higher level of abstraction above RDDs, you can do this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pandas as pd
# data creation (for example)
dictionary = {'views': [1.902, 2.34334, 0.3434], 'some_other_column': [1, 2, 3]}
df = pd.DataFrame(data=dictionary)
# create the Spark context and SQL context
sc = SparkContext("local", "First App1")
sql = SQLContext(sc)
# create a single-column Spark DataFrame from the pandas Series
# (.tolist() converts the Series to a plain list of floats)
spark_df = sql.createDataFrame(df['views'].tolist(), FloatType())
spark_rdd = spark_df.rdd
There may be a less cumbersome way to do it, but this might give you some inspiration. Remember that RDDs are immutable.
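Note that spark_df.rdd here still yields an RDD of Row objects, so to actually end up with plain floats you would still flatten it, as a rough sketch (same flatMap idea as the other answer; a Row iterates over its values):
# each Row holds a single float; flattening yields the float itself
float_rdd = spark_rdd.flatMap(lambda row: row)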
Upvotes: -1