Gui Kham

Reputation: 23

Convert Row RDD embedded in Dataframe to List

IPYNB

I have a DataFrame user_recommended, as shown in the picture. Each entry of the recommendations column appears to be a PySpark RDD of Row objects, as shown below:

In[10]: user_recommended.recommendations[0]
Out[10]: [Row(item=0, rating=0.005226806737482548),
         Row(item=23, rating=0.0044402251951396465),
         Row(item=4, rating=0.004139747936278582)]

I want to convert the recommendations RDD to a Python list.

Is there a script that can help me convert the recommendations column of the user_recommended DataFrame (note that it is of type pandas.core.frame.DataFrame) to a list?

Upvotes: 0

Views: 204

Answers (2)

ags29

Reputation: 2696

Another, slightly different approach. Its value, in my view, is that it generalises more easily to Rows with more than two elements. It is also worth noting that the data structure you preview in your question is a pandas DataFrame whose column holds lists of PySpark Row objects; it is not in fact an RDD.

import pandas as pd
from pyspark.sql import Row

# recreate the individual entries of the recommendations column;
# each entry is a list of PySpark Row objects
df_recommend = pd.DataFrame({'recommendations': (
    [Row(item=0, rating=0.005226806737482548),
     Row(item=23, rating=0.0044402251951396465),
     Row(item=4, rating=0.004139747936278582)],)})

# now extract the values using the asDict method of each Row,
# giving one [item, rating] list per Row
df_recommend['extracted_values'] = (
    df_recommend['recommendations']
    .apply(lambda recs: [list(x.asDict().values()) for x in recs])
)
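
If a plain Python list is the goal rather than a new column, the same idea can be sketched without pandas or Spark at all. The snippet below is only an illustration: Row here is a hypothetical namedtuple stand-in (so it runs anywhere), and rows_to_lists is a helper name made up for this example. A real pyspark.sql.Row is also a tuple subclass, so list(row) should behave the same way on it.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark;
# a real Row is also a tuple subclass, so list(row) works on it as well.
Row = namedtuple("Row", ["item", "rating"])

# One cell of the recommendations column, as shown in the question.
recs = [Row(item=0, rating=0.005226806737482548),
        Row(item=23, rating=0.0044402251951396465),
        Row(item=4, rating=0.004139747936278582)]


def rows_to_lists(rows):
    """Turn a list of Row-like tuples into a list of plain Python lists."""
    return [list(row) for row in rows]


# Nested list of [item, rating] pairs for this cell.
as_lists = rows_to_lists(recs)

# Fields can also be pulled out by name, e.g. only the item ids.
items = [row.item for row in recs]
```

On the actual frame, something like user_recommended['recommendations'].apply(rows_to_lists).tolist() would then give one nested list per user, assuming every cell holds a list of Rows as in the preview.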

Upvotes: 0

Ankit Kumar Namdeo

Reputation: 1464

I suppose you want to do something like this:

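from pyspark import SparkContext
from pyspark.sql import Row

# reuse the running SparkContext, or create one if sc is not already defined
sc = SparkContext.getOrCreate()

my_rdd = sc.parallelize([Row(item=0, rating=0.005226806737482548),
                         Row(item=23, rating=0.0044402251951396465),
                         Row(item=4, rating=0.004139747936278582)])
my_rdd.collect()   # Python list of Row objects
# map each Row to a plain (item, rating) tuple
new_rdd = my_rdd.map(lambda x: (x[0], x[1]))
new_rdd.collect()  # Python list of tuples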

Upvotes: 1
