Satya

Reputation: 5907

pyspark dataframe to dictionary: columns as keys and list of column values as dict value

Hi, I have a requirement to convert a PySpark DataFrame (or RDD) into a dictionary where the DataFrame's columns are the keys and the lists of column values are the dictionary values.

name amt
a    10
b    20
a    30
b    40
c    50

I want a dictionary like this:

new_dict = {'name':['a','b', 'a', 'b', 'c'], 'amt':[10,20,30,40,50]}

How can I do that? (A solution that avoids collect on the RDD is preferable.) Thanks.

I am also trying myself and will post my attempt in some time.

Upvotes: 1

Views: 20538

Answers (2)

christian_de

Reputation: 31

I had the same problem and solved it like this (Python 3.x, PySpark 2.x):

def columnDict(dataFrame):
    # Collect the rows, transpose them into per-column tuples with zip(*...),
    # and pair them with the column names.
    colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
    # An empty DataFrame collects to [], so fall back to an empty tuple per column.
    return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())

If you want a Python dictionary, you have to collect the data first. If you don't want to collect, you could instead build the dictionary manually from selected and mapped RDDs:

colDict[col_name] = dataFrame.select(col_name).rdd.flatMap(lambda x: x)

Like in this solution: spark - Converting dataframe to list improving performance.
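The transpose trick inside columnDict can be illustrated with plain Python rows (using the sample data from the question, no Spark required):

```python
# Rows as they would come back from dataFrame.collect()
rows = [('a', 10), ('b', 20), ('a', 30), ('b', 40), ('c', 50)]
names = ['name', 'amt']

# zip(*rows) transposes the row tuples into one tuple per column
col_dict = dict(zip(names, zip(*rows)))
print(col_dict)  # {'name': ('a', 'b', 'a', 'b', 'c'), 'amt': (10, 20, 30, 40, 50)}
```

Note that the values come out as tuples rather than lists; wrap them in `list(...)` if the question's exact output shape matters.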

Upvotes: 2

Abdou

Reputation: 13274

Convert your Spark DataFrame into a pandas DataFrame with the .toPandas method, then use pandas' .to_dict method to get your dictionary:

new_dict = spark_df.toPandas().to_dict(orient='list')
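To see what the pandas step alone produces, here is a sketch that assumes `spark_df.toPandas()` returned the question's sample data as a pandas DataFrame:

```python
import pandas as pd

# Stand-in for the result of spark_df.toPandas() on the sample data
pdf = pd.DataFrame({'name': ['a', 'b', 'a', 'b', 'c'],
                    'amt': [10, 20, 30, 40, 50]})

# orient='list' gives exactly the {column: [values...]} shape the question asks for
new_dict = pdf.to_dict(orient='list')
print(new_dict)  # {'name': ['a', 'b', 'a', 'b', 'c'], 'amt': [10, 20, 30, 40, 50]}
```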

Edit:

I am not aware of a way to make a dictionary out of an RDD or Spark DataFrame without collecting the values. You can use the .collectAsMap method of your RDD (note that it expects an RDD of key-value pairs) without needing to convert the data into a DataFrame first:

rdd.collectAsMap()
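Be aware that .collectAsMap behaves like building a plain Python dict from (key, value) pairs, so duplicate keys keep only the last value, which is a different shape than the question asks for. A plain-Python sketch on the sample data:

```python
# What collectAsMap would do with an RDD of (name, amt) pairs:
pairs = [('a', 10), ('b', 20), ('a', 30), ('b', 40), ('c', 50)]
as_map = dict(pairs)  # later pairs overwrite earlier ones for the same key
print(as_map)  # {'a': 30, 'b': 40, 'c': 50}
```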

I hope this helps.

Upvotes: 2
