Reputation: 5907
Hi, I have a requirement to convert a PySpark DataFrame (or RDD) into a dictionary where the DataFrame's columns are the keys and the lists of column values are the dictionary values.
name  amt
a     10
b     20
a     30
b     40
c     50
i want a dictionary like this:
new_dict = {'name':['a','b', 'a', 'b', 'c'], 'amt':[10,20,30,40,50]}
How can I do that? (A solution that avoids calling collect on the RDD is preferable.) Thanks.
I am also trying; I will post my attempt shortly.
Upvotes: 1
Views: 20538
Reputation: 31
I had the same problem and solved it like this (Python 3.x, PySpark 2.x):
def columnDict(dataFrame):
    colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
    return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())
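The heart of this is the zip trick on the collected rows. A minimal pure-Python sketch of what the function computes, with stand-in values for `dataFrame.schema.names` and `dataFrame.collect()` (no Spark session needed):

```python
# Stand-in for dataFrame.schema.names:
names = ['name', 'amt']
# Stand-in for dataFrame.collect(), which yields rows as tuples:
rows = [('a', 10), ('b', 20), ('a', 30), ('b', 40), ('c', 50)]

# zip(*rows) transposes rows into columns, then zip pairs each
# column name with its column of values.
col_dict = dict(zip(names, zip(*rows)))
# col_dict == {'name': ('a', 'b', 'a', 'b', 'c'), 'amt': (10, 20, 30, 40, 50)}
```

Note the values come out as tuples; wrap each in `list()` if you need lists.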
If you want a plain Python dictionary, you have to collect first. If you don't want to collect, you could instead build the dictionary manually from selected and mapped RDDs, one per column:
colDict[col_name] = dataFrame.select(col_name).rdd.flatMap(lambda x: x)
Like in this solution: spark - Converting dataframe to list improving performance.
Upvotes: 2
Reputation: 13274
Convert your Spark DataFrame into a pandas DataFrame with the .toPandas method, then use pandas' .to_dict method to get your dictionary:
new_dict = spark_df.toPandas().to_dict(orient='list')
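A runnable sketch of the pandas half of that line, using a pandas DataFrame built directly from the question's data as a stand-in for the result of `spark_df.toPandas()`:

```python
import pandas as pd

# Stand-in for spark_df.toPandas(): the question's data as a pandas DataFrame.
pdf = pd.DataFrame({'name': ['a', 'b', 'a', 'b', 'c'],
                    'amt': [10, 20, 30, 40, 50]})

# orient='list' maps each column name to a list of its values.
new_dict = pdf.to_dict(orient='list')
# new_dict == {'name': ['a', 'b', 'a', 'b', 'c'], 'amt': [10, 20, 30, 40, 50]}
```

Keep in mind .toPandas itself collects the whole DataFrame to the driver, so this only works for data that fits in driver memory.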
I am not aware of a way to make a dictionary out of an RDD or Spark DataFrame without collecting the values. You can use the .collectAsMap method of your RDD without needing to convert the data to a DataFrame first:
rdd.collectAsMap()
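Note that collectAsMap applies to pair RDDs of (key, value) tuples, and when a key appears more than once only one of its values survives in the resulting map, so it would not preserve the lists the question asks for. A plain-Python analogue of that behaviour, using the question's data as (name, amt) pairs:

```python
# Pairs as they might sit in a pair RDD; collectAsMap on such an RDD
# returns an ordinary dict, so building a dict from the pairs directly
# shows the same key-collision behaviour.
pairs = [('a', 10), ('b', 20), ('a', 30), ('b', 40), ('c', 50)]
as_map = dict(pairs)
# as_map == {'a': 30, 'b': 40, 'c': 50} -- earlier values for 'a' and 'b' are lost
```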
I hope this helps.
Upvotes: 2