How can I convert DataFrame columns Column1:Column2 (key:value) into a dictionary in PySpark?

I have a DataFrame in which every value of Atr1 is distinct, along with some other attributes. I want to generate a dictionary from it, using each (unique) value of Atr1 as a key and the corresponding value of Atr2 as its value.

Here is the DataFrame:

+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
|  'C' |  'B' |  21  |  'H' |
+------+------+------+------+
|  'D' |  'C' |  21  |  'J' |
+------+------+------+------+
|  'E' |  'B' |  21  |  'K' |
+------+------+------+------+
|  'A' |  'D' |  24  |  'I' |
+------+------+------+------+

I want to get a dictionary like this:

Dict -> {'C': 'B', 'D': 'C', 'E': 'B', 'A': 'D'}

How could I do it?

Upvotes: 1

Answers (4)

MaxU - stand with Ukraine

Reputation: 210972

Pandas solution:

df.select('Atr1', 'Atr2').toPandas().set_index('Atr1')['Atr2'].to_dict()

NOTE: @mtoto's solution is more elegant, faster, and needs fewer resources.

Upvotes: 0

mtoto

Reputation: 24198

You can just use a simple collectAsMap():

df.select("Atr1", "Atr2").rdd.collectAsMap()
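Conceptually, the result is the same as building a dict from the (Atr1, Atr2) pairs. Here is a minimal plain-Python sketch of that behaviour, with no Spark involved: the `rows` list is a stand-in for the pairs the DataFrame would yield, and `pairs_to_dict` is a hypothetical helper used only for illustration. It also shows why the "unique Atr1" assumption from the question matters.

```python
# Stand-in for the (Atr1, Atr2) pairs from the question's DataFrame.
# Assumption: these are plain tuples, not real Spark Row objects.
rows = [("C", "B"), ("D", "C"), ("E", "B"), ("A", "D")]


def pairs_to_dict(pairs):
    """Build a dict from (key, value) pairs; a later pair overwrites an earlier one."""
    return dict(pairs)


result = pairs_to_dict(rows)
print(result)

# If a key were repeated, only the last value for it would be kept.
with_duplicate = pairs_to_dict(rows + [("C", "Z")])
print(with_duplicate["C"])
```

If Atr1 might contain duplicates, it is probably worth deduplicating or aggregating first, since otherwise values are silently dropped.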

Upvotes: 9

zipa

Reputation: 27889

You can use something like this:

# Collect each column into a Python list, then pair the two lists up.
keys = df.select('Atr1').rdd.flatMap(lambda x: x).collect()
values = df.select('Atr2').rdd.flatMap(lambda x: x).collect()
result = dict(zip(keys, values))

Upvotes: 1

Alfons Schuck

Reputation: 56

What about using df.to_dict()?

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html

import pandas as pd
df = pd.DataFrame({'A1': ['C', 'D', 'E', 'A'], 'A2': ['B', 'C', 'B', 'D']})

   A1 A2
0  C  B
1  D  C
2  E  B
3  A  D

df = df.set_index('A1')
result = df.to_dict()['A2']

results in

result = {'C': 'B', 'D': 'C', 'E': 'B', 'A': 'D'}

Upvotes: 0
