Cards14

Reputation: 99

Convert two columns in pyspark dataframe into one python dictionary

I have a pyspark dataframe in which I want to use two of its columns to output a dictionary.

input pyspark dataframe:

col1|col2|col3
v   |  3 | a
d   |  2 | b
q   |  9 | g

output:

dict = {'v': 3, 'd': 2, 'q': 9}

How should I do this efficiently?

Upvotes: 2

Views: 4136

Answers (3)

Jomonsugi

Reputation: 1279

Given your example, after selecting the applicable columns and converting to an RDD, collectAsMap will produce the desired dictionary without any additional steps:

df.select('col1', 'col2').rdd.collectAsMap()
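
For illustration, a minimal end-to-end sketch with the sample data from the question (this assumes an active SparkSession named spark):

df = spark.createDataFrame(
    [('v', 3, 'a'), ('d', 2, 'b'), ('q', 9, 'g')],
    ['col1', 'col2', 'col3'])

# Each two-column Row acts as a (key, value) pair for collectAsMap
result = df.select('col1', 'col2').rdd.collectAsMap()
# result -> {'v': 3, 'd': 2, 'q': 9}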

Upvotes: 2

thePurplePython

Reputation: 2767

A few different options here depending on the format needed; check these out. I am using the structured API. If you need to persist the result, either save it as a JSON dict or preserve the schema with Parquet.

from pyspark.sql.functions import to_json, create_map, col

df = spark.createDataFrame(
    [('v', 3, 'a'),
     ('d', 2, 'b'),
     ('q', 9, 'g')],
    ["c1", "c2", "c3"])

mapDF = df.select(create_map(col("c1"), col("c2")).alias("mapper"))
mapDF.show(3)

+--------+
|  mapper|
+--------+
|[v -> 3]|
|[d -> 2]|
|[q -> 9]|
+--------+
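
If a single Python dict like the one in the question is the goal, the per-row maps can then be merged on the driver; a minimal sketch:

# Collected MapType values come back as Python dicts
merged = {}
for row in mapDF.collect():
    merged.update(row["mapper"])
# merged -> {'v': 3, 'd': 2, 'q': 9}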

dictDF = df.select(to_json(create_map(col("c1"), col("c2"))).alias("dict"))
dictDF.show()

+-------+
|   dict|
+-------+
|{"v":3}|
|{"d":2}|
|{"q":9}|
+-------+

keyValueDF = df.selectExpr("(c1, c2) as keyValueDict").select(to_json(col("keyValueDict")).alias("keyValueDict"))
keyValueDF.show()

+-----------------+
|     keyValueDict|
+-----------------+
|{"c1":"v","c2":3}|
|{"c1":"d","c2":2}|
|{"c1":"q","c2":9}|
+-----------------+
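
To persist, as noted above, either write the JSON form or keep the schema with Parquet; a quick sketch (the output paths below are hypothetical placeholders):

# Save the JSON-string rows
dictDF.write.mode("overwrite").json("/tmp/dict_json")

# Or preserve the full schema with parquet
df.write.mode("overwrite").parquet("/tmp/df_parquet")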

Upvotes: 1

IWHKYB

Reputation: 491

I believe you can achieve this by converting the DF (with only the two columns you want) to an RDD:

data_rdd = data.select(['col1', 'col2']).rdd

Create an RDD of key-value pairs from the two columns using the rdd.map function:

kp_rdd = data_rdd.map(lambda row: (row[0], row[1]))

and then collect as map:

result = kp_rdd.collectAsMap()  # avoid shadowing the built-in dict

That's the main idea; sorry, I don't have an instance of PySpark running right now to test it.
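
For completeness, a self-contained sketch of the steps above with the question's sample data (assuming an active SparkSession named spark):

data = spark.createDataFrame(
    [('v', 3, 'a'), ('d', 2, 'b'), ('q', 9, 'g')],
    ['col1', 'col2', 'col3'])

data_rdd = data.select(['col1', 'col2']).rdd
kp_rdd = data_rdd.map(lambda row: (row[0], row[1]))  # (key, value) pairs
result = kp_rdd.collectAsMap()
# result -> {'v': 3, 'd': 2, 'q': 9}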

Upvotes: 4
