dragonachu

Reputation: 551

Pyspark - how to group by and create a key value pair column

I have data similar to the below:

Col1,col2,col3
a,1,#
b,2,$
c,3,%

I need to create a new column with col2 as the key and col3 as the value, similar to the below:

Col1,col2,col3,col4
a,1,#,{1:#}
b,2,$,{2:$}
c,3,%,{3:%}

How can I achieve this using PySpark?

Upvotes: 0

Views: 790

Answers (1)

mck

Reputation: 42422

Try format_string:

import pyspark.sql.functions as F

df2 = df.withColumn('col4', F.format_string('{%d:%s}', 'col2', 'col3'))

df2.show()
+----+----+----+-----+
|Col1|col2|col3| col4|
+----+----+----+-----+
|   a|   1|   #|{1:#}|
|   b|   2|   $|{2:$}|
|   c|   3|   %|{3:%}|
+----+----+----+-----+

If you want a key-value relationship, maps might be more appropriate:

df2 = df.withColumn('col4', F.create_map('col2', 'col3'))

df2.show()
+----+----+----+--------+
|Col1|col2|col3|    col4|
+----+----+----+--------+
|   a|   1|   #|[1 -> #]|
|   b|   2|   $|[2 -> $]|
|   c|   3|   %|[3 -> %]|
+----+----+----+--------+

You can also convert the map to a JSON string, similar to your expected output:

df2 = df.withColumn('col4', F.to_json(F.create_map('col2', 'col3')))

df2.show()
+----+----+----+---------+
|Col1|col2|col3|     col4|
+----+----+----+---------+
|   a|   1|   #|{"1":"#"}|
|   b|   2|   $|{"2":"$"}|
|   c|   3|   %|{"3":"%"}|
+----+----+----+---------+

Upvotes: 1
