Reputation: 11090
I have a column containing multiple key-value pairs as a string.
Ex:
rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;y2_z=N2N;y2_isp=Bell DSL Internet;y2_org=Bell DSL Internet
I need to extract the values corresponding to the keys y2_co,y2_r,y2_z and y2_org.
from pyspark.sql.functions import split, col

df.withColumn("co", split(split(col("_c1"), ";").getItem(2), "=").getItem(1)) \
  .withColumn("r", split(split(col("_c1"), ";").getItem(3), "=").getItem(1)) \
  .withColumn("z", split(split(col("_c1"), ";").getItem(5), "=").getItem(1)) \
  .withColumn("org", split(split(col("_c1"), ";").getItem(7), "=").getItem(1))
The above works for this one string, but the number of key-value pairs is not constant. How do I tweak the statement to handle a variable number of key-value pairs?
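To see why fixed positional indices break, here is a plain-Python sketch (illustrative only, with hypothetical sample rows): when a row has fewer or reordered pairs, the same index points at a different key.

```python
# Two rows with different numbers of key=value pairs:
row1 = "rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;y2_z=N2N"
row2 = "y2_co=US;y2_r=NY;y2_z=10001"

# Positional extraction: the pair at index 2 of row1 is y2_co...
print(row1.split(";")[2])  # -> y2_co=CA
# ...but the same index in row2 lands on a different key entirely.
print(row2.split(";")[2])  # -> y2_z=10001
```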
Upvotes: 0
Views: 1687
Reputation: 14008
Use str_to_map; see below:
from pyspark.sql import functions as F
keys = ["y2_co", "y2_r", "y2_z", "y2_org"]
df.withColumn('m', F.expr("str_to_map(_c1,';','=')")) \
.selectExpr("*", *[ f"m['{k}'] as `{k}`" for k in keys ]) \
.show()
+--------------------+--------------------+-----+----+----+-----------------+
| _c1| m|y2_co|y2_r|y2_z| y2_org|
+--------------------+--------------------+-----+----+----+-----------------+
|rb=99;cs_y1=0;y2_...|[rb -> 99, cs_y1 ...| CA| ON| N2N|Bell DSL Internet|
+--------------------+--------------------+-----+----+----+-----------------+
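For clarity, the parsing that Spark's str_to_map(_c1, ';', '=') performs is equivalent to this plain-Python sketch (the helper name and sample string are just for illustration); looking keys up in the resulting map is what makes the approach independent of how many pairs each row contains.

```python
def str_to_map(s, pair_delim=";", kv_delim="="):
    # Split the string into pairs, then each pair into key and value,
    # mirroring Spark SQL's str_to_map(expr, ';', '=').
    return dict(pair.split(kv_delim, 1) for pair in s.split(pair_delim))

sample = ("rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;"
          "y2_z=N2N;y2_isp=Bell DSL Internet;y2_org=Bell DSL Internet")
m = str_to_map(sample)

keys = ["y2_co", "y2_r", "y2_z", "y2_org"]
print([m[k] for k in keys])  # -> ['CA', 'ON', 'N2N', 'Bell DSL Internet']
```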
Upvotes: 1