Reputation: 11090
I have a column containing multiple key-value pairs as a string.
Ex:
rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;y2_z=N2N;y2_isp=Bell DSL Internet;y2_org=Bell DSL Internet
I need to extract the values corresponding to the keys y2_co,y2_r,y2_z and y2_org.
from pyspark.sql.functions import split, col

df.withColumn("co", split(split(col("_c1"), ";").getItem(2), "=").getItem(1)) \
  .withColumn("r", split(split(col("_c1"), ";").getItem(3), "=").getItem(1)) \
  .withColumn("z", split(split(col("_c1"), ";").getItem(5), "=").getItem(1)) \
  .withColumn("org", split(split(col("_c1"), ";").getItem(7), "=").getItem(1))
The above works for this one string, but the number of key-value pairs is not constant. How do I tweak the statement to handle a variable number of key-value pairs?
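To see why fixed positional indices break, here is a plain-Python sketch (illustrative only, with hypothetical sample rows): when a row has fewer or reordered pairs, the same index points at a different key.

```python
# Two rows with different numbers of key=value pairs:
row1 = "rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;y2_z=N2N"
row2 = "y2_co=US;y2_r=NY;y2_z=10001"

# Positional extraction: the pair at index 2 of row1 is y2_co...
print(row1.split(";")[2])  # -> y2_co=CA
# ...but the same index in row2 lands on a different key entirely.
print(row2.split(";")[2])  # -> y2_z=10001
```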
Upvotes: 0
Views: 1687
Reputation: 14008
Use str_to_map; see below:
from pyspark.sql import functions as F
keys = ["y2_co", "y2_r", "y2_z", "y2_org"]
df.withColumn('m', F.expr("str_to_map(_c1,';','=')")) \
.selectExpr("*", *[ f"m['{k}'] as `{k}`" for k in keys ]) \
.show()
+--------------------+--------------------+-----+----+----+-----------------+
| _c1| m|y2_co|y2_r|y2_z| y2_org|
+--------------------+--------------------+-----+----+----+-----------------+
|rb=99;cs_y1=0;y2_...|[rb -> 99, cs_y1 ...| CA| ON| N2N|Bell DSL Internet|
+--------------------+--------------------+-----+----+----+-----------------+
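For clarity, the parsing that Spark's str_to_map(_c1, ';', '=') performs is equivalent to this plain-Python sketch (the helper name and sample string are just for illustration); looking keys up in the resulting map is what makes the approach independent of how many pairs each row contains.

```python
def str_to_map(s, pair_delim=";", kv_delim="="):
    # Split the string into pairs, then each pair into key and value,
    # mirroring Spark SQL's str_to_map(expr, ';', '=').
    return dict(pair.split(kv_delim, 1) for pair in s.split(pair_delim))

sample = ("rb=99;cs_y1=0;y2_co=CA;y2_r=ON;y2_ct=Kitchener;"
          "y2_z=N2N;y2_isp=Bell DSL Internet;y2_org=Bell DSL Internet")
m = str_to_map(sample)

keys = ["y2_co", "y2_r", "y2_z", "y2_org"]
print([m[k] for k in keys])  # -> ['CA', 'ON', 'N2N', 'Bell DSL Internet']
```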
Upvotes: 1