Reputation: 1331
I have a PySpark dataframe like this:
| A | B |
|---|---|
| 1 | abc_value |
| 2 | abc_value |
| 3 | some_other_value |
| 4 | anything_else |
I have a mapping dictionary:
d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}
I need to create a new column in my original DataFrame that looks like this:
| A | B | C |
|---|---|---|
| 1 | abc_value | X |
| 2 | abc_value | X |
| 3 | some_other_value | Y |
| 4 | anything_else | Z |
I tried building the mapping like this:
from itertools import chain
import pyspark.sql.functions as f

mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
and then applying it with withColumn. However, it does exact matching, whereas I need partial (regex) matching, as you can see.
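Concretely, this is roughly how I apply it (a sketch of my attempt; with my data every row of C comes out null, because no value in B equals a key exactly):
# Exact lookup of B in the map; partial matches are not found
df = df.withColumn('C', mapping_expr[f.col('B')])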
How to accomplish this, please?
Upvotes: 1
Views: 834
Reputation: 9277
I'm afraid PySpark has no built-in function that extracts substrings according to a given dictionary, so you'll probably need to resort to a workaround.
In this case, you can first build a search pattern that includes all the dictionary keys to be matched:
keys = list(d.keys())
keys_expr = '|'.join(keys)
keys_expr
# 'abc|some_other|anything'
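Side note (not needed for these particular keys): if a key could contain regex metacharacters, escape each one before joining so the pattern matches the keys literally, e.g.:
import re
keys_expr = '|'.join(re.escape(k) for k in d)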
Then you can use regexp_extract to pull out the first key from keys_expr that occurs in column B, if any (that's the reason for the | operator in the pattern).
Finally, you can use the dictionary d to replace the extracted keys with their values in the new column.
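As a quick sanity check of what regexp_extract returns with this pattern (my own example, assuming an active SparkSession named spark):
import pyspark.sql.functions as F

# Group 0 is the whole match, i.e. the first alternative found in the string
m = spark.range(1).select(
    F.regexp_extract(F.lit('some_other_value'), keys_expr, 0).alias('m')
).first()['m']
# m == 'some_other'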
import pyspark.sql.functions as F

# Extract the first matching key into C, then map it to its value via d
df = df \
    .withColumn('C', F.regexp_extract('B', keys_expr, 0)) \
    .replace(d, subset=['C'])

df.show()
+---+----------------+---+
| A| B| C|
+---+----------------+---+
| 1| abc_value| X|
| 2| abc_value| X|
| 3|some_other_value| Y|
| 4| anything_else| Z|
+---+----------------+---+
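For completeness, here is the whole thing as a self-contained script (a sketch; the SparkSession setup and sample data are my reconstruction of your example):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

d = {"abc": "X", "some_other": "Y", "anything": "Z"}

df = spark.createDataFrame(
    [(1, 'abc_value'), (2, 'abc_value'),
     (3, 'some_other_value'), (4, 'anything_else')],
    ['A', 'B'],
)

# Pattern alternating over the dictionary keys
keys_expr = '|'.join(d.keys())

df = (df
      .withColumn('C', F.regexp_extract('B', keys_expr, 0))
      .replace(d, subset=['C']))
df.show()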
Upvotes: 2