Reputation: 1331
I have a PySpark dataframe like this:
| A | B |
|---|---|
| 1 | abc_value |
| 2 | abc_value |
| 3 | some_other_value |
| 4 | anything_else |
I have a mapping dictionary:
d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}
I need to create a new column in my original DataFrame that looks like this:
| A | B | C |
|---|---|---|
| 1 | abc_value | X |
| 2 | abc_value | X |
| 3 | some_other_value | Y |
| 4 | anything_else | Z |
I tried building the mapping like this:
from itertools import chain
import pyspark.sql.functions as f

mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
and then applying it with withColumn. However, it does exact matching, whereas I need partial (regex) matching, as you can see.
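Concretely, this is roughly how I apply it (a sketch of my attempt; with my data every row of C comes out null, because no value in B equals a key exactly):
# Exact lookup of B in the map; partial matches are not found
df = df.withColumn('C', mapping_expr[f.col('B')])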
How to accomplish this, please?
Upvotes: 1
Views: 834
Reputation: 9277
I'm afraid PySpark has no built-in function that extracts substrings according to a given dictionary, so you'll probably need to resort to a workaround.
In this case, you can first build a search pattern that includes all the dictionary keys to be matched:
keys = list(d.keys())
keys_expr = '|'.join(keys)
keys_expr
# 'abc|some_other|anything'
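Side note (not needed for these particular keys): if a key could contain regex metacharacters, escape each one before joining so the pattern matches the keys literally, e.g.:
import re
keys_expr = '|'.join(re.escape(k) for k in d)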
Then you can use regexp_extract to pull out the first key from keys_expr that occurs in column B, if any (that's the reason for the | operator in the pattern).
Finally, you can use the dictionary d to replace the extracted keys with their values in the new column.
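As a quick sanity check of what regexp_extract returns with this pattern (my own example, assuming an active SparkSession named spark):
import pyspark.sql.functions as F

# Group 0 is the whole match, i.e. the first alternative found in the string
m = spark.range(1).select(
    F.regexp_extract(F.lit('some_other_value'), keys_expr, 0).alias('m')
).first()['m']
# m == 'some_other'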
import pyspark.sql.functions as F

# Extract the first matching key into C, then map it to its value via d
df = df \
    .withColumn('C', F.regexp_extract('B', keys_expr, 0)) \
    .replace(d, subset=['C'])

df.show()
+---+----------------+---+
| A| B| C|
+---+----------------+---+
| 1| abc_value| X|
| 2| abc_value| X|
| 3|some_other_value| Y|
| 4| anything_else| Z|
+---+----------------+---+
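For completeness, here is the whole thing as a self-contained script (a sketch; the SparkSession setup and sample data are my reconstruction of your example):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

d = {"abc": "X", "some_other": "Y", "anything": "Z"}

df = spark.createDataFrame(
    [(1, 'abc_value'), (2, 'abc_value'),
     (3, 'some_other_value'), (4, 'anything_else')],
    ['A', 'B'],
)

# Pattern alternating over the dictionary keys
keys_expr = '|'.join(d.keys())

df = (df
      .withColumn('C', F.regexp_extract('B', keys_expr, 0))
      .replace(d, subset=['C']))
df.show()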
Upvotes: 2