A2N15
A2N15

Reputation: 605

replace key value from dictionary

Below is my DF:

deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'),  ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()

Output:

+-----------+-------------+
|device_type|         City|
+-----------+-------------+
|     TABLET|DAF ST PAQ BD|
|      PHONE|       AVOTHA|
|         PC|   STPA CEDEX|
|      OTHER|       AV DAF|
|       null|         null|
+-----------+-------------+

The aim is to replace key/value, solution from Pyspark: Replacing value in a column by searching a dictionary

tests = df.na.replace(deviceDict, 1)

Result:

+-----------+-------------+
|device_type|         City|
+-----------+-------------+
|     MOBILE|DAF ST PAQ BD|
|     MOBILE|       AVOTHA|
|    Desktop|   STPA CEDEX|
|      OTHER|       AV DAF|
|       null|         null|
+-----------+-------------+

It worked for device_type but I wasn't able to change the city (even when using subset)

Expected output:

+-----------+------------------------+
|device_type|                    City|
+-----------+------------------------+
|     MOBILE| DAF SAINT PAQ BOULEVARD|
|     MOBILE|                  AVOTHA|
|    Desktop|                    STPA|
|      OTHER|              AVENUE DAF|
|       null|                    null|
+-----------+------------------------+

Upvotes: 2

Views: 148

Answers (1)

blackbishop
blackbishop

Reputation: 32670

The replacement doesn't occur for the column City because you're trying to do some partial replacement in the column values. Whereas function DataFrame.replace uses the entire value as a mapping.

To achieve what you want for column City, you can use multiple nested regexp_replace expressions that you can dynamically generate using Python functools.reduce for example:

from functools import reduce
import pyspark.sql.functions as F

m = list(deviceDict.items())

df1 = df.na.replace(deviceDict, 1).withColumn(
    "City",
    reduce(
        lambda acc, x: F.regexp_replace(acc, rf"\b{x[0]}\b", x[1]),
        m[1:],
        F.regexp_replace(F.col("City"), rf"\b{m[0][0]}\b", m[0][1]),
    )
)

df1.show(truncate=False)
#+-----------+-----------------------+
#|device_type|City                   |
#+-----------+-----------------------+
#|MOBILE     |DAF SAINT PAQ BOULEVARD|
#|MOBILE     |AVOTHA                 |
#|Desktop    |STPA                   |
#|OTHER      |AVENUE DAF             |
#|null       |null                   |
#+-----------+-----------------------+

Upvotes: 2

Related Questions