user3046211

Reputation: 696

Replacing column values in pyspark by iterating through list

I have a pyspark data frame as

+---+----+-----+----+
| ID|colA| colB|colC|
+---+----+-----+----+
|ID1|   3| 5.85|  LB|
|ID2|   4|12.67|  RF|
|ID3|   2|20.78| LCM|
|ID4|   1|    2| LWB|
|ID5|   6|    3|  LF|
|ID6|   7|    4|  LM|
|ID7|   8|    5|  RS|
+---+----+-----+----+

My goal is to replace the values in ColC: the values LB, LWB, LF should become x, and so on for the other groups shown below.

x = [LB,LWB,LF]
y = [RF,LCM]
z = [LM,RS]

Currently I'm able to achieve this by replacing each of the values manually, as in the code below:

import pyspark.sql.functions as f

# Replacing the values LB, LWB, LF with x
df_new = df.withColumn(
    'ColC',
    f.when((f.col('ColC') == 'LB') | (f.col('ColC') == 'LWB') | (f.col('ColC') == 'LF'), 'x')
     .otherwise(df.ColC)
)

My question is: how can I replace the values of a column (ColC in my example) by iterating through the lists (x, y, z) dynamically, all at once, in PySpark? What is the time complexity involved? Also, how can I truncate the decimal values in ColB to 1 decimal place?

Upvotes: 1

Views: 3033

Answers (4)

Shadowtrooper

Reputation: 1462

You can also try a regular expression, using regexp_replace:

import pyspark.sql.functions as f

replacements = [
    ("(LB)|(LWB)|(LF)", "x"),
    ("(LCM)|(RF)", "y"),
    ("(LM)|(RS)", "z")
]
for pattern, label in replacements:
    df = df.withColumn("colC", f.regexp_replace("colC", pattern, label))
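Note that regexp_replace matches substrings, so if your real data could contain codes that overlap one of the patterns (say, a hypothetical value like LFX), you may want to anchor the patterns to whole cell values. A minimal sketch of that variant:

import pyspark.sql.functions as f

# Anchored patterns so only exact cell values are rewritten
anchored = [
    ("^(LB|LWB|LF)$", "x"),
    ("^(LCM|RF)$", "y"),
    ("^(LM|RS)$", "z")
]
for pattern, label in anchored:
    df = df.withColumn("colC", f.regexp_replace("colC", pattern, label))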

Upvotes: 1

yasi

Reputation: 547

You can also use the isin function:

from pyspark.sql.functions import col, when

x = ['LB','LWB','LF']
y = ['LCM','RF']
z = ['LM','RS']

df = df.withColumn('ColC', when(col('colC').isin(x), "x")\
                  .otherwise(when(col('colC').isin(y), "y")\
                  .otherwise(when(col('colC').isin(z), "z")\
                  .otherwise(df.ColC))))
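If you want to build this chained expression dynamically instead of writing each when by hand, one option (a sketch, assuming the three lists are collected into a dict) is to fold them with functools.reduce:

from functools import reduce
from pyspark.sql.functions import col, when

groups = {'x': ['LB', 'LWB', 'LF'], 'y': ['LCM', 'RF'], 'z': ['LM', 'RS']}

# Fold the groups into one nested when/otherwise expression,
# keeping the original value when no list matches
expr = reduce(
    lambda acc, kv: when(col('colC').isin(kv[1]), kv[0]).otherwise(acc),
    groups.items(),
    col('colC')
)
df = df.withColumn('colC', expr)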

If you have a few lists with many values each, this isin approach has lower complexity than blackbishop's answer, but for this problem their answer is easier.

Upvotes: 1

blackbishop

Reputation: 32710

You can use replace on the DataFrame to replace the values in colC, passing a dict object for the mappings, and the round function to limit the number of decimals in colB:

from pyspark.sql import functions as F

replacement = {
    "LB": "x", "LWB": "x", "LF": "x",
    "RF": "y", "LCM": "y",
    "LM": "z", "RS": "z"
}

df1 = df.replace(replacement, ["colC"]).withColumn("colB", F.round("colB", 1))

df1.show()

#+---+----+----+----+
#| ID|colA|colB|colC|
#+---+----+----+----+
#|ID1|   3| 5.9|   x|
#|ID2|   4|12.7|   y|
#|ID3|   2|20.8|   y|
#|ID4|   1| 2.0|   x|
#|ID5|   6| 3.0|   x|
#|ID6|   7| 4.0|   z|
#|ID7|   8| 5.0|   z|
#+---+----+----+----+
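If you prefer to keep the three lists from the question rather than writing the dict by hand, the mapping can be derived from them; a small sketch:

x = ['LB', 'LWB', 'LF']
y = ['RF', 'LCM']
z = ['LM', 'RS']

# Invert the lists into the value -> label dict expected by df.replace
replacement = {value: label
               for label, values in [('x', x), ('y', y), ('z', z)]
               for value in values}
# {'LB': 'x', 'LWB': 'x', 'LF': 'x', 'RF': 'y', 'LCM': 'y', 'LM': 'z', 'RS': 'z'}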

Upvotes: 1

mck

Reputation: 42422

You can coalesce the when statements if you have many conditions to match. You can also use a dictionary to hold the value groups to be converted, and construct the when statements dynamically with a comprehension over the dict items. As for rounding to 1 decimal place, you can use round.

import pyspark.sql.functions as F

xyz_dict = {'x': ['LB','LWB','LF'],
            'y': ['RF','LCM'],
            'z': ['LM','RS']}

df2 = df.withColumn(
    'colC',
    F.coalesce(*[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()])
).withColumn(
    'colB',
    F.round('colB', 1)
)

df2.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1|   3| 5.9|   x|
|ID2|   4|12.7|   y|
|ID3|   2|20.8|   y|
|ID4|   1| 2.0|   x|
|ID5|   6| 3.0|   x|
|ID6|   7| 4.0|   z|
|ID7|   8| 5.0|   z|
+---+----+----+----+
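One thing to be aware of: if colC ever contains a code that is in none of the lists, the coalesce of the when expressions evaluates to null for that row. A small variant (same xyz_dict as above) that falls back to the original column instead:

df2 = df.withColumn(
    'colC',
    F.coalesce(
        # each when is null unless its list matches
        *[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()],
        # keep the original value for unmatched codes
        F.col('colC')
    )
)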

Upvotes: 3
