Reputation: 696
I have a pyspark data frame as
| ID|colA|colB |colC|
+---+----+-----+----+
|ID1| 3|5.85 | LB|
|ID2| 4|12.67| RF|
|ID3| 2|20.78| LCM|
|ID4| 1| 2 | LWB|
|ID5| 6| 3 | LF|
|ID6| 7| 4 | LM|
|ID7| 8| 5 | RS|
+---+----+----+----+
My goal is to replace the values in ColC as for the values of LB,LWB,LF with x and so on as shown below.
x = [LB,LWB,LF]
y = [RF,LCM]
z = [LM,RS]
Currently I'm able to achieve this by replacing each of the values manually as in below code :
# Replacing the values LB,LWF,LF with x
df_new = df.withColumn('ColC',f.when((f.col('ColC') == 'LB')|(f.col('ColC') == 'LWB')|(f.col('ColC') == 'LF'),'x').otherwise(df.ColC))
My question here is that how can we replace the values of a column (ColC in my example) by iterating through a list (x,y,z) dynamically at once using pyspark? What is the time complexity involved? Also, how can we truncate the decimal values in ColB to 1 decmial place?
Upvotes: 1
Views: 3033
Reputation: 1462
You can try also with a regular expression using regexp_replace:
import pyspark.sql.functions as f
replacements = [
("(LB)|(LWB)|(LF)", "x"),
("(LCM)|(RF)", "y"),
("(LM)|(RS)", "z")
]
for x, y in replacements:
df = df.withColumn("colC", f.regexp_replace("colC", x, y))
Upvotes: 1
Reputation: 547
Also you can use isin function:
from pyspark.sql.functions import col, when
x = ['LB','LWB','LF']
y = ['LCM','RF']
z = ['LM','RS']
df = df.withColumn('ColC', when(col('colC').isin(x), "x")\
.otherwise(when(col('colC').isin(y), "y")\
.otherwise(when(col('colC').isin(z), "z")\
.otherwise(df.ColC))))
If you have a few lists with too many values in this way your complexity is less than blackbishop answer but in this problem his answer is easier.
Upvotes: 1
Reputation: 32710
You can use replace
on dataframe to replace the values in colC
by passing a dict object for the mappings. And round
function to limit the number of decimals in colB
:
from pyspark.sql import functions as F
replacement = {
"LB": "x", "LWB": "x", "LF": "x",
"RF": "y", "LCM": "y",
"LM": "z", "RS": "z"
}
df1 = df.replace(replacement, ["colC"]).withColumn("colB", F.round("colB", 1))
df1.show()
#+---+----+----+----+
#| ID|colA|colB|colC|
#+---+----+----+----+
#|ID1| 3| 5.9| x|
#|ID2| 4|12.7| y|
#|ID3| 2|20.8| y|
#|ID4| 1| 2.0| x|
#|ID5| 6| 3.0| x|
#|ID6| 7| 4.0| z|
#|ID7| 8| 5.0| z|
#+---+----+----+----+
Upvotes: 1
Reputation: 42422
You can coalesce
the when
statements if you have many conditions to match. You can also use a dictionary to hold the columns to be converted, and construct the when
statements dynamically using a dict comprehension. As for rounding to 1 decimal place, you can use round
.
import pyspark.sql.functions as F
xyz_dict = {'x': ['LB','LWB','LF'],
'y': ['RF','LCM'],
'z': ['LM','RS']}
df2 = df.withColumn(
'colC',
F.coalesce(*[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()])
).withColumn(
'colB',
F.round('colB', 1)
)
df2.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5.9| x|
|ID2| 4|12.7| y|
|ID3| 2|20.8| y|
|ID4| 1| 2.0| x|
|ID5| 6| 3.0| x|
|ID6| 7| 4.0| z|
|ID7| 8| 5.0| z|
+---+----+----+----+
Upvotes: 3