How to replace outlier values with mean in pyspark?

Question

I want to know how to replace outlier values with mean. I have dataframe I could able to find the outlier and filter the rows and now I want to replace it with mean values. How do I do it?

df is like:

       a     b
1      27    0
2      10    1
3      80    2
4      21    3
5      46    4
6      100   5

After finding IQR I get outliers as this:

Upper = 75
lower = 12
outliers = df.filter((df['a'] > upper) | (df['a'] < lower))
2      10    1
3      80    2
6      100   5

Now I found the mean:

from pyspark.sql.functions import mean as _mean, col
mean= df.select(_mean(col('a')).alias('mean')).collect()
mean = mean[0]['mean']
mean : 31.333

Now I am not understanding how do I replace the mean by rounding it to 31 and replace it with the outlier values in pyspark.

mck · Accepted Answer

You can use when to replace the outlier values using the given conditions. To replace with the mean, you can use the mean window function instead of collecting it to a variable, and round it to the nearest integer using F.round:

from pyspark.sql import functions as F, Window

upper = 75
lower = 12

df2 = df.withColumn(
    'a', 
    F.when(
        (df['a'] > upper) | (df['a'] < lower), 
        F.round(F.mean('a').over(Window.orderBy(F.lit(1)))).cast('int')
        # or you can use 
        # F.round(F.lit(df.select(F.mean(F.col('a')).alias('mean')).collect()[0]['mean'])).cast('int')
    ).otherwise(F.col('a'))
)

df2.show()
+---+---+
|  a|  b|
+---+---+
| 27|  0|
| 47|  1|
| 47|  2|
| 21|  3|
| 46|  4|
| 47|  5|
+---+---+

How to replace outlier values with mean in pyspark?

Answers (1)

Related Questions