Reputation: 137
I want to know how to replace outlier values with mean. I have dataframe I could able to find the outlier and filter the rows and now I want to replace it with mean values. How do I do it?
df is like:
a b
1 27 0
2 10 1
3 80 2
4 21 3
5 46 4
6 100 5
After finding IQR I get outliers as this:
Upper = 75
lower = 12
outliers = df.filter((df['a'] > upper) | (df['a'] < lower))
2 10 1
3 80 2
6 100 5
Now I found the mean:
from pyspark.sql.functions import mean as _mean, col
mean= df.select(_mean(col('a')).alias('mean')).collect()
mean = mean[0]['mean']
mean : 31.333
Now I am not understanding how do I replace the mean by rounding it to 31 and replace it with the outlier values in pyspark.
Upvotes: 0
Views: 1296
Reputation: 42352
You can use when
to replace the outlier values using the given conditions. To replace with the mean, you can use the mean
window function instead of collecting it to a variable, and round it to the nearest integer using F.round
:
from pyspark.sql import functions as F, Window
upper = 75
lower = 12
df2 = df.withColumn(
'a',
F.when(
(df['a'] > upper) | (df['a'] < lower),
F.round(F.mean('a').over(Window.orderBy(F.lit(1)))).cast('int')
# or you can use
# F.round(F.lit(df.select(F.mean(F.col('a')).alias('mean')).collect()[0]['mean'])).cast('int')
).otherwise(F.col('a'))
)
df2.show()
+---+---+
| a| b|
+---+---+
| 27| 0|
| 47| 1|
| 47| 2|
| 21| 3|
| 46| 4|
| 47| 5|
+---+---+
Upvotes: 2