Reputation: 73
I'm trying to fill missing values in my PySpark 3.0.1 DataFrame with the column means. I'm looking for a pandas-like fillna
function. For example:
df = df.fillna(df.mean())
But so far, all I have found for PySpark is how to fill missing values with the mean of a single column, not how to do it for the whole dataset.
Can you suggest how to implement a pandas-like fillna
in PySpark?
Upvotes: 1
Views: 912
Reputation: 42332
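You can try this to get a single mean across all columns (the average of the per-column means) and use it to fill every missing value:
import pyspark.sql.functions as F
import numpy as np
# Mean of each column; columns with no numeric mean come back as None and are skipped
col_means = df.select([F.mean(c) for c in df.columns]).collect()[0]
avg = np.mean([m for m in col_means if m is not None])
# fillna with a scalar number applies the same value to every numeric column
df2 = df.fillna(float(avg))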
Upvotes: 1