pysparkLearner

Reputation: 73

Filling missing values with the mean for all columns in PySpark

I'm trying to fill missing values in my PySpark 3.0.1 DataFrame with the column means. I'm looking for something like the pandas fillna function. For example:

df=df.fillna(df.mean())

So far, all I have found for PySpark is how to fill missing values with the mean for a single column, not for the whole dataset.

Can you suggest how I can implement a pandas-like fillna in PySpark?

Upvotes: 1

Views: 912

Answers (1)

mck

Reputation: 42332

You can try this, which averages the means of all columns into a single value and fills every column with it:

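import pyspark.sql.functions as F
import numpy as np

# Per-column means, collected to the driver as one Row (None where a column has no mean)
col_means = df.select([F.mean(c) for c in df.columns]).collect()[0]

# Average the column means into one scalar, skipping None entries
avg = float(np.mean([m for m in col_means if m is not None]))

# Fill nulls in the numeric columns with that scalar
df2 = df.fillna(avg)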

Upvotes: 1
