Filling missing values with the mean for all columns in PySpark

Question

I'm trying to fill missing values in my PySpark 3.0.1 DataFrame using the mean. I'm looking for a pandas-like fillna function. For example:

df=df.fillna(df.mean())

But so far, all I have found for PySpark is how to fill missing values with the mean for a single column, not for the whole dataset.

Can you suggest how I can implement a pandas-like fillna in PySpark?

mck · Accepted Answer

You can try this, which computes the mean of each column, averages those means into a single value, and fills every null with that value:

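import pyspark.sql.functions as F
import numpy as np

# Compute every column's mean in one pass, skip columns with no mean (None),
# then average those means into a single scalar.
col_means = df.select([F.mean(c) for c in df.columns]).collect()[0]
avg = np.mean([m for m in col_means if m is not None])

# Fill nulls in all numeric columns with that one value.
df2 = df.fillna(avg)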
