Wendy

Reputation: 89

How to impute missing values with the column mean using sparklyr, for selected columns?

For Spark data frames in sparklyr, I know that NA values can be replaced with a fixed number using na.replace(number), and that I can use na.replace(x=something) for a hard-coded column.

Now I have a vector containing the names of the columns whose missing values I want to impute with the column mean. How can I fill in the mean for all the missing values in these columns?

I looked into using spark_apply to run mice on the data, but haven't figured out a solution yet.

Thank you!

Upvotes: 3

Views: 958

Answers (1)

zero323

Reputation: 330353

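You can use the Imputer transformer, exposed in sparklyr as ft_imputer. Let's say the data looks like this:

# Assumes library(sparklyr) and library(dplyr) are loaded and sc is an open connection from spark_connect().
df <- copy_to(sc, tibble(id=1:3, x=c(1, NA, 3), y=c(NA, 2, -1)))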

The transformer requires a vector of input column names and a matching vector of output column names:

input_cols <- c("x", "y")
output_cols <- paste0(input_cols, "_imp")

and can be applied as shown below:

df %>% 
  ft_imputer(input_cols=input_cols, output_cols=output_cols, strategy="mean")
# Source:   table<sparklyr_tmp_73a32e74369c> [?? x 5]
# Database: spark_connection
     id     x     y x_imp y_imp
  <int> <dbl> <dbl> <dbl> <dbl>
1     1     1   NaN     1   0.5
2     2   NaN     2     2   2  
3     3     3    -1     3  -1  
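
The output above keeps the original columns next to the imputed ones. If the goal is to end up with the imputed values under the original names, one option is to drop the originals and rename the new columns. The following is only a minimal sketch: replace_with_imputed is a hypothetical helper (not part of sparklyr), and it is shown here on a local tibble that stands in for the result above.

```r
library(dplyr)

# Hypothetical helper (not part of sparklyr): drop the original columns and
# give the imputed columns their names back.
replace_with_imputed <- function(tbl, input_cols, output_cols) {
  tbl %>%
    select(-all_of(input_cols)) %>%
    rename(!!!setNames(output_cols, input_cols))
}

# Local stand-in for the imputed table shown above.
local_result <- tibble(
  id = 1:3,
  x = c(1, NA, 3),
  y = c(NA, 2, -1),
  x_imp = c(1, 2, 3),
  y_imp = c(0.5, 2, -1)
)

cleaned <- replace_with_imputed(
  local_result,
  input_cols = c("x", "y"),
  output_cols = c("x_imp", "y_imp")
)
```

Since select and rename are ordinary dplyr verbs, the same helper should in principle also work when piped after ft_imputer on the Spark table, but that is an assumption and has not been checked here against a live connection.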

Upvotes: 3
