Reputation: 89
For Spark data frames in sparklyr, I know that NA values can be replaced with a fixed number using na.replace(number), and that I can use na.replace(x=something) for a hard-coded column.
Now I have a vector containing the names of the columns whose missing values I want to impute with the column mean. How can I fill in the mean for all the missing values in these columns?
I looked into using spark_apply to run mice on it, but haven't figured out a solution yet.
Thank you!
Upvotes: 3
Views: 958
Reputation: 330353
You can use Imputer (ft_imputer in sparklyr). Let's say the data looks like this:
df <- copy_to(sc, tibble(id=1:3, x=c(1, NA, 3), y=c(NA, 2, -1)))
The transformer requires input and output column lists:
input_cols <- c("x", "y")
output_cols <- paste0(input_cols, "_imp")
and can be applied as shown below:
df %>%
  ft_imputer(input_cols=input_cols, output_cols=output_cols, strategy="mean")
# Source:   table<sparklyr_tmp_73a32e74369c> [?? x 5]
# Database: spark_connection
     id     x     y x_imp y_imp
  <int> <dbl> <dbl> <dbl> <dbl>
1     1     1   NaN     1   0.5
2     2   NaN     2     2   2
3     3     3    -1     3  -1
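The transformer writes the imputed values to new columns, whereas the question asks for the means to end up in the original columns. A minimal sketch of one way to get there is to impute into temporary columns, drop the originals, and rename the temporary columns back. The helper impute_mean_in_place and its suffix argument are hypothetical names introduced here for illustration and are not part of sparklyr. The sketch assumes an open Spark connection and that no existing column already uses the suffixed names.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper (not part of sparklyr): mean-impute the columns named in
# `cols` and return the result under the original column names.
impute_mean_in_place <- function(tbl, cols, suffix = "_imp") {
  # Temporary output names, e.g. "x" -> "x_imp"
  tmp_cols <- paste0(cols, suffix)
  tbl %>%
    ft_imputer(input_cols = cols, output_cols = tmp_cols, strategy = "mean") %>%
    # Drop the original columns that still contain the missing values
    select(-all_of(cols)) %>%
    # Rename each temporary column back to its original name (new = old)
    rename(!!!setNames(tmp_cols, cols))
}

# Usage, assuming `df` and `input_cols` are defined as above:
# df %>% impute_mean_in_place(input_cols)
```

Note that the imputed columns end up after the untouched ones, so add a select() afterwards if the original column order matters.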
Upvotes: 3