user3262756
user3262756

Reputation: 679

Substitute NAs by value present in grouped variables

Fake data

fruit <- c("Orange", "Banana", "Orange", "Banana")
flavour <- c("Bitter", NA, NA, "Sweet")
geo <- c(NA, NA, NA, "France")
value <- c(1, NA, NA, 4)
dd <- data.frame(fruit, flavour, geo, value)
rm(fruit, flavour, geo, value)

enter image description here

I'd like to group the dataset by 'fruit' and substitute the missing values in all variables with the value present in the grouped data.

Desired output

fruit <- c("Orange", "Banana", "Orange", "Banana")
flavour <- c("Bitter", "Sweet", "Bitter", "Sweet")
geo <- c(NA, "France", NA, "France")
value <- c(1, 4, 1, 4)
dd2 <- data.frame(fruit, flavour, geo, value)
rm(fruit, flavour, geo, value)

enter image description here

Code attempt

tt <- dd %>%
  group_by(fruit) %>%
  summarise_all()

Upvotes: 3

Views: 47

Answers (2)

Jelger van Zaane
Jelger van Zaane

Reputation: 69

I created a function to do exactly this. It relies on dplyr::coalesce. It will only fill the values if all values in the group are equal. it's also possible to specify a minimum (n or Percentage) of known values, to prevent a single value to fill all values in a group

library(dplyr)
fill_missing <- function (x, min_known_n = NULL, min_known_p = NULL) {
    if (NA %in% x) {
        y <- na.omit(x)
        y_n_distinct <- length(unique(y))
        if (!is.null(min_known_n)) {
            known_n <- length(y)
            if (known_n < min_known_n) {
                return(x)
            }
        }
        if (!is.null(min_known_p)) {
            known_p <- length(y)/length(x)
            if (known_p < min_known_p) {
                return(x)
            }
        }
        if (y_n_distinct == 1) {
            x <- dplyr::coalesce(x, y[1])
        }

    }
    return(x)
}

dd %>% group_by(fruit) %>% 
    mutate_all(fill_missing)

Upvotes: 1

akrun
akrun

Reputation: 886938

We need mutate_all after the group_by (summarise/summarise_all is used to get a single row from multiple rows). Use na.locf from zoo to fill the NA elements with the adjacent non-NA elements for each column

library(zoo)
library(dplyr)
dd %>% 
  group_by(fruit) %>% 
  mutate_all(funs(na.locf(na.locf(., na.rm =  FALSE), 
                 fromLast = TRUE, na.rm = FALSE)))
# A tibble: 4 x 4
# Groups:   fruit [2]
#  fruit  flavour geo    value
#  <fct>  <fct>   <fct>  <dbl>
#1 Orange Bitter  <NA>       1
#2 Banana Sweet   France     4
#3 Orange Bitter  <NA>       1
#4 Banana Sweet   France     4

Upvotes: 4

Related Questions