J. Doe
J. Doe

Reputation: 1740

R - consolidate missing data in multiple columns

Say that you have a data frame named df with 25 columns, first of them being named ID, second to thirteenth having names from A1 to A12 and fourteenth to twenty-fifth having names from B1 to B12. Values in the A and B variables can be missing data.

The task that I am facing is to consolidate the missingness in the data - if there is a missing entry in, say, 8th row of A4, then the 8th row of B4 also needs to be updated to NA, even if it has some data. This also works vice versa, if there is a missing entry in, say, 19th row of B11, then the 19th row of A11 also needs to be missing.

I can do this with two for loops:

for(i in 2:13){
  for(j in 1:nrow(df)){
    if(is.na(df[j,i+12])){
      df[j,i] <- NA
    }
  }
}

for(i in 14:25){
  for(j in 1:nrow(df)){
    if(is.na(df[j,i-12])){
      df[j,i] <- NA
    }
  }
}

However, I am looking for a solution that doesn't include for loops and is preferrably in tidyverse. How could this be done more efficiently?

Upvotes: 1

Views: 62

Answers (5)

Aur&#232;le
Aur&#232;le

Reputation: 12819

We could create masks of the NA values in both sub-tables, combine them, and apply them back to both sub-tables:

na_mask <- is.na(df[2:13]) | is.na(df[14:25])

df[2:13][na_mask] <- NA
df[14:25][na_mask] <- NA

Upvotes: 1

Mossa
Mossa

Reputation: 1708

I've got a solution that showcases some advantages of reformatting one's data.

Let me first generate some data (since we received none)

library(tidyverse)

sample_nrows <- 10
full_df <- 
  tibble(
    ID = rep(seq_len(sample_nrows), each = 12 + 12),
    name = c(str_c("A", 1:12),
             str_c("B", 1:12)) %>%
      rep(sample_nrows),
    value = 
      rgamma(
        n = 12 * 2 * sample_nrows,
        shape = sample.int(20, size = 10)
      ) %>% 
      round())

full_df %>% 
  #' add 10% missingness
  mutate(value = if_else(rbinom(n(), size = 1, prob = 0.1) %>% as.logical(), NA_real_, value)) %>% 
  #' reconstruct into wide-format
  pivot_wider() %>%
  print(n = Inf, width = Inf) ->
  partial_df

Thus, we generate gamma-distributed data, for sample_nrows-rows (that's full_df), and we add 10% missing data to the whole thing and call it partial_df.

This formatting gives us a novel idea. Working with the long-format will give this result...

partial_df %>% 
  pivot_longer(-ID) %>% 
  tidyr::extract(name, c("name", "var_id"), regex = "(\\D+)(\\d+)") %>% 
  pivot_wider(names_from = "name") %>% 
  mutate(
    both_na = is.na(A) | is.na(B),
    A = if_else(both_na, NA_real_, A),
    B = if_else(both_na, NA_real_, B),
    both_na = NULL
  ) -> partial_df_with_NAs

To get back to the original format:

partial_df_with_NAs %>%
  pivot_wider(names_from = var_id, values_from = c(A,B), names_sep = "") %>% 
  print(n = Inf, width = Inf)

Upvotes: 0

Aur&#232;le
Aur&#232;le

Reputation: 12819

We could pivot to long format, then in a given row that contains NA, replace all values by NA, then pivot back to wide:

spec <- 
  df %>% 
  build_longer_spec(cols = -ID,
                    names_to = c(".value", "set"),
                    names_pattern = "(.+)(\\d+)",
                    values_to = "value")

df %>% 
  pivot_longer_spec(spec) %>% 
  print() %>% 
  # Intermediary long format:
  #> # A tibble: 6 x 4
  #>      ID set       A     B
  #>   <int> <chr> <dbl> <dbl>
  #> 1     1 1         1    NA
  #> 2     1 2         4    10
  #> 3     2 1         2     8
  #> 4     2 2         5    NA
  #> 5     3 1         3     9
  #> 6     3 2        NA    12
  rowwise(ID, set) %>% 
  mutate(across(everything(), 
                ~ ifelse(any(is.na(c_across(everything()))), NA, .x))) %>% 
  pivot_wider_spec(spec)

#> # A tibble: 3 x 5
#>      ID    A1    A2    B1    B2
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1    NA     4    NA    10
#> 2     2     2    NA     8    NA
#> 3     3     3    NA     9    NA

With sample data:

df <- data.frame(
  ID = 1:3,
  A1 = c(1, 2, 3),
  A2 = c(4, 5, NA),
  B1 = c(NA, 8, 9),
  B2 = c(10, NA, 12)
)
df
#>   ID A1 A2 B1 B2
#> 1  1  1  4 NA 10
#> 2  2  2  5  8 NA
#> 3  3  3 NA  9 12

Upvotes: 1

Daniel Hoop
Daniel Hoop

Reputation: 692

I believe that the suggested solution should be sufficiently performant. Looping over the columns is really not a big deal.

Thanks for voting up and accepting as answer if you like it.

# Create data
nrows <- 10
ncols <- 25
df1 <- as.data.frame(matrix(1:(nrows*ncols), nrow = nrows))
colnames(df1) <- c(
  "ID",
  paste0("A", 1:12),
  paste0("B", 1:12)
)

df1[1:3, c("A1")] <- NA
df1[5:7, c("B5")] <- NA

# Prepare calculation
cols <- as.list(as.data.frame(
  matrix(c(paste0("A", 1:12),
           paste0("B", 1:12)), nrow = 2, byrow = TRUE)
))

# Do calculation
for (col in cols) {
  missing <- is.na(rowSums(df1[col]))
  df1[missing, col] <- NA
}

Upvotes: 0

user63230
user63230

Reputation: 4636

How about this?

#create dataset
library(tidyverse)
library(missForest)
df <- data.frame(id = c(1:10))
df[paste0("a", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df[paste0("b", 1:10)] <- lapply(1:10, function(x) rnorm(10, x))
df <- bind_cols(df[1], missForest::prodNA(df[-1], noNA = 0.2)) #add NAs
df

purrr::map over variables:

df[paste0("a", 1:10)] <- map2(df %>% select(starts_with("a")), df %>% select(starts_with("b")),
                                                   ~ ifelse(is.na(.y), NA, .x))
df[paste0("b", 1:10)] <- map2(df %>% select(starts_with("b")), df %>% select(starts_with("a")),
                              ~ ifelse(is.na(.y), NA, .x))
df

Upvotes: 1

Related Questions