PageSim
PageSim

Reputation: 143

Checking for NAs by Row, then Aggregating their Column Names

I currently am struggling to get information row by row on which variables turned out well/which are NA. Below, I show an example/dput output of a section of my data.

head(big_test)
# A tibble: 3 x 19
     id ctr_n   ctr    yr    mn  nvvi ENP_nat ENP_avg ENP_wght inflation1 inflation2 inflation3 inflation4 PSNS  PSNS_s PSNS_w
  <dbl> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>    <dbl> <chr>      <chr>      <chr>      <chr>      <chr>  <dbl> <chr> 
1  1854 Isra~   376  2019     4     1    3.50    3.50     3.50 NA         NA         NA         NA         NA     0.962 NA    
2  1855 Isra~   376  2019     9     1    2.51    2.51     2.51 NA         NA         NA         NA         NA     0.992 NA    
3  1856 Isra~   376  2020     3     1    3.78    3.78     3.78 NA         NA         NA         NA         NA     0.999 NA    
# ... with 3 more variables: PSNS_sw <chr>, local_E <dbl>, cst_tot <dbl>

dput(big_test)
structure(list(id = c(1854, 1855, 1856), ctr_n = c("Israel", 
"Israel", "Israel"), ctr = c(376, 376, 376), yr = c(2019, 2019, 
2020), mn = c(4, 9, 3), nvvi = c(1, 1, 1), ENP_nat = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), ENP_avg = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), ENP_wght = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), inflation1 = c("NA", "NA", 
"NA"), inflation2 = c("NA", "NA", "NA"), inflation3 = c("NA", 
"NA", "NA"), inflation4 = c("NA", "NA", "NA"), PSNS = c("NA", 
"NA", "NA"), PSNS_s = c(0.961748183147869, 0.992275075925835, 
0.998547438416594), PSNS_w = c("NA", "NA", "NA"), PSNS_sw = c("NA", 
"NA", "NA"), local_E = c(1, 1, 1), cst_tot = c(1, 1, 1)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

EDIT: The NAs here are in quotes, which is inappropriate. I think the issue came from writing to .xlsx; the correct version is listed below in a dput output without the quotes on the NAs.

As you can see, the data is separated at the national level by election, where each row here should be unique (i.e., Israel, 2019, month 4). I am looking to create a character column which lists the variables missing in this output. Here is an example of the column desired:

desired_output <- tibble(missing_vars=paste("inflation1","inflation2","inflation3","inflation4","etc",sep=";"))

head(desired_output)
# A tibble: 1 x 1
  missing_vars                                   
  <chr>                                          
1 inflation1;inflation2;inflation3;inflation4;etc

Thus, I am wondering if there is some sort of loop that can slice a unique election, look at missing columns, then vector those that are missing? This is crucial to automate because some variables can be missing where others are present for the same country/year. I have tried to count them, but I cannot figure out how to list those column names as a character column.

Any help is appreciate. Thank you!

Upvotes: 1

Views: 39

Answers (2)

akrun
akrun

Reputation: 887531

Here is an option with tidyverse, where we reshape to 'long' format with pivot_longer, grouped by the row_number(), paste the column names where there are missing value in the 'value'

library(dplyr)
library(tidyr)
library(stringr)
big_test %>%
     select(starts_with('inflation')) %>% 
     mutate(rn = row_number()) %>% 
     pivot_longer(cols = -rn) %>% 
     group_by(rn) %>%
     summarise(missing_vars = str_c(name[is.na(value)], collapse=";"),
          .groups = 'drop') %>%
     select(-rn)          

Without reshaping, an option is rowwise/c_across

big_test %>% 
   rowwise %>% 
   transmute(missing_vars = str_c(names(select(cur_data(), 
      starts_with('inflation')))[which(c_across(starts_with('inflation')) 
        == 'NA')], collapse=";"))

Here, it is comparing (==) with "NA". If it is a real NA, use is.na instead of ==

Upvotes: 2

Vincent
Vincent

Reputation: 17785

As noted by @akrun, you have "NA" strings instead of NA. Once that is fixed, you can define a missing function and apply it to each row to create a new variable:

big_test <- structure(list(id = c(1854, 1855, 1856), ctr_n = c("Israel", 
"Israel", "Israel"), ctr = c(376, 376, 376), yr = c(2019, 2019, 
2020), mn = c(4, 9, 3), nvvi = c(1, 1, 1), ENP_nat = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), ENP_avg = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), ENP_wght = c(3.50348063163162, 
2.51319610127466, 3.78468892335972), inflation1 = c(NA, NA, 
NA), inflation2 = c(NA, NA, NA), inflation3 = c(NA, 
NA, NA), inflation4 = c(NA, NA, NA), PSNS = c(NA, 
NA, NA), PSNS_s = c(0.961748183147869, 0.992275075925835, 
0.998547438416594), PSNS_w = c(NA, NA, NA), PSNS_sw = c(NA, 
NA, NA), local_E = c(1, 1, 1), cst_tot = c(1, 1, 1)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

missing <- function(x) {
  idx <- is.na(unlist(x))
  paste(colnames(big_test)[idx], collapse=", ")
}

big_test$missing <- apply(big_test, 1, missing)

big_test$missing
#> [1] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"
#> [2] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"
#> [3] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"

Upvotes: 1

Related Questions