Reputation: 143
I currently am struggling to get information row by row on which variables turned out well/which are NA. Below, I show an example/dput output of a section of my data.
head(big_test)
# A tibble: 3 x 19
id ctr_n ctr yr mn nvvi ENP_nat ENP_avg ENP_wght inflation1 inflation2 inflation3 inflation4 PSNS PSNS_s PSNS_w
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 1854 Isra~ 376 2019 4 1 3.50 3.50 3.50 NA NA NA NA NA 0.962 NA
2 1855 Isra~ 376 2019 9 1 2.51 2.51 2.51 NA NA NA NA NA 0.992 NA
3 1856 Isra~ 376 2020 3 1 3.78 3.78 3.78 NA NA NA NA NA 0.999 NA
# ... with 3 more variables: PSNS_sw <chr>, local_E <dbl>, cst_tot <dbl>
dput(big_test)
structure(list(id = c(1854, 1855, 1856), ctr_n = c("Israel",
"Israel", "Israel"), ctr = c(376, 376, 376), yr = c(2019, 2019,
2020), mn = c(4, 9, 3), nvvi = c(1, 1, 1), ENP_nat = c(3.50348063163162,
2.51319610127466, 3.78468892335972), ENP_avg = c(3.50348063163162,
2.51319610127466, 3.78468892335972), ENP_wght = c(3.50348063163162,
2.51319610127466, 3.78468892335972), inflation1 = c("NA", "NA",
"NA"), inflation2 = c("NA", "NA", "NA"), inflation3 = c("NA",
"NA", "NA"), inflation4 = c("NA", "NA", "NA"), PSNS = c("NA",
"NA", "NA"), PSNS_s = c(0.961748183147869, 0.992275075925835,
0.998547438416594), PSNS_w = c("NA", "NA", "NA"), PSNS_sw = c("NA",
"NA", "NA"), local_E = c(1, 1, 1), cst_tot = c(1, 1, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
EDIT: The NAs here are in quotes, which is inappropriate. I think the issue came from writing to .xlsx; the correct version is listed below in a dput
output without the quotes on the NAs.
As you can see, the data is separated at the national level by election, where each row here should be unique (i.e., Israel, 2019, month 4). I am looking to create a character column which lists the variables missing in this output. Here is an example of the column desired:
desired_output <- tibble(missing_vars=paste("inflation1","inflation2","inflation3","inflation4","etc",sep=";"))
head(desired_output)
# A tibble: 1 x 1
missing_vars
<chr>
1 inflation1;inflation2;inflation3;inflation4;etc
Thus, I am wondering if there is some sort of loop that can slice a unique election, look at missing columns, then vector those that are missing? This is crucial to automate because some variables can be missing where others are present for the same country/year. I have tried to count them, but I cannot figure out how to list those column names as a character column.
Any help is appreciate. Thank you!
Upvotes: 1
Views: 39
Reputation: 887531
Here is an option with tidyverse
, where we reshape to 'long' format with pivot_longer
, grouped by the row_number()
, paste
the column names where there are missing value in the 'value'
library(dplyr)
library(tidyr)
library(stringr)
big_test %>%
select(starts_with('inflation')) %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
group_by(rn) %>%
summarise(missing_vars = str_c(name[is.na(value)], collapse=";"),
.groups = 'drop') %>%
select(-rn)
Without reshaping, an option is rowwise/c_across
big_test %>%
rowwise %>%
transmute(missing_vars = str_c(names(select(cur_data(),
starts_with('inflation')))[which(c_across(starts_with('inflation'))
== 'NA')], collapse=";"))
Here, it is comparing (==
) with "NA". If it is a real NA
, use is.na
instead of ==
Upvotes: 2
Reputation: 17785
As noted by @akrun, you have "NA" strings instead of NA. Once that is fixed, you can define a missing function and apply it to each row to create a new variable:
big_test <- structure(list(id = c(1854, 1855, 1856), ctr_n = c("Israel",
"Israel", "Israel"), ctr = c(376, 376, 376), yr = c(2019, 2019,
2020), mn = c(4, 9, 3), nvvi = c(1, 1, 1), ENP_nat = c(3.50348063163162,
2.51319610127466, 3.78468892335972), ENP_avg = c(3.50348063163162,
2.51319610127466, 3.78468892335972), ENP_wght = c(3.50348063163162,
2.51319610127466, 3.78468892335972), inflation1 = c(NA, NA,
NA), inflation2 = c(NA, NA, NA), inflation3 = c(NA,
NA, NA), inflation4 = c(NA, NA, NA), PSNS = c(NA,
NA, NA), PSNS_s = c(0.961748183147869, 0.992275075925835,
0.998547438416594), PSNS_w = c(NA, NA, NA), PSNS_sw = c(NA,
NA, NA), local_E = c(1, 1, 1), cst_tot = c(1, 1, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
missing <- function(x) {
idx <- is.na(unlist(x))
paste(colnames(big_test)[idx], collapse=", ")
}
big_test$missing <- apply(big_test, 1, missing)
big_test$missing
#> [1] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"
#> [2] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"
#> [3] "inflation1, inflation2, inflation3, inflation4, PSNS, PSNS_w, PSNS_sw"
Upvotes: 1