Reputation: 55
I have a dataframe with ~ 100 variables and several thousand observations. Some of these observations are disqualified from further analysis because of the values in some of their variables. Instead of just deleting these disqualified observations I want to create a character variable that indicates if the observation has been disqualified and if so, because of which variables (one observation can be disqualified because of multiple variables).
Most variables are numeric and can have one of the following values: -1
, 0
, 1
. In addition to that, the variables that could potentially disqualify an observation can also take the value 99
, which means disqualified.
# create example data
df <- data.frame(id = c(1:6),
AA_B = c(1, 0, NA, 1, -1, 99),
A_B_C = c(0, 0, 0, -1, 1, NA),
A_BB = c(-1, 99, 0, 0, -1, NA),
B_C = c(99, NA, 1, 99, 0, 99),
D_AC = c(1, 1, 1, 1, -1, -1))
If an observation is disqualified, the new variable "disqualify" should be something like Disqualified because of A_BB
or Disqualified because of AA_B and B_C
(depending on which variables caused the disqualification), otherwise it could be any other string or just NA
. So, the result should look like the following:
> df
id AA_B A_B_C A_BB B_C D_AC disqualify
1 1 1 0 -1 99 1 Disqualified because of B_C
2 2 0 0 99 NA 1 Disqualified because of A_BB
3 3 NA 0 0 1 1 <NA>
4 4 1 -1 0 99 1 Disqualified because of B_C
5 5 -1 1 -1 0 -1 <NA>
6 6 99 NA NA 99 -1 Disqualified because of AA_B and B_C
I'm struggling to find a way to automatically include the variable names of those variables that caused the disqualification in the string of "disqualify". So far I have found the following solution, but this is a horrible peace of code and I am sure that there must be a better way to do this.
df <-
df %>%
mutate(disqualify = case_when(AA_B == 99 |
A_BB == 99 |
B_C == 99 ~ paste("Disqualified because of",
case_when(AA_B == 99 & (is.na(A_BB) | A_BB != 99) & (is.na(B_C) | B_C != 99) ~ deparse(substitute(AA_B)),
AA_B == 99 & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), sep = " and "),
AA_B == 99 & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
AA_B == 99 & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(B_C)), sep = " and "),
(is.na(AA_B) | AA_B != 99) & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
(is.na(AA_B) | AA_B != 99) & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ deparse(substitute(A_BB)),
(is.na(AA_B) | AA_B != 99) & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ deparse(substitute(B_C))
))))
If possible, I'd favour a dplyr solution that allows me to call the disqualifying variables by their variables names (no indexing).
And, as a cherry on top, it would be fantastic if there was a way to replace the variable names in my output variable by another string. So Disqualified because of A_BB
could become Disqualified because of Weather
.
Any help is appreciated!
Upvotes: 3
Views: 321
Reputation: 13125
library(dplyr)
df %>%
#Check for 99 in specific columns
mutate(disqualify = apply(.[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99),
paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")),
NA)))
id AA_B A_B_C A_BB B_C D_AC disqualify
1 1 1 0 -1 99 1 Disqualified because of B_C
2 2 0 0 99 NA 1 Disqualified because of A_BB
3 3 NA 0 0 1 1 <NA>
4 4 1 -1 0 99 1 Disqualified because of B_C
5 5 -1 1 -1 0 -1 <NA>
6 6 99 NA NA 99 -1 Disqualified because of AA_B and B_C
#Base R
df$disqualify <- apply(df[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99),
paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")),
NA))
In base R we can apply
a function over dataframe rows/columns depending on whether you pass 1 or 2. Here we need to apply
a function into every row hence we used 1. see ?apply
for more details.
Upvotes: 2
Reputation: 1718
a base one-liner :
df$disqualify <- apply(df,1,function(x)paste(names(which(x==99)),collapse = " and "))
> df
id AA_B A_B_C A_BB B_C D_AC disqualify
1 1 1 0 -1 99 1 B_C
2 2 0 0 99 NA 1 A_BB
3 3 NA 0 0 1 1
4 4 1 -1 0 99 1 B_C
5 5 -1 1 -1 0 -1
6 6 99 NA NA 99 -1 AA_B and B_C
to have it exactly the way you wanted it - you can add:
df$disqualify <- ifelse(test = df$disqualify=="", yes = NA, no = paste('Disqualified because of ',df$disqualify))
> df
id AA_B A_B_C A_BB B_C D_AC disqualify
1 1 1 0 -1 99 1 Disqualified because of B_C
2 2 0 0 99 NA 1 Disqualified because of A_BB
3 3 NA 0 0 1 1 <NA>
4 4 1 -1 0 99 1 Disqualified because of B_C
5 5 -1 1 -1 0 -1 <NA>
6 6 99 NA NA 99 -1 Disqualified because of AA_B and B_C
if you want to change the names of the columns - why don't do that before this operation with
names(df) <- c("id","Weather", "Climate","Name3"...)
Upvotes: 0
Reputation: 39858
One dplyr
and tidyr
option could be:
df %>%
left_join(df %>%
pivot_longer(names_to = "variables", values_to = "values", -id, values_drop_na = TRUE) %>%
group_by(id) %>%
summarise(disqualify = if_else(all(values != 99),
NA_character_,
paste("Disqualified because of", paste0(variables[values == 99], collapse = " and ")))),
by = c("id" = "id"))
id AA_B A_B_C A_BB B_C D_AC disqualify
1 1 1 0 -1 99 1 Disqualified because of B_C
2 2 0 0 99 NA 1 Disqualified because of A_BB
3 3 NA 0 0 1 1 <NA>
4 4 1 -1 0 99 1 Disqualified because of B_C
5 5 -1 1 -1 0 -1 <NA>
6 6 99 NA NA 99 -1 Disqualified because of AA_B and B_C
Upvotes: 2