MarkusG
MarkusG

Reputation: 55

Create a character variable with values conditional on previous variables (both their names and their values) with dplyr

I have a dataframe with ~ 100 variables and several thousand observations. Some of these observations are disqualified from further analysis because of the values in some of their variables. Instead of just deleting these disqualified observations I want to create a character variable that indicates if the observation has been disqualified and if so, because of which variables (one observation can be disqualified because of multiple variables).

Most variables are numeric and can have one of the following values: -1, 0, 1. In addition to that, the variables that could potentially disqualify an observation can also take the value 99, which means disqualified.

# create example data
df <- data.frame(id = c(1:6),
                 AA_B = c(1, 0, NA, 1, -1, 99),
                 A_B_C = c(0, 0, 0, -1, 1, NA),
                 A_BB = c(-1, 99, 0, 0, -1, NA),
                 B_C = c(99, NA, 1, 99, 0, 99),
                 D_AC = c(1, 1, 1, 1, -1, -1))

If an observation is disqualified, the new variable "disqualify" should be something like Disqualified because of A_BB or Disqualified because of AA_B and B_C (depending on which variables caused the disqualification), otherwise it could be any other string or just NA. So, the result should look like the following:

> df
  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

I'm struggling to find a way to automatically include the variable names of those variables that caused the disqualification in the string of "disqualify". So far I have found the following solution, but this is a horrible peace of code and I am sure that there must be a better way to do this.

df <-
  df %>%
  mutate(disqualify = case_when(AA_B == 99 |
                                  A_BB == 99 | 
                                  B_C == 99 ~ paste("Disqualified because of",
                                                    case_when(AA_B == 99 & (is.na(A_BB) | A_BB != 99) & (is.na(B_C) | B_C != 99) ~ deparse(substitute(AA_B)),
                                                              AA_B == 99 & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), sep = " and "),
                                                              AA_B == 99 & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              AA_B == 99 & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ paste(deparse(substitute(AA_B)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & B_C == 99 ~ paste(deparse(substitute(A_BB)), deparse(substitute(B_C)), sep = " and "),
                                                              (is.na(AA_B) | AA_B != 99) & A_BB == 99 & (is.na(B_C) | B_C != 99) ~ deparse(substitute(A_BB)),
                                                              (is.na(AA_B) | AA_B != 99) & (is.na(A_BB) | A_BB != 99) & B_C == 99 ~ deparse(substitute(B_C))
                                                              ))))

If possible, I'd favour a dplyr solution that allows me to call the disqualifying variables by their variables names (no indexing).

And, as a cherry on top, it would be fantastic if there was a way to replace the variable names in my output variable by another string. So Disqualified because of A_BB could become Disqualified because of Weather.

Any help is appreciated!

Upvotes: 3

Views: 321

Answers (3)

A. Suliman
A. Suliman

Reputation: 13125

library(dplyr)
df %>%
   #Check for 99 in specific columns
   mutate(disqualify = apply(.[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                   paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                   NA)))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

#Base R
df$disqualify <- apply(df[,c('AA_B','A_B_C','A_BB','B_C')], 1, function(x) ifelse(any(x==99), 
                                                             paste0("Disqualified because of ", paste(names(x[!is.na(x) & x==99]), collapse = " and ")), 
                                                             NA))

In base R we can apply a function over dataframe rows/columns depending on whether you pass 1 or 2. Here we need to apply a function into every row hence we used 1. see ?apply for more details.

Upvotes: 2

Zahiro Mor
Zahiro Mor

Reputation: 1718

a base one-liner :

df$disqualify <- apply(df,1,function(x)paste(names(which(x==99)),collapse = " and "))
> df
  id AA_B A_B_C A_BB B_C D_AC      disqualify
1  1    1     0   -1  99    1             B_C
2  2    0     0   99  NA    1            A_BB
3  3   NA     0    0   1    1                
4  4    1    -1    0  99    1             B_C
5  5   -1     1   -1   0   -1                
6  6   99    NA   NA  99   -1   AA_B and B_C

to have it exactly the way you wanted it - you can add:

df$disqualify <- ifelse(test = df$disqualify=="", yes = NA, no = paste('Disqualified because of ',df$disqualify))

> df
  id AA_B A_B_C A_BB B_C D_AC                            disqualify
1  1    1     0   -1  99    1          Disqualified because of  B_C
2  2    0     0   99  NA    1         Disqualified because of  A_BB
3  3   NA     0    0   1    1                                  <NA>
4  4    1    -1    0  99    1          Disqualified because of  B_C
5  5   -1     1   -1   0   -1                                  <NA>
6  6   99    NA   NA  99   -1 Disqualified because of  AA_B and B_C

if you want to change the names of the columns - why don't do that before this operation with
names(df) <- c("id","Weather", "Climate","Name3"...)

Upvotes: 0

tmfmnk
tmfmnk

Reputation: 39858

One dplyr and tidyr option could be:

df %>%
 left_join(df %>%
            pivot_longer(names_to = "variables", values_to = "values", -id, values_drop_na = TRUE) %>%
            group_by(id) %>%
            summarise(disqualify = if_else(all(values != 99), 
                                           NA_character_, 
                                           paste("Disqualified because of", paste0(variables[values == 99], collapse = " and ")))),
           by = c("id" = "id"))

  id AA_B A_B_C A_BB B_C D_AC                           disqualify
1  1    1     0   -1  99    1          Disqualified because of B_C
2  2    0     0   99  NA    1         Disqualified because of A_BB
3  3   NA     0    0   1    1                                 <NA>
4  4    1    -1    0  99    1          Disqualified because of B_C
5  5   -1     1   -1   0   -1                                 <NA>
6  6   99    NA   NA  99   -1 Disqualified because of AA_B and B_C

Upvotes: 2

Related Questions