Pss

Reputation: 643

Why is this code working? [R] [dplyr] [mutate]

I have a numeric variable (FinalPerformance) that takes values from 0 to 100. I want to create factor levels from that variable, so I'm using quantile() to divide the dataset into three parts. My problem is that the code seems to categorize the column correctly (I think), but I'm not sure why. See below:

Performance <- Performance %>% mutate(FactorPerformance = as.factor(case_when(
  
  FinalPerformance <= quantile(FinalPerformance, probs = seq(0,1,1/3))[[1]] ~ "Low",
  FinalPerformance <= quantile(FinalPerformance, probs = seq(0,1,1/3))[[2]] ~ "Medium",
  TRUE ~ "High"
)), FactorPerformance = fct_relevel(FactorPerformance, c("Low", "Medium", "High")))

I expected the "Medium" level to overwrite the "Low" level, because the second condition above has no lower bound. Here is the version with an explicit range:

Performance <- Performance %>% mutate(FactorPerformance = as.factor(case_when(
  
  FinalPerformance <= quantile(FinalPerformance, probs = seq(0,1,1/3))[[1]] ~ "Low",
  FinalPerformance > quantile(FinalPerformance, probs = seq(0,1,1/3))[[1]] & 
    FinalPerformance <= quantile(FinalPerformance, probs = seq(0,1,1/3))[[2]] ~ "Medium",
  TRUE ~ "High"
)), FactorPerformance = fct_relevel(FactorPerformance, c("Low", "Medium", "High")))

I get the same result from both versions when I count the values in the FactorPerformance column. What am I missing?

Upvotes: 1

Views: 119

Answers (1)

Till

Reputation: 6628

This really comes down to the way dplyr::case_when() works:

case_when() is vectorized: inside mutate() it receives the whole column rather than one row at a time. Each condition is evaluated for every element, and for each element the conditions are checked from the top down. The value to the right of the tilde of the first condition that evaluates to TRUE is used, and all later conditions are ignored for that element.
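The first-match rule can be seen on a small vector. This is a minimal sketch: the vector x and its cut-offs (3 and 7) are made up for illustration and are not taken from the question's data.

```r
library(dplyr)

# Hypothetical scores, not from the question's data
x <- c(1, 5, 10)

# Each condition is checked for every element of x
x <= 3   # TRUE FALSE FALSE
x <= 7   # TRUE  TRUE FALSE

# The first element satisfies both conditions, but only the
# first matching condition determines its label
res <- case_when(
  x <= 3 ~ "Low",
  x <= 7 ~ "Medium",
  TRUE   ~ "High"
)
res
# "Low" "Medium" "High"
```

The first element stays "Low" even though x <= 7 is also TRUE for it, because the "Low" condition is listed first.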

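With regard to your code: a row that satisfies the "Low" condition never reaches the "Medium" condition, so "Medium" cannot overwrite it. That is also why adding the explicit lower bound changes nothing: it only excludes rows that were already labelled "Low". If you want "Medium" to take precedence over "Low", you have to list the "Medium" condition before the "Low" condition.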
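To check this against the two versions from the question, here is a rough sketch that runs both on the same data and compares the labels. The Performance table below is hypothetical stand-in data, and the column names implicit and explicit are introduced only for this comparison. The cut points are computed once and stored in cuts, but they use the same quantile() call and the same indices as the question.

```r
library(dplyr)

# Hypothetical data standing in for the question's Performance table
Performance <- tibble(FinalPerformance = c(0, 10, 20, 35, 50, 65, 80, 90, 100))

# Same quantile() call as in the question, computed once.
# probs = seq(0, 1, 1/3) yields four values (0%, 33.3%, 66.7%, 100%),
# so cuts[[1]] is the minimum of the column.
cuts <- quantile(Performance$FinalPerformance, probs = seq(0, 1, 1/3))

out <- Performance %>%
  mutate(
    # Version 1: "Medium" has no explicit lower bound
    implicit = case_when(
      FinalPerformance <= cuts[[1]] ~ "Low",
      FinalPerformance <= cuts[[2]] ~ "Medium",
      TRUE ~ "High"
    ),
    # Version 2: "Medium" has an explicit lower bound
    explicit = case_when(
      FinalPerformance <= cuts[[1]] ~ "Low",
      FinalPerformance > cuts[[1]] & FinalPerformance <= cuts[[2]] ~ "Medium",
      TRUE ~ "High"
    )
  )

identical(out$implicit, out$explicit)
# TRUE
```

One thing that may be worth checking separately from the ordering question: because cuts[[1]] is the 0% quantile, "Low" only matches rows equal to the minimum. If three roughly equal groups are the goal, the boundaries would presumably be cuts[[2]] and cuts[[3]].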

As a general explanation, below is a simplified example. All three conditions are TRUE, but since the line 10 %in% 5:10 ~ "foo" comes before the other two, the call returns "foo".

dplyr::case_when(
  10 %in% 5:10 ~ "foo",
  10 %in% 1:10 ~ "bar",
  TRUE ~ NA_character_
)

Here the first two conditions are swapped, and the output is "bar".

dplyr::case_when(
  10 %in% 1:10 ~ "bar",
  10 %in% 5:10 ~ "foo",
  TRUE ~ NA_character_
)

This is also the reason why TRUE ~ NA_character_ works as a catch-all for everything that is not captured by any of the previous conditions.
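As a small illustration of the catch-all, the sketch below leaves out the final TRUE condition, so elements that match none of the listed conditions come back as NA. The vector x and its cut-offs are again hypothetical.

```r
library(dplyr)

# Hypothetical scores, not from the question's data
x <- c(1, 5, 10)

# No catch-all: the last element matches neither condition
no_default <- case_when(
  x <= 3 ~ "Low",
  x <= 7 ~ "Medium"
)
no_default
# "Low" "Medium" NA
```

Adding TRUE ~ "High" as the last line would label that remaining element instead, since TRUE matches anything not already claimed by an earlier condition.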

Upvotes: 1
