maia-sh

Reputation: 641

`case_when()` passes NA through when !=

I am trying to create a new column that flags differences between two existing columns. An `NA` should be treated as a value, so a row where only one of the two columns is `NA` should be marked as "difference". However, the NAs are being "passed through" the `!=` comparison. I have looked for `case_when()` arguments that handle NAs and for alternative not-equal comparators, to no avail.

The below reprex shows the current output and the desired output.

Thank you in advance for your help!

library(dplyr)
library(tidyr)
library(tibble)

df <- 
  expand_grid(x = c("a", NA), y = c("b", NA)) %>% 
  add_row(x = "a", y = "a") %>% 
  add_row(x = "b", y = "b")

df
#> # A tibble: 6 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 a     b    
#> 2 a     <NA> 
#> 3 <NA>  b    
#> 4 <NA>  <NA> 
#> 5 a     a    
#> 6 b     b

# Undesired output: NAs are passed through instead of being treated as values
df %>% 
  mutate(z = case_when(
    x == "a" & y == "a" ~ "a",
    x == "b" &  y == "b" ~ "b",
    x != y ~ "difference"
  ))
#> # A tibble: 6 x 3
#>   x     y     z         
#>   <chr> <chr> <chr>     
#> 1 a     b     difference
#> 2 a     <NA>  <NA>      
#> 3 <NA>  b     <NA>      
#> 4 <NA>  <NA>  <NA>      
#> 5 a     a     a         
#> 6 b     b     b

# Desired output
df %>% 
  add_column(z = c(rep("difference", 3), NA_character_, "a", "b"))
#> # A tibble: 6 x 3
#>   x     y     z         
#>   <chr> <chr> <chr>     
#> 1 a     b     difference
#> 2 a     <NA>  difference
#> 3 <NA>  b     difference
#> 4 <NA>  <NA>  <NA>      
#> 5 a     a     a         
#> 6 b     b     b

Created on 2020-08-06 by the reprex package (v0.3.0)

Upvotes: 2

Views: 2924

Answers (2)

maia-sh

Reputation: 641

As @akrun mentioned, there is a workaround that combines `is.na()` with an exclusive or (`xor()`). Here is what I ended up using:

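df %>% 
  mutate(z = case_when(
    # Both present and equal: keep the shared value
    x == y ~ x,
    # Exactly one side is NA: count it as a difference
    xor(is.na(x), is.na(y)) ~ "difference",
    # Both present but unequal
    x != y ~ "difference",
    # Both NA: keep NA (also the default when no condition matches)
    is.na(x) & is.na(y) ~ NA_character_
  ))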
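
The same logic can be folded into a single NA-aware "not equal" helper. The sketch below uses base R only; the name `differs` is hypothetical and not part of dplyr or any other package. It returns `TRUE` when exactly one side is `NA` or when both sides are present and unequal, and `FALSE` otherwise, so it never returns `NA`.

```r
# Hypothetical helper: NA-aware "not equal" that never returns NA
differs <- function(a, b) {
  xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
}

# Same six rows as the question's df
x <- c("a", "a", NA, NA, "a", "b")
y <- c("b", NA, "b", NA, "a", "b")

# Differences are labelled; otherwise keep x (NA when both sides are NA)
z <- ifelse(differs(x, y), "difference", x)
z
```

Inside `case_when()`, the same helper could presumably replace the two "difference" branches with a single `differs(x, y) ~ "difference"` line.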

Upvotes: 1

akrun

Reputation: 887038

The issue is how `==` and `!=` handle `NA`: comparing any value to `NA` returns `NA`, and `case_when()` treats an `NA` condition as not matched. This can be corrected by adding `is.na()` checks to the comparison, but they then have to be repeated for each column. An easier fix is to replace the `NA`s with a different value, do the comparison, and bind the result to the original dataset:

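library(dplyr)
library(tidyr)  # for replace_na()

df %>% 
  # Swap NA for an empty string so the comparisons return TRUE/FALSE
  mutate(across(x:y, ~ replace_na(.x, ""))) %>%
  transmute(z = case_when(
    x == "a" & y == "a" ~ "a",
    x == "b" & y == "b" ~ "b",
    x != y ~ "difference"
  )) %>%
  # Attach z to the original data, which still has its NAs
  bind_cols(df, .)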
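
To see why the replacement helps, here is a minimal base R sketch of the comparison before and after the `NA`s are swapped out. It assumes the empty string `""` never occurs as a real value in either column; if it can, a different placeholder would be needed.

```r
# First four rows of the question's df
x <- c("a", "a", NA, NA)
y <- c("b", NA, "b", NA)

# Any comparison that involves NA yields NA
before <- x != y
before
#> [1] TRUE   NA   NA   NA

# Replace NA with a placeholder so the comparison is always defined
x2 <- ifelse(is.na(x), "", x)
y2 <- ifelse(is.na(y), "", y)

after <- x2 != y2
after
#> [1]  TRUE  TRUE  TRUE FALSE
```

The last element is `FALSE` because both placeholders are equal, which is why the row where both columns are `NA` still falls through to `NA` in `case_when()`.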

Upvotes: 2
