steve---g
steve---g

Reputation: 385

R, dplyr: how to change the value in one column to NA based on NA values in other columns (using OR operator)

This problem is hard to express in a title line, but it is a different question than the usual "how do I change the value in one column based on a value in another column." There are two wrinkles I haven't seen dealt with elsewhere: 1) I'm dealing with NAs, not values, 2) I'm dealing with a condition involving an OR relationship between multiple columns, i.e., if V01 is NA or V02 is NA, then make var foo (which already exists with no NAs) NA for that row.

Here's a small tibble to illustrate:

tbl <- tibble(
  V01 = 10:15,
  V02 = 1:6,
  V03 = V02 * 2,
  foo = V03 * 2
)
# sprinkle around some NAs
tbl[1,2] = NA
tbl[4,1] = NA
tbl[4,2] = NA
tbl[6,1:3] = NA

that produces this table:

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2     4
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    16
5    14     5    10    20
6    NA    NA    NA    24

Ultimately I want to produce this kind of logic for each row in tbl:

if (V01 == NA | V02 == NA | V03 == NA) {foo = NA}

I started by trying to use mutateto change foo to NA based on one column, like this:

tbl <- tbl %>% 
  mutate(
  foo = case_when(V01 == NA ~ NA)
  )

But even in this simple case, mutate changed every value of foo to NA:

# A tibble: 6 x 4
    V01   V02   V03 foo  
  <int> <int> <dbl> <lgl>
1    10    NA     2 NA   
2    11     2     4 NA   
3    12     3     6 NA   
4    NA    NA     8 NA   
5    14     5    10 NA   
6    NA    NA    NA NA   

I thought maybe I needed a rowwise in there, but that didn't change the result.

Then I thought maybe it's because NAs are weird, so let's try something simpler: change foo to -1 if V01 is NA:

tbl <- tbl %>% 
  mutate(
  foo = case_when(V01 == NA ~ -1)
  )

But that produced the same result, all values of foo changed to NA (and the value of foo in row 4 did not change to -1 either).

Then I decided to do something even simpler. Leave the NAs out of the picture. If V01 is 10 (as in the first row only), change the foo value in that row to -1. So:

tbl <- tbl %>% 
  mutate(
  foo = case_when(V01 == 10 ~ -1)
  )

And that produced this:

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    -1
2    11     2     4    NA
3    12     3     6    NA
4    NA    NA     8    NA
5    14     5    10    NA
6    NA    NA    NA    NA

So why is mutate changing values in rows that don't meet the specified condition in the mutate statement?

Note I haven't even gotten to the OR condition yet. For what it's worth, this is the result I want to get:

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    NA
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    NA
5    14     5    10    20
6    NA    NA    NA    NA

I wasn't able to use the dplyr na_if function because it only takes a value to replace with NA, not a condition.

Any guidance would be appreciated. Thanks!

Upvotes: 5

Views: 2495

Answers (6)

GuedesBF
GuedesBF

Reputation: 9858

Main solution

You can use if_any() and is.nato check rowwise for any NAs. if_any() obviates the need for either rowwise() or reduce()/ Reduce(), which makes it ideal for row-wise logical opperations.

library(dplyr)

tbl%>%mutate(foo=replace(foo, if_any(matches("V\\d{2}"), is.na), NA))

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    NA
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    NA
5    14     5    10    20
6    NA    NA    NA    NA

Secondary but possibly very fast solution with map() and reduce():

library(dplyr)
library(purrr)

tbl%>%mutate(foo=ifelse(reduce(map(across(matches("V\\d{2}")), is.na), `|`), NA, foo))

if_any() is a new and very nice and cleaner alternative to the usual approaches to logical rowwise opperations with rowSums(condition) / pmap() +<-c(...) / rowwise()+c_across() / map()+Reduce() / apply(margin=1).

For interesting yet slightly different uses of this approach, see my answers in: R - Remove rows from dataframe that contain only zeros in numeric columns, base R and pipe-friendly methods? and
Find the value of a column over a set of columns in a data frame in R

EDIT

There is an even cleaner solution with if_all():

tbl%>%mutate(foo=foo*if_all(everything()))

Upvotes: 1

Anoushiravan R
Anoushiravan R

Reputation: 21908

We can also use this beside all the very good answers you received:

library(dplyr)
library(purrr)

tbl %>%
  mutate(foo = pmap_dbl(.,  ~ c(c(...)[length(tbl)], NA)[any(is.na(c(...)[-length(tbl)])) + 1]))

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    NA
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    NA
5    14     5    10    20
6    NA    NA    NA    NA

Or this one also can be used:

tbl %>%
  mutate(foo = pmap_dbl(., ~ ifelse(any(is.na(c(...)[-length(tbl)])), NA, ..4)))

Upvotes: 2

steve---g
steve---g

Reputation: 385

Thank you for all these alternative solutions. What works best for me is a variation on the reduce()and map() solution, plugging in the actual list variables to check, rather than doing a match.

tbl <- tbl %>% 
  mutate(
  foo=ifelse(reduce(map(across(c("V01", "V02", "V03")), is.na), `|`), NA, foo)
  )

This is easiest since I'm working with an Election Studies dataset, which has 1300 variables, all starting with "V". Where this problem came from, if anyone is interested, is a byproduct of using the R psych package to build psychometric scales from some sets of variables, such as 4 questions that measure 'anti-govt' attitudes. Psych makes it very easy to create and test the reliability of a scale, but its default behavior is to give the scale variable a value even if all the input variables are NA. Given that the dataset has over 8,000 cases, I can afford the 10% or so decline in cases that results from assigning NA to the scale variable if any of the input variables are NA. A useful variant of this solution makes it easy to assign NA to the scale variable only if all the input variables are NA. Just change the | operator to an &. So this:

tbl <- tbl %>% 
  mutate(
  foo=ifelse(reduce(map(across(c("V01", "V02", "V03")), is.na), `&`), NA, foo)

produces this:

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2     4
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    16
5    14     5    10    20
6    NA    NA    NA    NA
  )

Another smaller side issue is the use of base R ifelse() vs dplyr if_else(). The latter doesn't work here because it requires both alternatives to be of the same type. This post is useful on the differences between the 2 functions.

Upvotes: -1

AnilGoyal
AnilGoyal

Reputation: 26218

Base R strategy

tbl$foo <- tbl$foo * as.logical(rowSums(tbl))

tbl
# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    NA
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    NA
5    14     5    10    20
6    NA    NA    NA    NA

using cur_data()

library(dplyr, warn.conflicts = F)

tbl %>% mutate(foo = pmin(foo, rowSums(cur_data())))

#> # A tibble: 6 x 4
#>     V01   V02   V03   foo
#>   <int> <int> <dbl> <dbl>
#> 1    10    NA     2    NA
#> 2    11     2     4     8
#> 3    12     3     6    12
#> 4    NA    NA     8    NA
#> 5    14     5    10    20
#> 6    NA    NA    NA    NA

Note: Assuming columns don't have negative values


If there are negative values, you may do

tbl %>% mutate(foo = foo * as.logical(rowSums(cur_data())))

# A tibble: 6 x 4
    V01   V02   V03   foo
  <int> <int> <dbl> <dbl>
1    10    NA     2    NA
2    11     2     4     8
3    12     3     6    12
4    NA    NA     8    NA
5    14     5    10    20
6    NA    NA    NA    NA

Upvotes: 2

jpdugo17
jpdugo17

Reputation: 7106

library(tidyverse)

tbl <- tibble(
    V01 = 10:15,
    V02 = 1:6,
    V03 = V02 * 2,
    foo = V03 * 2
)
# sprinkle around some NAs
tbl[1,2] = NA
tbl[4,1] = NA
tbl[4,2] = NA
tbl[6,1:3] = NA

#alternative faster way

criteria <- apply(tbl[, -4], 1, \(.x) .x) %>%
              as.data.frame() %>%
              map_lgl(~ any(is.na(.x))) 

mutate(tbl,foo = ifelse(criteria, NA, foo))
#> # A tibble: 6 x 4
#>     V01   V02   V03   foo
#>   <int> <int> <dbl> <dbl>
#> 1    10    NA     2    NA
#> 2    11     2     4     8
#> 3    12     3     6    12
#> 4    NA    NA     8    NA
#> 5    14     5    10    20
#> 6    NA    NA    NA    NA

tbl %>% rowwise() %>% 
    mutate(foo = ifelse(any(is.na(c_across(matches('[0-9]$')))), NA, foo))
#> # A tibble: 6 x 4
#> # Rowwise: 
#>     V01   V02   V03   foo
#>   <int> <int> <dbl> <dbl>
#> 1    10    NA     2    NA
#> 2    11     2     4     8
#> 3    12     3     6    12
#> 4    NA    NA     8    NA
#> 5    14     5    10    20
#> 6    NA    NA    NA    NA


#if foo has no NA's
tbl %>% rowwise() %>% 
    mutate(foo = ifelse(any(is.na(c_across(everything()))), NA, foo))
#> # A tibble: 6 x 4
#> # Rowwise: 
#>     V01   V02   V03   foo
#>   <int> <int> <dbl> <dbl>
#> 1    10    NA     2    NA
#> 2    11     2     4     8
#> 3    12     3     6    12
#> 4    NA    NA     8    NA
#> 5    14     5    10    20
#> 6    NA    NA    NA    NA

Created on 2021-06-16 by the reprex package (v2.0.0)

Created on 2021-06-16 by the reprex package (v2.0.0)

Foo column can also be created while looking for NA's

library(tidyverse)
#no foo column
tbl_no_foo <- tibble(
    V01 = 10:15,
    V02 = 1:6,
    V03 = V02 * 2
)
# sprinkle around some NAs
tbl_no_foo[1,2] = NA
tbl_no_foo[4,1] = NA
tbl_no_foo[4,2] = NA
tbl_no_foo[6,1:3] = NA


tbl_no_foo %>% rowwise() %>% 
    mutate(foo = ifelse(any(is.na(c_across(matches('[0-9]$')))), NA, V03 * 2))
#> # A tibble: 6 x 4
#> # Rowwise: 
#>     V01   V02   V03   foo
#>   <int> <int> <dbl> <dbl>
#> 1    10    NA     2    NA
#> 2    11     2     4     8
#> 3    12     3     6    12
#> 4    NA    NA     8    NA
#> 5    14     5    10    20
#> 6    NA    NA    NA    NA

Created on 2021-06-16 by the reprex package (v2.0.0)

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388797

Here are couple of options -

  1. Using rowSums -
library(dplyr)
tbl %>%
  mutate(foo = replace(foo, rowSums(is.na(select(., starts_with('V')))) > 0, NA))

#    V01   V02   V03   foo
#  <int> <int> <dbl> <dbl>
#1    10    NA     2    NA
#2    11     2     4     8
#3    12     3     6    12
#4    NA    NA     8    NA
#5    14     5    10    20
#6    NA    NA    NA    NA
  1. Using rowwise -
tbl %>%
  rowwise() %>%
  mutate(foo = if(any(is.na(c_across(starts_with('V'))))) NA else foo)

Both of this replace foo with NA if there is a NA value in any of the columns starting with V.


As far as this code is concerned

tbl <- tbl %>% 
  mutate(foo = case_when(V01 == 10 ~ -1))

case_when by default returns NA if no condition is satisfied. To get the same value back you can do

tbl %>% 
  mutate(foo = case_when(V01 == 10 ~ -1, 
                        TRUE ~ foo))

Upvotes: 1

Related Questions