hklovs
hklovs

Reputation: 642

replace values with NA in several columns

I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.

Data:

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
   1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
   2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
   3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
   4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
   5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
  )

Desired output:

A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA  

Is there an elegant and smart way to this? mutate_all?

Upvotes: 7

Views: 2318

Answers (6)

Alan G&#243;mez
Alan G&#243;mez

Reputation: 378

Also a Base R solution could be simpler:

DF$VAR1DATE[grepl("(^[^AA|^DD].*$)", DF$VAR1)] <- NA
DF$VAR1[grepl("(^[^AA|^DD].*$)", DF$VAR1)] <- NA
DF$VAR2DATE[grepl("(^[^AA|^DD].*$)", DF$VAR2)] <- NA
DF$VAR2[grepl("(^[^AA|^DD].*$)", DF$VAR2)] <- NA

or with a loop:

DF <- as.data.frame(DF) 
for (i in 1:4) {
        DF[,2*(i%%2)+i][grepl("(^[^AA|^DD].*$)", DF[,i+(i%%2)])] <- NA
    }

OUTPUT:

DF
# A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA 

Upvotes: 0

PaulS
PaulS

Reputation: 25313

Another possibility, using tidyr::unite and tiydr::separate:

library(tidyverse)

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
  1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
  2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
  3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
  4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
  5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)

DF %>% 
  unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>% 
  mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>% 
  separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>% 
  separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")

#> # A tibble: 5 × 5
#>      ID VAR1  VAR1DATE   VAR2  VAR2DATE  
#>   <int> <chr> <chr>      <chr> <chr>     
#> 1     1 AABB  2001-01-01 <NA>  <NA>      
#> 2     2 AACC  2001-01-02 AACC  2001-01-02
#> 3     3 <NA>  <NA>       DDCC  2001-01-03
#> 4     4 DDAA  2001-01-04 <NA>  <NA>      
#> 5     5 <NA>  <NA>       <NA>  <NA>

Upvotes: 2

AndrewGB
AndrewGB

Reputation: 16836

Here is another tidyverse solution with using str_detect to determine where to convert to NA for the date columns. Then, we can use the same type of function on VAR1 and VAR2.

library(tidyverse)

DF %>%
  rowwise %>%
  mutate(
    VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
    VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
  ) %>%
  mutate(across(c(VAR1, VAR2), function(x)
    ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))

Output

# A tibble: 5 × 5
# Rowwise: 
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA      

Upvotes: 1

TarJae
TarJae

Reputation: 78907

Here is a tidyverse solution. Using across with str_replace_all and appending two ifelse statements.

library(dplyr)
library(stringr)

DF %>% 
  mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>% 
  mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
         VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA 

Upvotes: 3

akrun
akrun

Reputation: 886938

We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d+) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns

library(dplyr)
library(stringr)
DF %>%
    mutate(across(matches("^VAR\\d+$"),
        ~ replace(., !substr(., 1, 2)  %in% c("AA", "DD"), NA)), 
      across(ends_with("DATE"), 
     ~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))

-output

# A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 <NA>  <NA>      
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 <NA>  <NA>       DDCC  2001-01-03
4     4 DDAA  2001-01-04 <NA>  <NA>      
5     5 <NA>  <NA>       <NA>  <NA>      

Upvotes: 5

Rui Barradas
Rui Barradas

Reputation: 76402

Use mutate/across with the assignment function is.na<-.

DF %>%
  mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
#     ID VAR1  VAR1DATE VAR2  VAR2DATE
#  <int> <chr> <chr>    <chr> <chr>   
#1     1 AABB  NA       NA    NA      
#2     2 AACC  NA       AACC  NA      
#3     3 NA    NA       DDCC  NA      
#4     4 DDAA  NA       NA    NA      
#5     5 NA    NA       NA    NA      

Or simpler:

DF %>%
  mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))

Upvotes: 3

Related Questions