Statistican
Statistican

Reputation: 33

Search across multiple columns with a regular expression and extract the match

I have the following data structure and I want to check the presence of a specific string ('^881') across the columns var1 to var4. And to write the result of the match in a new column (RESULT), or an empty result, if the string is not found.

var1 var2 var3 var4 EXPECTED RESULT
8810 1234 8810
1234 2345 3456
1234 8812 2345 3456 8812

I have a solution that gives me the presence of the regex-string in a 0/1-format: (Maybe there is a more concise option?!)

library(tidyverse)

df <-
  tibble(
     var1 = c('8810', '1234', '1234'),
     var2 = c('1234', '2345', '8812'),
     var3 = c('', '3456', '2345'),
     var4 = c('', '', '3456')
  )

# create a new column, that indicates whether the string is present in the row (1) or not (0)
df %>% 
  mutate(FOUND = if_else( 
    across( starts_with('var'), 
            ~ str_detect(.x, '^881') 
    ) %>% 
    rowSums(., na.rm=T) > 0, 1, 0))

But I need the result of the match, that is the whole string. After some puzzling, I have found the following solution, but it has a decisive drawback. It is quite slow with large datasets (10m+ cases) and about 100 variables to search. Furthermore, it seems a bit cumbersome to me.

df %>% 
  mutate(across(starts_with('var'),
                ~ str_match(.x, '(^881.+)')[,1]  # gives a tibble
  ) %>%   
  mutate(across(everything(),    
                stringi::stri_na2empty)   # replace the NAs with empty strings
  ) %>%  
  unite('RESULT', sep = "")) # make a single column

My question: Is there a more concise way to perform this data manipulation (preferably with dplyr)?

Upvotes: 2

Views: 1578

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 101373

A base R option

inds <- grepl("^881", as.matrix(df))
df$EXPECTED_RESULT <- replace(
  rep("", nrow(df)),
  rowSums(`dim<-`(inds, dim(df))) > 0,
  as.matrix(df)[inds]
)

gives

> df
# A tibble: 3 x 5
  var1  var2  var3   var4   EXPECTED_RESULT
  <chr> <chr> <chr>  <chr>  <chr>
1 8810  1234  ""     ""     "8810"
2 1234  2345  "3456" ""     ""
3 1234  8812  "2345" "3456" "8812"

Upvotes: 0

akrun
akrun

Reputation: 887118

Using rowwise

library(dplyr)
library(stringr)
df %>%
    rowwise %>%
    mutate(result = str_subset(c_across(everything()), "^881")[1]) %>%
    ungroup
# A tibble: 3 x 5
  var1  var2  var3   var4   result
  <chr> <chr> <chr>  <chr>  <chr> 
1 8810  1234  ""     ""     8810  
2 1234  2345  "3456" ""     <NA>  
3 1234  8812  "2345" "3456" 8812  

Upvotes: 0

tmfmnk
tmfmnk

Reputation: 39858

One option with the addition of purrr could be:

df %>%
 mutate(result = pmap_chr(across(everything()), ~ toString(c(...)[str_detect(c(...), "^881")])))

  var1  var2  var3   var4   result
  <chr> <chr> <chr>  <chr>  <chr> 
1 8810  1234  ""     ""     "8810"
2 1234  2345  "3456" ""     ""    
3 1234  8812  "2345" "3456" "8812"

Some additional performance improvement could be achieved by replacing the regex by exact matching:

df %>%
 mutate(result = pmap_chr(across(everything()), ~ toString(c(...)[str_sub(c(...), 1, 3) == "881"])))

Upvotes: 1

Related Questions