Reputation: 33
I have the following data structure and I want to check the presence of a specific string ('^881') across the columns var1 to var4. And to write the result of the match in a new column (RESULT), or an empty result, if the string is not found.
var1 | var2 | var3 | var4 | EXPECTED RESULT |
---|---|---|---|---|
8810 | 1234 | 8810 | ||
1234 | 2345 | 3456 | ||
1234 | 8812 | 2345 | 3456 | 8812 |
I have a solution that gives me the presence of the regex-string in a 0/1-format: (Maybe there is a more concise option?!)
library(tidyverse)
df <-
tibble(
var1 = c('8810', '1234', '1234'),
var2 = c('1234', '2345', '8812'),
var3 = c('', '3456', '2345'),
var4 = c('', '', '3456')
)
# create a new column, that indicates whether the string is present in the row (1) or not (0)
df %>%
mutate(FOUND = if_else(
across( starts_with('var'),
~ str_detect(.x, '^881')
) %>%
rowSums(., na.rm=T) > 0, 1, 0))
But I need the result of the match, that is the whole string. After some puzzling, I have found the following solution, but it has a decisive drawback. It is quite slow with large datasets (10m+ cases) and about 100 variables to search. Furthermore, it seems a bit cumbersome to me.
df %>%
mutate(across(starts_with('var'),
~ str_match(.x, '(^881.+)')[,1] # gives a tibble
) %>%
mutate(across(everything(),
stringi::stri_na2empty) # replace the NAs with empty strings
) %>%
unite('RESULT', sep = "")) # make a single column
My question: Is there a more concise way to perform this data manipulation (preferably with dplyr
)?
Upvotes: 2
Views: 1578
Reputation: 101373
A base R option
inds <- grepl("^881", as.matrix(df))
df$EXPECTED_RESULT <- replace(
rep("", nrow(df)),
rowSums(`dim<-`(inds, dim(df))) > 0,
as.matrix(df)[inds]
)
gives
> df
# A tibble: 3 x 5
var1 var2 var3 var4 EXPECTED_RESULT
<chr> <chr> <chr> <chr> <chr>
1 8810 1234 "" "" "8810"
2 1234 2345 "3456" "" ""
3 1234 8812 "2345" "3456" "8812"
Upvotes: 0
Reputation: 887118
Using rowwise
library(dplyr)
library(stringr)
df %>%
rowwise %>%
mutate(result = str_subset(c_across(everything()), "^881")[1]) %>%
ungroup
# A tibble: 3 x 5
var1 var2 var3 var4 result
<chr> <chr> <chr> <chr> <chr>
1 8810 1234 "" "" 8810
2 1234 2345 "3456" "" <NA>
3 1234 8812 "2345" "3456" 8812
Upvotes: 0
Reputation: 39858
One option with the addition of purrr
could be:
df %>%
mutate(result = pmap_chr(across(everything()), ~ toString(c(...)[str_detect(c(...), "^881")])))
var1 var2 var3 var4 result
<chr> <chr> <chr> <chr> <chr>
1 8810 1234 "" "" "8810"
2 1234 2345 "3456" "" ""
3 1234 8812 "2345" "3456" "8812"
Some additional performance improvement could be achieved by replacing the regex by exact matching:
df %>%
mutate(result = pmap_chr(across(everything()), ~ toString(c(...)[str_sub(c(...), 1, 3) == "881"])))
Upvotes: 1