bobbel

Reputation: 2031

How can I extract these multiple regex groups in R?

I have string inputs in the following format:

my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20", "FACT1text with spaces:FACT20", "FACT14:FACT20", "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

I would like to extract every "FACT" together with the first digit that follows it. So the result from this example would be:

c("FACT1", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2 FACT3")

Alternatively, the result could be a list, where each element of the list is a vector with 1 up to 3 items.

What I got so far is:

gsub("(FACT[1-3]).*?:(FACT[1-3]).*", '\\1 \\2', my.strings)
# [1] "FACT11"       "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 "
# [7] "FACT1 FACT2 " "FACT1 FACT2 "

It kinda looks good, except for the "FACT11" for the first element instead of "FACT1" (dropping the second "1"), and missing the "FACT3" for the last element of my.strings. But adding another group to gsub somehow messes the whole thing up.

gsub("(FACT[1-3]).*?:(FACT[1-3]).*?:(FACT[1-3]).*?", '\\1 \\2 \\3', my.strings)
# [1] "FACT11"                       "FACT11:FACT20"                "FACT1sometext:FACT20"        
# [4] "FACT1text with spaces:FACT20" "FACT14:FACT20"                "FACT1textAnd1312:FACT2etc"   
# [7] "FACT12:FACT21"                "FACT1 FACT2 FACT31" 

So how can I properly extract the groups?

Upvotes: 4

Views: 171

Answers (3)

s_baldur

Reputation: 33488

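Another alternative using base R functions, chained with the magrittr pipe (%>%).

This solution relies on the fact that only the first digit after each FACT is wanted: it cuts every run of digits down to its first digit, appends ":", splits on ":", and keeps the pieces that contain "FACT":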

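library(magrittr)  # provides the %>% pipe, which is not part of base R

my.strings %>%
  gsub("(\\d)\\d*", "\\1:", ., perl = TRUE) %>%  # keep the first digit of each run, then add ":"
  strsplit(":") %>%                              # split into pieces at every ":"
  sapply(function(x) paste(x[grepl("FACT", x)], collapse = " "))  # keep and join the FACT pieces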

[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
[5] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2 FACT3"
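To make the steps above easier to follow, here is a minimal sketch of the same idea without the pipe, with each intermediate result stored in its own variable. The names step1, step2 and result are only illustrative, and the sketch assumes that "FACT" is always followed directly by a digit, as in the example data.

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

# Step 1: cut every run of digits down to its first digit and append ":"
step1 <- gsub("(\\d)\\d*", "\\1:", my.strings, perl = TRUE)
step1[6]
# [1] "FACT1:textAnd1::FACT2:etc"

# Step 2: split at every ":" so that each "FACT<digit>" is its own piece
step2 <- strsplit(step1, ":", fixed = TRUE)
step2[[6]]
# [1] "FACT1"    "textAnd1" ""         "FACT2"    "etc"

# Step 3: keep only the pieces containing "FACT" and join them with a space
result <- vapply(step2,
                 function(x) paste(x[grepl("FACT", x, fixed = TRUE)], collapse = " "),
                 character(1))
result[6]
# [1] "FACT1 FACT2"
```

vapply is used here instead of sapply only so that the return type is guaranteed to be a character vector; for this input, sapply gives the same result.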

Upvotes: 0

akrun

Reputation: 887148

An option would be str_extract_all from stringr, which extracts every 'FACT' substring followed by one digit from 1 to 3 ([1-3]) into a list of vectors. Then, map over the list elements and paste each vector into a single string:

library(tidyverse)
str_extract_all(my.strings, "FACT[1-3]") %>%
            map_chr(paste, collapse= ' ')
#[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"      
#[4] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
#[7] "FACT1 FACT2 FACT3"

Or using gsub from base R:

gsub("\\s{2,}", " ", trimws(gsub("(FACT[1-3])(*SKIP)(*FAIL)|.",
                       " ", my.strings, perl = TRUE)))
#[1] "FACT1"             "FACT1 FACT2"       "FACT1 FACT2"      
#[4] "FACT1 FACT2"       "FACT1 FACT2"       "FACT1 FACT2"      
#[7] "FACT1 FACT2 FACT3"
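The gsub version may be easier to read when split into its two passes. The following is a rough sketch on a single input string; the variable names x, blanked and cleaned are illustrative only. The idea is that (*SKIP)(*FAIL) makes the regex engine step over anything matched by FACT[1-3] without replacing it, so the alternative "." only ever matches the characters in between.

```r
x <- "FACT1textAnd1312:FACT2etc"

# First pass: every character that is not part of a "FACT[1-3]" token
# is replaced by a single space (requires perl = TRUE for the PCRE verbs)
blanked <- gsub("(FACT[1-3])(*SKIP)(*FAIL)|.", " ", x, perl = TRUE)

# Each replaced character becomes exactly one space, so the length is unchanged
nchar(blanked) == nchar(x)
# [1] TRUE

# Second pass: drop leading/trailing spaces, then squeeze runs of spaces to one
cleaned <- gsub("\\s{2,}", " ", trimws(blanked))
cleaned
# [1] "FACT1 FACT2"
```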

Upvotes: 4

Wiktor Stribiżew

Reputation: 626870

You may use a base R approach, too:

> m <- regmatches(my.strings, gregexpr("FACT[1-3]", my.strings))
> sapply(m, paste, collapse=" ")
[1] "FACT1"            
[2] "FACT1 FACT2"      
[3] "FACT1 FACT2"      
[4] "FACT1 FACT2"      
[5] "FACT1 FACT2"      
[6] "FACT1 FACT2"      
[7] "FACT1 FACT2 FACT3"

Extract all matches with your FACT[1-3] (or FACT[0-9], or FACT\\d) pattern, and then "join" them with a space.
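If the list form mentioned in the question is preferred, the pasting step can simply be left out, since regmatches already returns one character vector per input string. A small sketch, here using the FACT[0-9] variant of the pattern (which gives the same matches as FACT[1-3] on this particular data):

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

# gregexpr finds every match position; regmatches pulls out the matched text
m <- regmatches(my.strings, gregexpr("FACT[0-9]", my.strings))

# m is a list with one character vector per input string
m[[7]]
# [1] "FACT1" "FACT2" "FACT3"

# Number of matches found in each input string
lengths(m)
# [1] 1 2 2 2 2 2 3
```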

Upvotes: 5
