Reputation: 2031
I have string inputs in the following format:
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20", "FACT1text with spaces:FACT20", "FACT14:FACT20", "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")
I would like to extract all the "FACT"s and the first number following FACT. So the result from this example would be:
c("FACT1", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2", "FACT1 FACT2 FACT3")
Alternatively, the result could be a list, where each element of the list is a vector with 1 up to 3 items.
What I got so far is:
gsub("(FACT[1-3]).*?:(FACT[1-3]).*", '\\1 \\2', my.strings)
# [1] "FACT11" "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 " "FACT1 FACT2 "
# [7] "FACT1 FACT2 " "FACT1 FACT2 "
It looks mostly fine, except that the first element is "FACT11" instead of "FACT1" (the second "1" should be dropped), and "FACT3" is missing for the last element of my.strings. But adding another group to gsub somehow messes the whole thing up.
gsub("(FACT[1-3]).*?:(FACT[1-3]).*?:(FACT[1-3]).*?", '\\1 \\2 \\3', my.strings)
# [1] "FACT11" "FACT11:FACT20" "FACT1sometext:FACT20"
# [4] "FACT1text with spaces:FACT20" "FACT14:FACT20" "FACT1textAnd1312:FACT2etc"
# [7] "FACT12:FACT21" "FACT1 FACT2 FACT31"
So how can I properly extract the groups?
Upvotes: 4
Views: 171
Reputation: 33488
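Another alternative built on base R string functions (the %>% pipe itself comes from magrittr):
This solution relies on the fact that each FACT is followed by a one-digit number.
library(magrittr)  # provides the %>% pipe
my.strings %>%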
gsub("(\\d)\\d*", "\\1:", ., perl = TRUE) %>%
strsplit(":") %>%
sapply(function(x) paste(x[grepl("FACT", x)], collapse = " "))
[1] "FACT1" "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2"
[5] "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2 FACT3"
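To make the steps of the pipeline above easier to follow, here is a minimal sketch that runs them one at a time without the pipe. The two-element vector x and the names step1, step2 and res are only illustrative and are not part of the original answer.

```r
# Two of the inputs from the question, used for illustration
x <- c("FACT11:FACT20", "FACT1textAnd1312:FACT2etc")

# Keep the first digit of every digit run and put a ":" after it
step1 <- gsub("(\\d)\\d*", "\\1:", x, perl = TRUE)
# step1: "FACT1::FACT2:"  "FACT1:textAnd1::FACT2:etc"

# Split on ":" so that each FACT plus its first digit becomes its own piece
step2 <- strsplit(step1, ":")
# step2[[2]]: "FACT1" "textAnd1" "" "FACT2" "etc"

# Keep only the pieces containing "FACT" and join them with a space
res <- sapply(step2, function(p) paste(p[grepl("FACT", p)], collapse = " "))
# res: "FACT1 FACT2" "FACT1 FACT2"
```

Note that this keeps any piece containing "FACT", so it assumes FACT is always followed by a digit in the input.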
Upvotes: 0
Reputation: 887148
An option would be str_extract_all from stringr to extract every 'FACT' substring followed by a single digit from 1 to 3 ([1-3]) into a list of vectors. Then map through the list elements and paste each vector into a single string:
library(tidyverse)
str_extract_all(my.strings, "FACT[1-3]") %>%
map_chr(paste, collapse= ' ')
#[1] "FACT1" "FACT1 FACT2" "FACT1 FACT2"
#[4] "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2"
#[7] "FACT1 FACT2 FACT3"
Or using gsub from base R:
gsub("\\s{2,}", " ", trimws(gsub("(FACT[1-3])(*SKIP)(*FAIL)|.",
" ", my.strings, perl = TRUE)))
#[1] "FACT1" "FACT1 FACT2" "FACT1 FACT2"
#[4] "FACT1 FACT2" "FACT1 FACT2" "FACT1 FACT2"
#[7] "FACT1 FACT2 FACT3"
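As a rough illustration of what the nested gsub calls above do, the sketch below splits them into separate steps for a single input. With perl = TRUE, the (*SKIP)(*FAIL) verbs make the regex engine step over each FACT[1-3] match without replacing it, so only the remaining characters are turned into spaces. The names x, blanked and res are hypothetical.

```r
x <- "FACT11:FACT20"

# Every character that is not part of a FACT[1-3] match becomes a space
blanked <- gsub("(FACT[1-3])(*SKIP)(*FAIL)|.", " ", x, perl = TRUE)
# blanked: "FACT1  FACT2 "

# Trim the ends, then collapse runs of two or more spaces into one
res <- gsub("\\s{2,}", " ", trimws(blanked))
# res: "FACT1 FACT2"
```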
Upvotes: 4
Reputation: 626870
You may use a base R approach, too:
> m <- regmatches(my.strings, gregexpr("FACT[1-3]", my.strings))
> sapply(m, paste, collapse=" ")
[1] "FACT1"
[2] "FACT1 FACT2"
[3] "FACT1 FACT2"
[4] "FACT1 FACT2"
[5] "FACT1 FACT2"
[6] "FACT1 FACT2"
[7] "FACT1 FACT2 FACT3"
Extract all matches with your FACT[1-3] (or FACT[0-9], or FACT\\d) pattern, and then "join" them with a space.
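Since the question also allows a list as the result, note that the object returned by regmatches is already such a list. A minimal self-contained sketch, here using the FACT[0-9] variant of the pattern (the names m and res are only illustrative):

```r
my.strings <- c("FACT11", "FACT11:FACT20", "FACT1sometext:FACT20",
                "FACT1text with spaces:FACT20", "FACT14:FACT20",
                "FACT1textAnd1312:FACT2etc", "FACT12:FACT22:FACT31")

# gregexpr() locates every match in each string;
# regmatches() returns the matched substrings as a list
m <- regmatches(my.strings, gregexpr("FACT[0-9]", my.strings))

# List form: one character vector per input string
m[[7]]
# "FACT1" "FACT2" "FACT3"

# Space-separated form; vapply() guarantees a character vector result
res <- vapply(m, paste, character(1), collapse = " ")
```

vapply is used here instead of sapply only so that the result type is fixed; for this input both should give the same character vector.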
Upvotes: 5