Reputation: 1507
I'm trying to move all occurrences of a particular pattern to the beginning of a string. For example, if the pattern is 'pat', then I'd like my regex substitution to convert
'a pat b pat c pat d'
to
'pat pat pat a b c d'
I could achieve this by repeatedly applying
string <- gsub(x=string,pattern='(.*)(pat )(.*)',replacement='\\2\\1\\3')
to my initial string value, but this requires repeating the substitution an arbitrary number of times, since I do not know how many times the pattern occurs in the string. I also cannot simply take a brute-force approach, such as applying the substitution as many times as the length of the string, since I am working with extremely long vectors of strings of varying length and applying vectorized substitutions.
So, is there any way to achieve this with a single regular expression?
EDIT
The consensus appears to be that this cannot be done with a single regex/gsub expression. I should explain why I need exactly this, even though other solutions would do in more restricted cases:
I am working with a large dataset (millions of rows) containing a string field on which I wish to perform cleaning rules. These rules consist of a list of regex replacements specified in a separate file; there are a few hundred of these. The cleaning process proceeds by looping over the regex rules and applying each to the entire string column through the vectorized version of gsub.
For some of these rules, but not all, I would like to identify all instances of a specific pattern, then move all such instances to the beginning of the string. The specified pattern will change from one rule to the next, so no solution that leverages the particulars of the sought pattern is tenable.
It's looking like I can't achieve my goal without some serious restructuring of the process, unless anyone has any clever ideas ...
Upvotes: 2
Views: 545
Reputation: 2318
This is not a single regex expression, but you can also try the stringr package, since its functions are vectorised over both string and pattern.
library(stringr)
my_str <- 'a pat b pat c pat d'
my_pat <- c("pat")
# Capture the specified pattern
s1 <- unlist(lapply(str_extract_all(my_str, my_pat), FUN = function(x) paste(x, collapse = " ")))
# Remove the captured patterns from the string
s2 <- str_remove_all(my_str, my_pat)
# Prepend the captured matches to the remainder of the string
str_c(s1, s2, sep = " ")
[1] "pat pat pat a b c d"
Still works on string and pattern vectors:
library(stringr)
my_str <- c('a pat b pat c pat d', 'x pet y pet zz pet')
my_pat <- c("pat", 'pet')
# Capture the specified pattern
s1 <- unlist(lapply(str_extract_all(my_str, my_pat), FUN = function(x) paste(x, collapse = " ")))
# Remove the captured patterns from the string
s2 <- str_remove_all(my_str, my_pat)
# Prepend the captured matches to the remainder of the string
str_c(s1, s2, sep = " ")
[1] "pat pat pat a b c d" "pet pet pet x y zz "
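The steps above could be wrapped in a single function so that each cleaning rule becomes one call. This is only a sketch: the name move_to_front is hypothetical, and it assumes that collapsing repeated whitespace with str_squish is acceptable for the data, which also changes any spacing that was already irregular in the input.

```r
library(stringr)

# Hypothetical helper: move every match of `pattern` to the front of each string.
# Vectorised over both `string` and `pattern`, like the stringr calls it wraps.
move_to_front <- function(string, pattern) {
  # One character vector of matches per input string
  matches <- str_extract_all(string, pattern)
  # Join the matches; paste() gives "" when a string has no match
  front <- vapply(matches, paste, character(1), collapse = " ")
  # Remove the matches from the original strings
  rest <- str_remove_all(string, pattern)
  # Assumption: collapsing whitespace left behind by the removal is acceptable
  str_squish(str_c(front, rest, sep = " "))
}

move_to_front(c("a pat b pat c pat d", "x pet y pet zz pet", "no match"),
              c("pat", "pet", "pat"))
```

Strings that contain no match come back unchanged apart from the whitespace normalisation.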
Upvotes: 0
Reputation: 269854
Assuming the pattern is a fixed string (which is the case in the example in the question), compute the number of times the pattern occurs and use strrep to create that many repetitions of the pattern, then prepend them to the original string with the pattern removed:
pat <- "pat"
pats <- paste0(" *", pat, " *")
paste0(strrep(paste0(pat, " "), lengths(gregexpr(pats, x))), gsub(pats, " ", x))
## [1] "pat pat pat a b c d" "pat pat pat a b c d"
If the pattern is not a fixed string, then extract the matches and paste them before the original string with the matches removed.
library(gsubfn)
paste(sapply(strapply(x, pat), paste, collapse = " "), gsub(pats, " ", x))
## [1] "pat pat pat a b c d" "pat pat pat a b c d"
The input x used above is a character vector:
x <- 'a pat b pat c pat d'
x <- c(x, x)
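For a pattern that is not a fixed string, the same extract-and-prepend idea can also be sketched in base R alone with gregexpr and regmatches. The function name move_matches_front is hypothetical, the pattern is assumed to be a single regex applied to the whole vector (as in the rule-by-rule loop described in the question), and collapsing runs of whitespace in the remainder is an assumption about what the cleaned output should look like.

```r
# Hypothetical helper, base R only: one regex `pattern`, a character vector `x`.
move_matches_front <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  # Matches for each element, joined by a space ("" when there is no match)
  front <- vapply(regmatches(x, m), paste, character(1), collapse = " ")
  # Text between the matches, glued back together
  rest <- vapply(regmatches(x, m, invert = TRUE), paste, character(1),
                 collapse = "")
  # Assumption: collapse the whitespace left behind by the removed matches
  rest <- trimws(gsub("[[:space:]]+", " ", rest))
  trimws(paste(front, rest))
}

move_matches_front(c("a pat b pat c pat d", "nothing here"), "pat")
```

Because the matches are taken from regmatches rather than counted, an element with no match is returned as is (after whitespace normalisation) instead of gaining a spurious copy of the pattern.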
Upvotes: 1
Reputation: 3183
You can try something very naive, as below:
s <- 'a pat b pat c pat d'
s <- unlist(strsplit(s, " "))
stringtomatch <- "pat"
paste(c(s[grepl(stringtomatch, s)], s[!grepl(stringtomatch, s)]), collapse = " ")
[1] "pat pat pat a b c d"
or look at regex for advanced use cases
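The split-and-reorder idea above handles one string at a time; a possible way to apply it to a whole character vector is sketched below. The name move_tokens_front is hypothetical, and the sketch assumes tokens are separated by single spaces. Note that it moves every token that contains the pattern, not just the matched text itself.

```r
# Hypothetical helper: reorder space-separated tokens so that tokens
# matching `pattern` come first, for each element of `x`.
move_tokens_front <- function(x, pattern) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    hit <- grepl(pattern, tokens)
    # Matching tokens first, then the rest, both in their original order
    paste(c(tokens[hit], tokens[!hit]), collapse = " ")
  }, character(1))
}

move_tokens_front(c("a pat b pat c pat d", "x y"), "pat")
```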
Upvotes: 1