Rookatu

Reputation: 1507

Regex for moving all instances of a substring to the beginning

I'm trying to move all occurrences of a particular pattern to the beginning of a string. For example, if the pattern is 'pat' then I'd like my regex substitution to convert

'a pat b pat c pat d'

to

'pat pat pat a b c d'

I could achieve this by repeatedly applying

string <- gsub(x=string,pattern='(.*)(pat )(.*)',replacement='\\2\\1\\3')

to my initial string value, but this requires looping over the string an arbitrary number of times since I do not know how many times to expect the pattern to occur in the string. I also cannot simply take a greedy approach, such as applying the substitution as many times as the length of the string, since I am working with extremely long vectors of strings of varying length and applying vector substitutions.
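For reference, the repeated application described above can be sketched as a loop that stops as soon as a pass changes nothing, so the number of passes does not need to be known in advance. This is only a sketch: the function name `move_by_iteration` and the `max_iter` safety cap are hypothetical, and it assumes every match is the same literal text followed by a space (with matches that differ from one another, the substitution may never settle).

```r
# Hypothetical helper: reapply the substitution until the vector stops changing.
# Assumes `pat` matches one literal text (e.g. "pat"), as in the example.
move_by_iteration <- function(x, pat, max_iter = 1000L) {
  pattern <- paste0("(.*)(", pat, " )(.*)")
  for (i in seq_len(max_iter)) {
    updated <- gsub(pattern, "\\2\\1\\3", x)
    if (identical(updated, x)) break  # fixed point reached: nothing moved
    x <- updated
  }
  x
}

strings <- c("a pat b pat c pat d", "a b c d", "x pat y")
moved <- move_by_iteration(strings, "pat")
print(moved)
```

Each pass still scans the whole vector, so the cost grows with the largest number of occurrences in any single string.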

So, is there any way to achieve this functionality with a single regular expression?

EDIT

The consensus appears to be that this cannot be done with a single regex/gsub expression. I should provide more details on why exactly this is what is needed for me when other solutions would do in more restricted cases:

I am working with a large dataset (millions of rows) containing a string field on which I wish to perform cleaning rules. These rules consist of a list of regex replacements specified in a separate file; there are a few hundred of these. The cleaning process proceeds by looping over the regex rules and applying each to the entire string column through the vectorized version of gsub.

For some of these rules, but not all, I would like to identify all instances of a specific pattern, then move all such instances to the beginning of the string. The specified pattern will change from one rule to the next, so no solution which leverages the particulars of the sought pattern is tenable.

It's looking like I can't achieve my goal without some serious restructuring of the process, unless anyone has any clever ideas ...

Upvotes: 2

Views: 545

Answers (3)

MKa

Reputation: 2318

This is not a single regex expression, but you can also try the stringr package, as its functions are vectorised over both string and pattern.

library(stringr)
my_str <- 'a pat b pat c pat d'
my_pat <- c("pat")

# Capture the specified pattern
s1 <- unlist(lapply(str_extract_all(my_str, my_pat), FUN = function(x) paste(x, collapse = " ")))

# Remove the captured patterns from the string
s2 <- str_remove_all(my_str, my_pat)

# Prepend the captured matches to the remainder of the string
str_c(s1, s2, sep = " ")
[1] "pat pat pat a  b  c  d"

Still works on string and pattern vectors:

library(stringr)
my_str <- c('a pat b pat c pat d', 'x pet y pet zz pet')
my_pat <- c("pat", 'pet')

# Capture the specified pattern
s1 <- unlist(lapply(str_extract_all(my_str, my_pat), FUN = function(x) paste(x, collapse = " ")))

# Remove the captured patterns from the string
s2 <- str_remove_all(my_str, my_pat)

# Prepend the captured matches to the remainder of the string
str_c(s1, s2, sep = " ")
[1] "pat pat pat a  b  c  d" "pet pet pet x  y  zz " 
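The output above keeps the spaces that surrounded each removed match, so it contains double spaces and a trailing space. If single spacing is wanted, one possible cleanup in base R is sketched below. It assumes that collapsing every run of spaces is acceptable for the data, which may not hold if the original strings contain meaningful repeated spaces.

```r
# Output copied from the stringr example above
out <- c("pat pat pat a  b  c  d", "pet pet pet x  y  zz ")

# Collapse runs of spaces to a single space, then trim both ends
clean <- trimws(gsub(" +", " ", out))
print(clean)
```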

Upvotes: 0

G. Grothendieck

Reputation: 269854

Fixed string

Assuming the pattern is a fixed string (as in the question's example), count how many times it occurs, use strrep to build that many repetitions of it, and prepend them to the original string with the pattern removed:

pat <- "pat"
pats <- paste0(" *", pat, " *")

paste0(strrep(paste0(pat, " "), lengths(gregexpr(pats, x))), gsub(pats, " ", x))
## [1] "pat pat pat a b c d" "pat pat pat a b c d"
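One caveat with counting via `lengths(gregexpr(pats, x))`: when a string contains no match, `gregexpr` returns a single `-1`, so the length is 1 and one copy of the pattern would be prepended to a string that never contained it. A hedged sketch of a count that handles this case follows. The helper name `count_matches` is hypothetical, and the second input string is made up for illustration.

```r
# Hypothetical helper: count real matches, treating gregexpr's -1 as zero
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(hit) sum(hit > 0), integer(1))
}

x2 <- c("a pat b pat c pat d", "a b c d")  # second string has no match
pat <- "pat"
pats <- paste0(" *", pat, " *")

n <- count_matches(pats, x2)
result <- paste0(strrep(paste0(pat, " "), n), gsub(pats, " ", x2))
print(result)
```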

General pattern

If the pattern is not a fixed string, extract the matches and paste them before the original string with the matches removed (pat and pats are as defined above).

library(gsubfn)
paste(sapply(strapply(x, pat), paste, collapse = " "), gsub(pats, " ", x))
## [1] "pat pat pat a b c d" "pat pat pat a b c d"
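If adding a dependency on gsubfn is not desirable, the same extract-then-prepend idea can be sketched in base R with `gregexpr` and `regmatches`. This swaps in a different extraction technique from the one used above. The function name `move_matches_to_front` is hypothetical, and the sketch assumes that collapsing runs of spaces in the result is acceptable.

```r
# Hypothetical helper: move every match of `pattern` to the front of each string
move_matches_to_front <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))   # list of matches per string
  front <- vapply(hits, paste, character(1), collapse = " ") # "" when nothing matched
  rest <- gsub(pattern, "", x, perl = TRUE)                  # string with matches removed
  # Join, then normalise the spacing left behind by the removed matches
  trimws(gsub(" +", " ", paste(front, rest)))
}

inputs <- c("a pat b pat c pat d", "x p1t y p2t zz", "no match")
moved_general <- move_matches_to_front(inputs, "p[a-z0-9]t")
print(moved_general)
```

Because the matches are extracted rather than counted, this also works when each occurrence matches different text.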

Note

The input x used above is a character vector:

x <- 'a pat b pat c pat d'
x <- c(x, x)

Upvotes: 1

Sonny

Reputation: 3183

You can try something very naive as below:

s <- 'a pat b pat c pat d'
s <- unlist(strsplit(s, " "))
stringtomatch <- "pat"
paste(c(s[grepl(stringtomatch, s)], s[!grepl(stringtomatch, s)]), collapse = " ")
[1] "pat pat pat a b c d"
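The snippet above handles one string. A sketch of the same token-based idea applied to a whole character vector is shown below. The function name `move_tokens` is hypothetical, and it assumes tokens are separated by single spaces. Note that `grepl` moves any token that contains a match, so a token such as "patch" would also be moved unless the pattern is anchored (for example "^pat$").

```r
# Hypothetical helper: reorder space-separated tokens so matching ones come first
move_tokens <- function(x, pattern) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tok) {
    hit <- grepl(pattern, tok)
    paste(c(tok[hit], tok[!hit]), collapse = " ")
  }, character(1))
}

vec <- c("a pat b pat c pat d", "x pet y pet zz pet")
reordered <- move_tokens(vec, "^p[ae]t$")
print(reordered)
```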

Alternatively, pass a regular expression as stringtomatch for more advanced use cases.

Upvotes: 1
