Tyler Rinker

Reputation: 109844

regex capture repeated phrases

I can capture repeated words pretty easily using: "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b" but this regex does not seem to extend to multiple words (and why should it, in its current state?). How could I find repeated phrases using regex?

Here I extract repeated terms (regardless of case), but the same regex doesn't work to extract a repeated phrase:

library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)

I would hope for a regex that would return:

"is a big is a Big"

for:

x <- "this is a big is a Big deal"

To cover corner cases, here is a larger test vector and the desired output:

y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)


[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big" "those of, those of"

[[6]]
[1] NA

My current regex only gets me to:

rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)

## [[1]]
## [1] NA
## 
## [[2]]
## [1] "want want"
## 
## [[3]]
## [1] "want, want"
## 
## [[4]]
## [1] "want...want" "see see"    
## 
## [[5]]
## [1] NA

Upvotes: 5

Views: 288

Answers (2)

G. Grothendieck

Reputation: 269471

Try this:

> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"

Here is a visualization (note that the visualization contains an error: the \S parts should be inside the group).

(?i)\b(\S.*\S)[ ,.]*\b(\1)

[Image: regular expression visualization (Debuggex demo)]

You might want to replace [ ,.] with [ [:punct:]]. I did not do that since Debuggex does not support POSIX character classes.
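As a rough check, the snippet below runs both the posted pattern and the [ [:punct:]] variant against the question's full test vector, including the sixth string that the output above does not show. It is a sketch in base R only; extract_all is a hypothetical helper name, not part of any package, and y is simply the question's vector re-declared so the snippet runs on its own.

```r
# Test vector from the question, re-declared so the snippet is self-contained.
y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)

# The pattern as posted, and the suggested [ [:punct:]] variant.
pat_posted <- "(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)"
pat_punct  <- "(?i)\\b(\\S.*\\S)[ [:punct:]]*\\b(\\1)"

# Hypothetical helper: return every match of `pattern` in each element of `text`.
extract_all <- function(pattern, text) {
    regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

res_posted <- extract_all(pat_posted, y)
res_punct  <- extract_all(pat_punct, y)

# Both patterns treat ". " as a separator, so the sixth string yields a match
# ("it. It") rather than the NA the question asks for.
res_posted[[6]]
res_punct[[6]]
```

If a sentence boundary should not count as a separator, one option is to drop the single dot from the separator class and allow only an explicit ellipsis, at the cost of a slightly longer pattern.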

Upvotes: 1

BrodieG

Reputation: 52637

I think this does what you want (note that the only separators allowed between repeats are whitespace, ..., and commas, in any combination, but you should be able to tweak that easily):

pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)

Produces:

[[1]]
[1] "is a big is a Big"

[[2]]
[1] "want want"

[[3]]
[1] "want, want"

[[4]]
[1] "want...want" "see see"    

[[5]]
[1] "is a big is a Big"  "those of, those of"
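
For readers without qdapRegex installed, the same pattern can be tried with base R. This is only a sketch: it assumes PCRE matching (perl = TRUE) behaves like the engine behind rm_default for this pattern, and it converts the empty result that regmatches returns for a non-matching string into NA to mimic the output format in the question. The vector y is the question's test vector, re-declared here.

```r
# Test vector from the question, re-declared so the snippet is self-contained.
y <- c(
    "this is a big is a Big deal",
    "I want want to see",
    "I want, want to see",
    "I want...want to see see how",
    "this is a big is a Big deal for those of, those of you who are.",
    "I like it. It is cool"
)

# Same pattern as in the answer: a captured phrase, then one or more
# repetitions of (separators + the same phrase), matched case-insensitively.
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"

# Base R extraction of all matches per string (PCRE engine).
hits <- regmatches(y, gregexpr(pattern, y, perl = TRUE))

# regmatches() gives character(0) when nothing matches; map that to NA.
hits[lengths(hits) == 0] <- list(NA_character_)

hits
```

Because a single period is not among the allowed separators, the sixth string ("I like it. It is cool") produces no match and ends up as NA here.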

Upvotes: 3
