Reputation: 109844
I can capture repeated words pretty easily using:
"(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"
but this regex does not extend to multiple words (nor should it, in its current state). How can I find repeated phrases using regex?
Here I extract repeated terms (regardless of case), but the same regex doesn't work for extracting a repeated phrase:
library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
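For readers without qdapRegex installed, the same behaviour can be reproduced with base R. This is only a sketch: `extract_all` is a hypothetical helper name, and base R returns `character(0)` where `rm_default` reports `NA`. It shows that the backreference `\\1` is tied to `(\\w+)`, a single word, so only immediately repeated words can match:

```r
# Word-level pattern from the question: group 1 is a single word (\w+),
# so \1 can only ever repeat one word.
word_pat <- "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b"

# Hypothetical helper (not part of qdapRegex): extract all matches with base R.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

# A repeated word is found ...
res_word <- extract_all("this is a big Big deal", word_pat)[[1]]
res_word
# [1] "big Big"

# ... but a repeated phrase is not, because no single word is
# immediately followed by itself.
res_phrase <- extract_all("this is a big is a Big deal", word_pat)[[1]]
res_phrase
# character(0)
```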
I would hope for a regex that would return:
"is a big is a Big"
for:
x <- "this is a big is a Big deal"
To cover corner cases, here's a larger desired test and output:
y <- c(
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see how",
"this is a big is a Big deal for those of, those of you who are.",
"I like it. It is cool"
)
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
[[6]]
[1] NA
My current regex only gets me to:
rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "want want"
##
## [[3]]
## [1] "want, want"
##
## [[4]]
## [1] "want...want" "see see"
##
## [[5]]
## [1] NA
Upvotes: 5
Views: 288
Reputation: 269471
Try this:
> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Here is a visualization (note that it contains an error: the \S parts should be inside the group):
(?i)\b(\S.*\S)[ ,.]*\b(\1)
You might want to replace [ ,.] with [ [:punct:]]. I did not do that since debuggex does not support POSIX character classes.
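As a rough sketch of the `[ [:punct:]]` suggestion, the snippet below compares the two separator classes on an input that uses a semicolon, which is not among the test cases in the question (the input string and variable names are made up for illustration):

```r
# Pattern from the answer: separators limited to space, comma and period.
pat_orig  <- "(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)"
# Variant with a POSIX class: a space or any punctuation character.
pat_punct <- "(?i)\\b(\\S.*\\S)[ [:punct:]]*\\b(\\1)"

# Illustrative input; the semicolon is not covered by [ ,.]
x2 <- "I want; want to see"

res_orig  <- regmatches(x2, gregexpr(pat_orig,  x2, perl = TRUE))[[1]]
res_punct <- regmatches(x2, gregexpr(pat_punct, x2, perl = TRUE))[[1]]

res_orig
# character(0)
res_punct
# [1] "want; want"
```

Note that the group `(\S.*\S)` needs at least two characters, so a repeated one-letter word such as "a a" would not be matched by either variant.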
Upvotes: 1
Reputation: 52637
I think this does what you want (note we only allow a single space, "...", or "," as separators, but you should be able to tweak that easily):
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)
Produces:
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
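One possible way to do that tweaking, sketched with base R's `regmatches()`/`gregexpr()` rather than `rm_default()`: build the separator alternation from a character vector. The helper `make_pattern` and the semicolon input are assumptions made for illustration; they are not part of the answer or of qdapRegex.

```r
# Hypothetical helper: plug a set of separator regexes into the pattern above.
# The default reproduces the pattern from this answer.
make_pattern <- function(seps = c("\\s", "\\.{3}", ",")) {
  sprintf("(?i)\\b(\\w.*)((?:%s)+\\1)+\\b", paste(seps, collapse = "|"))
}

extract_phrases <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

x3 <- "I want; want to see"

# Default separators do not include ";", so nothing is found.
res_default <- extract_phrases(x3, make_pattern())[[1]]
res_default
# character(0)

# Adding ";" as a separator picks up the repetition.
res_semi <- extract_phrases(x3, make_pattern(c("\\s", "\\.{3}", ",", ";")))[[1]]
res_semi
# [1] "want; want"
```

Because group 1 is `(\\w.*)`, it may end in whitespace, so a separator surrounded by spaces (for example " - ") can produce a match with a trailing space; trimming the results afterwards is one way to handle that.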
Upvotes: 3