Reputation: 21400
I'm trying to come up with a regex in R to match strings in which there is repetition of two distinct characters.
x <- c("aaaaaaah", "aaaah", "ahhhh", "cooee", "helloee", "mmmm", "noooo", "ohhhh", "oooaaah", "ooooh", "sshh", "ummmmm", "vroomm", "whoopee", "yippee")
The regex below matches all of the above, including strings such as "mmmm" and "ohhhh", where the first and the second repetition use the same letter:
grep(".*([a-z])\\1.*([a-z])\\2", x, value = TRUE)
What I'd like to match in x are only those strings where the repeated letters are distinct:
"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"
How can the regex be tweaked to make sure the second repeated character is not the same as the first?
Upvotes: 5
Views: 204
Reputation: 33498
If you can avoid regex altogether, then I think that's the way to go. A rough example:
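nrep <- sapply(
  strsplit(x, ""),
  function(y) {
    # Run-length encode the characters, then count the distinct
    # characters that occur in a run of length 2 or more
    run_lengths <- rle(y)
    length(unique(run_lengths$values[run_lengths$lengths >= 2]))
  }
)
x[nrep > 1]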
# [1] "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
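To make the run-length idea concrete, here is a minimal sketch of what rle() produces for a single string from x. The variable names (chars, runs, repeated) are illustrative only and not part of the answer above.

```r
# Split one string into single characters
chars <- strsplit("helloee", "")[[1]]

# Run-length encoding: consecutive identical characters collapse into one run
runs <- rle(chars)
runs$values    # the character of each run
runs$lengths   # how long each run is

# Distinct characters that appear in a run of length >= 2
repeated <- unique(runs$values[runs$lengths >= 2])

# The string qualifies when at least two distinct characters are repeated
length(repeated) > 1
```

For "helloee" the repeated runs are "ll" and "ee", so two distinct characters qualify and the string is kept.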
Upvotes: 1
Reputation: 626699
You can restrict the second character pattern with a negative lookahead. This requires perl=TRUE, because R's default regex engine (TRE) does not support lookaheads:
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
#                   ^^^^^^^
(?!\\1)([a-z]) means: match and capture into Group 2 any lowercase ASCII letter, provided it is not the same as the value captured in Group 1.
x <- c("aaaaaaah", "aaaah", "ahhhh", "cooee", "helloee", "mmmm", "noooo", "ohhhh", "oooaaah", "ooooh", "sshh", "ummmmm", "vroomm", "whoopee", "yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
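As a quick way to see what the lookahead changes, the sketch below runs both patterns over the same vector and lists the strings that only the unrestricted pattern accepts. The names loose, strict, hits_loose, hits_strict and dropped are introduced here for illustration.

```r
x <- c("aaaaaaah", "aaaah", "ahhhh", "cooee", "helloee", "mmmm", "noooo",
       "ohhhh", "oooaaah", "ooooh", "sshh", "ummmmm", "vroomm", "whoopee", "yippee")

# Pattern from the question: the second doubled letter may equal the first
loose  <- ".*([a-z])\\1.*([a-z])\\2"
# Same pattern, but (?!\\1) rejects a second letter equal to Group 1
strict <- ".*([a-z])\\1.*(?!\\1)([a-z])\\2"

hits_loose  <- grep(loose,  x, value = TRUE, perl = TRUE)
hits_strict <- grep(strict, x, value = TRUE, perl = TRUE)

# Strings in which only one letter is repeated
dropped <- setdiff(hits_loose, hits_strict)
dropped
```

The strings in dropped are exactly those, such as "mmmm" and "ohhhh", where a single letter accounts for both repetitions.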
Upvotes: 4