Chris Ruehlemann

Reputation: 21400

How to match distinct repeated characters

I'm trying to write a regex in R that matches strings containing two runs of a doubled character, where the doubled character differs between the two runs.

x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")

This regex matches all of the above, including strings such as "mmmm" and "ohhhh" where the repeated letter is the same in the first and the second repetition:

grep(".*([a-z])\\1.*([a-z])\\2", x, value = TRUE)

What I'd like to match in x are only those strings where the two repeated letters are distinct:

"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"

How can the regex be tweaked to make sure the second repeated character is not the same as the first?

Upvotes: 5

Views: 204

Answers (2)

s_baldur

Reputation: 33498

If you can avoid regex altogether, then I think that's the way to go. A rough example:

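# For each string, count how many distinct characters occur in a run of length >= 2
nrep <- sapply(
  strsplit(x, ""),  # split each string into single characters
  function(y) {
    run_lengths <- rle(y)  # run-length encoding of consecutive characters
    length(unique(run_lengths$values[run_lengths$lengths >= 2]))
  }
)
# Keep the strings with at least two distinct repeated characters
x[nrep > 1]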
# [1] "cooee"   "helloee" "oooaaah" "sshh"    "vroomm"  "whoopee" "yippee"
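
To make the counting step more concrete, here is a small sketch of what rle() returns for a single string. The string "whoopee" is taken from x; the variable names chars and repeated are only illustrative.

```r
# Split one string into characters and run-length encode it
chars <- strsplit("whoopee", "")[[1]]
run_lengths <- rle(chars)

run_lengths$values   # "w" "h" "o" "p" "e"
run_lengths$lengths  # 1 1 2 1 2

# Characters that occur in a run of two or more
repeated <- run_lengths$values[run_lengths$lengths >= 2]
repeated                  # "o" "e"
length(unique(repeated))  # 2, so "whoopee" is kept by nrep > 1
```

A string such as "mmmm" yields a single run, so the count is 1 and the string is dropped.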

Upvotes: 1

Wiktor Stribiżew

Reputation: 626699

You can restrict the second character pattern with a negative lookahead:

grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
#                    ^^^^^

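(?!\\1)([a-z]) matches any lowercase ASCII letter and captures it into Group 2, but only if that letter is not the same as the value captured in Group 1. Lookaheads are not supported by R's default regex engine, which is why perl=TRUE is required.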

R demo:

x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee"   "helloee" "oooaaah" "sshh"    "vroomm"  "whoopee" "yippee" 
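
To see what the lookahead changes, the sketch below runs both patterns on a small subset of x. The vector samples and the names without_lookahead and with_lookahead are chosen here purely for illustration.

```r
# A few strings from x: two with the same letter repeated twice, two with distinct letters
samples <- c("mmmm", "ohhhh", "cooee", "sshh")

# Original pattern: the second doubled letter may equal the first
without_lookahead <- grepl(".*([a-z])\\1.*([a-z])\\2", samples, perl = TRUE)

# Pattern with (?!\\1): the second doubled letter must differ from Group 1
with_lookahead <- grepl(".*([a-z])\\1.*(?!\\1)([a-z])\\2", samples, perl = TRUE)

without_lookahead  # TRUE TRUE TRUE TRUE
with_lookahead     # FALSE FALSE TRUE TRUE
```

Only "cooee" and "sshh" survive once the lookahead is added, because "mmmm" and "ohhhh" contain no second doubled letter that differs from the first.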

Upvotes: 4
