Elsa Li

Reputation: 775

How to split a string into runs of the same consecutive letter in R

I have a string like this:

s <- "aaehhhhhhhaannd"

How can I split the string into the following format with R?

c("aa", "e", "hhhhhhh", "aa", "nn", "d")

Upvotes: 3

Views: 126

Answers (2)

Wiktor Stribiżew

Reputation: 626728

You can use base R's strsplit with a PCRE regex built on lookarounds:

s <- "aaehhhhhhhaannd"
strsplit(s, "(?<=(.))(?!\\1)", perl=TRUE)
# [[1]]
# [1] "aa"      "e"       "hhhhhhh" "aa"      "nn"      "d"      


Regex details:

  • (?<=(.)) - a positive lookbehind ((?<=...)) that looks at the character immediately to the left of the current position and captures it into Group 1 with the (.) capturing group (the rest of the pattern can refer to this value with the \1 backreference)
  • (?!\\1) - a negative lookahead that fails the match if the character immediately to the right of the current position is the same as the one captured into Group 1.

Since lookarounds do not consume text, every match is zero-width, so the split occurs at the position between two different characters.
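
To make the run boundaries explicit, the same result can also be reached without a regex. This is a minimal sketch using base R's rle (run-length encoding), not the lookaround method described above. The variable names chars, runs and parts are only illustrative, and strrep assumes R 3.3.0 or later.

```r
s <- "aaehhhhhhhaannd"

# Split into single characters, then run-length encode them
chars <- strsplit(s, "")[[1]]
runs <- rle(chars)

# Rebuild each run by repeating its character `lengths` times
parts <- strrep(runs$values, runs$lengths)
parts
# [1] "aa"      "e"       "hhhhhhh" "aa"      "nn"      "d"
```

Here runs$lengths holds the size of each run, which is handy if the counts are needed as well as the pieces.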

NOTE: If you want . to match a newline too, add (?s) at the start of the pattern (in a PCRE regex, . does not match line breaks by default).
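
As an illustration of that note, here is a small sketch with a made-up input, s2, that contains a run of two newlines. The names s2 and parts_nl are hypothetical and not part of the original question.

```r
s2 <- "aa\n\nbb"  # hypothetical input containing a run of two newlines

# (?s) lets . match "\n", so the newline run is captured into Group 1
# and compared against the next character like any other run
parts_nl <- strsplit(s2, "(?s)(?<=(.))(?!\\1)", perl = TRUE)[[1]]
parts_nl
# [1] "aa"   "\n\n" "bb"
```

Without (?s), the lookbehind cannot capture a newline, so no split can happen directly after one.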

Upvotes: 3

akuiper

Reputation: 214917

You can use str_extract_all with the regex (.)\\1*, which uses a backreference to match a character followed by any number of repetitions of that same character:

library(stringr)
str_extract_all("aaehhhhhhhaannd", "(.)\\1*")
#[[1]]
#[1] "aa"      "e"       "hhhhhhh" "aa"      "nn"      "d"
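
If adding the stringr dependency is not desirable, the same extraction can probably be done in base R. The following is a sketch using gregexpr and regmatches with the same pattern; the variable names m and runs_base are only illustrative.

```r
s <- "aaehhhhhhhaannd"

# Locate every match of "a character plus its repetitions"
m <- gregexpr("(.)\\1*", s, perl = TRUE)

# Extract the matched substrings; the result is a list with one element per input string
runs_base <- regmatches(s, m)
runs_base
# [[1]]
# [1] "aa"      "e"       "hhhhhhh" "aa"      "nn"      "d"
```

Like str_extract_all, this extracts the runs rather than splitting between them, so no lookarounds are needed.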

Upvotes: 3
