Powege

Reputation: 705

How to split a string based on one or more occurrences of a given character?

Given the string:

string <- "AATTGGCGCTAG---AT-TTACG----"

How can I split it into substrings at each run of one or more "-", keeping the runs of "-" as well? For example:

string1 <- "AATTGGCGCTAG"
string2 <- "---"
string3 <- "AT"
string4 <- "-"
string5 <- "TTACG"
string6 <- "----"

I have tried:

strsplit(string, "[-]+")

However, this does not return the strings of "-".

Upvotes: 3

Views: 302

Answers (2)

Wiktor Stribiżew

Reputation: 626758

You may match them with

[^-]+|-+

The pattern matches:

  • [^-]+ - one or more characters other than -
  • | - or
  • -+ - one or more - characters.

In R:

x <- "AATTGGCGCTAG---AT-TTACG----"
regmatches(x, gregexpr("[^-]+|-+", x))

Or

library(stringr)
x <- "AATTGGCGCTAG---AT-TTACG----"
str_extract_all(x, "[^-]+|-+")

Output

## => [[1]]
##    [1] "AATTGGCGCTAG" "---"   "AT"  "-"   "TTACG"   "----"
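Both calls return a list with one element per input string, so a follow-up step usually takes the first element. A minimal sketch of that step, assuming you want the dash runs and the letter runs in separate vectors (the names parts, is_gap, gaps and bases are only illustrative):

```r
x <- "AATTGGCGCTAG---AT-TTACG----"

# Take the character vector of pieces for the single input string
parts <- regmatches(x, gregexpr("[^-]+|-+", x))[[1]]

# TRUE for pieces made up only of "-" characters
is_gap <- grepl("^-+$", parts)

gaps  <- parts[is_gap]    # runs of "-"
bases <- parts[!is_gap]   # runs of other characters

# Pasting the pieces back together should reproduce the input
rebuilt <- paste(parts, collapse = "")
```

Because the pattern covers every character of the input, pasting the pieces back together gives the original string, which is a quick sanity check.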

Upvotes: 4

Tim Biegeleisen

Reputation: 521103

Here is a direct fix to your current attempt with strsplit:

string <- "AATTGGCGCTAG---AT-TTACG----"
strsplit(string, "(?<=[^-])(?=[-])|(?<=[-])(?=[^-])", perl=TRUE)[[1]]

[1] "AATTGGCGCTAG" "---"          "AT"           "-"            "TTACG"
[6] "----"

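The idea behind the regex pattern is to split whenever one of the following two conditions is true:

  • The immediately preceding character is NOT a dash, and what follows IS a dash, or
  • The immediately preceding character IS a dash, and what follows is NOT a dash

Both alternatives consist only of zero-width lookarounds, so no characters are consumed and the runs of dashes are kept in the output.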
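The same "split wherever dash-ness changes" idea can also be sketched without a regex. This is a different technique from the strsplit call above, not a restatement of it: it uses base R's rle() on a logical vector, and the variable names are only illustrative.

```r
string <- "AATTGGCGCTAG---AT-TTACG----"

# One element per character, then flag which ones are dashes
chars   <- strsplit(string, "")[[1]]
is_dash <- chars == "-"

# Run-length encode the flags: each run is one output piece
runs   <- rle(is_dash)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# substring() is vectorised over the start and end positions
pieces <- substring(string, starts, ends)
```

Each boundary between two runs is exactly a position where one of the two conditions above holds, so for this input pieces holds the same six substrings.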

Upvotes: 0
