Science Bagel

Reputation: 1

Separating uppercase and lowercase characters in a string

I have data that looks like "TAGCAGaaccgtaAGTCAAgcgta", and I would like to split it at each boundary between uppercase and lowercase characters. My output would be a list of the uppercase strings "TAGCAG" and "AGTCAA" and a list of the lowercase strings "aaccgta" and "gcgta".

I have tried

str <- c("TAGCAGaaccgtaAGTCAAgcgta")
library(stringr)
str_extract(str, '[[:lower:]]+')
str_extract(str, '[[:upper:]]+')

but this only gives me the first run of uppercase or lowercase characters. I would like to get all of the runs of each case in a list or data frame.

Upvotes: 0

Views: 759

Answers (2)

Rich Pauloo

Reputation: 8402

Extract into separate vectors:

As @Calum You said, str_extract_all() returns all instances of the matched pattern:

str_extract_all(str, '[[:lower:]]+')
[[1]]
[1] "aaccgta" "gcgta"  

str_extract_all(str, '[[:upper:]]+')
[[1]]
[1] "TAGCAG" "AGTCAA"

Extract in one vector:

Or you can use the regex alternation operator | to match both uppercase and lowercase runs in a single call. The matches come back in the order they appear in the string.

str_extract_all(str, '[[:lower:]]+|[[:upper:]]+')
[[1]]
[1] "TAGCAG"  "aaccgta" "AGTCAA"  "gcgta" 

str_extract_all() returns a list with one element per input string; wrap the call in unlist() to get a plain character vector.
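Since the question asks for a list of each case, here is a minimal sketch that combines the two str_extract_all() calls with unlist(). The names upper, lower and segments are only illustrative, and it assumes a single input string as in the question.

```r
library(stringr)

str <- "TAGCAGaaccgtaAGTCAAgcgta"

# unlist() flattens the one-element list into a plain character vector
upper <- unlist(str_extract_all(str, "[[:upper:]]+"))
lower <- unlist(str_extract_all(str, "[[:lower:]]+"))

# A named list keeps both groups together; the two vectors
# do not need to have the same length
segments <- list(upper = upper, lower = lower)
```

segments$upper and segments$lower then hold the uppercase and lowercase runs respectively. A list is used here rather than a data frame with one column per case, because the two groups may contain different numbers of runs.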

Upvotes: 3

JdeMello

Reputation: 1718

In base R, we can do this by combining gregexpr(), which finds the position of every match, with regmatches(), which extracts the matched text:

m <- gregexpr("[[:upper:]]+|[[:lower:]]+", str)

regmatches(str, m)

Console:

[[1]]
[1] "TAGCAG"  "aaccgta" "AGTCAA"  "gcgta"  
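If a data frame is wanted, as the question mentions, one possible base R sketch labels each run with its case. The column names segment and case are assumptions chosen for illustration, and the code assumes a single input string.

```r
str <- "TAGCAGaaccgtaAGTCAAgcgta"

# All runs of same-case letters, in order of appearance
pieces <- regmatches(str, gregexpr("[[:upper:]]+|[[:lower:]]+", str))[[1]]

# Every run is entirely one case, so checking the whole run is enough
is_upper <- grepl("^[[:upper:]]+$", pieces)

# One row per run, with a label for its case
segments <- data.frame(
  segment = pieces,
  case = ifelse(is_upper, "upper", "lower"),
  stringsAsFactors = FALSE
)
```

From here, split(segments$segment, segments$case) gives a named list with one character vector per case, which keeps the original order within each group.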

Upvotes: 1
