Reputation: 1
I have data that looks like "TAGCAGaaccgtaAGTCAAgcgta" that I would like to split at each boundary between uppercase and lowercase characters. So my output would be a list of the uppercase strings "TAGCAG" and "AGTCAA" and a list of the lowercase strings "aaccgta" and "gcgta".
I have tried:
str <- c("TAGCAGaaccgtaAGTCAAgcgta")
library(stringr)
str_extract(str, '[[:lower:]]+')
str_extract(str, '[[:upper:]]+')
but this only gives me the first uppercase or lowercase match. I would like to get all of the matches of each kind in a list or data frame.
Upvotes: 0
Views: 759
Reputation: 8402
As @Calum You said, str_extract_all() returns all instances of the matched pattern:
str_extract_all(str, '[[:lower:]]+')
[[1]]
[1] "aaccgta" "gcgta"
str_extract_all(str, '[[:upper:]]+')
[[1]]
[1] "TAGCAG" "AGTCAA"
Or you can use the | (alternation) operator in the regex to extract both the uppercase and the lowercase runs at the same time, in the order they appear.
str_extract_all(str, '[[:lower:]]+|[[:upper:]]+')
[[1]]
[1] "TAGCAG" "aaccgta" "AGTCAA" "gcgta"
str_extract_all() returns a list with one element per input string, so you can unlist() the output to get a plain character vector.
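Since the question asks for a list or data frame, here is a minimal sketch of the unlist() step followed by labelling each run by case. It assumes the same input string as in the question; the names dna, runs, is_upper and result are only illustrative.

```r
library(stringr)

dna <- "TAGCAGaaccgtaAGTCAAgcgta"

# str_extract_all() returns a list with one element per input string;
# unlist() flattens it into a plain character vector.
runs <- unlist(str_extract_all(dna, "[[:lower:]]+|[[:upper:]]+"))

# TRUE for runs made up entirely of uppercase letters.
is_upper <- str_detect(runs, "^[[:upper:]]+$")

# One row per run, with a column recording which case it is.
result <- data.frame(
  sequence = runs,
  case = ifelse(is_upper, "upper", "lower")
)
```

Filtering result on the case column then gives the uppercase and lowercase runs separately.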
Upvotes: 3
Reputation: 1718
In base R, we can do this by combining gregexpr() with regmatches():
m <- gregexpr("[[:upper:]]+|[[:lower:]]+", str)
regmatches(str, m)
Console:
[[1]]
[1] "TAGCAG" "aaccgta" "AGTCAA" "gcgta"
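To get the uppercase and lowercase runs as separate elements of one list, as the question asks, the matches can be grouped with split(). This is a rough sketch in base R only; the names dna, runs and by_case are illustrative and not part of the original answer.

```r
dna <- "TAGCAGaaccgtaAGTCAAgcgta"

# Find every run of same-case letters, then pull the matched substrings out.
m <- gregexpr("[[:upper:]]+|[[:lower:]]+", dna)
runs <- regmatches(dna, m)[[1]]

# Label each run by case and group the runs by that label.
case_label <- ifelse(grepl("^[[:upper:]]+$", runs), "upper", "lower")
by_case <- split(runs, case_label)
```

by_case$upper and by_case$lower then hold the two groups as character vectors.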
Upvotes: 1