AllenH

Reputation: 63

How to split strings in R with a regular expression while keeping parts of the matched pattern in the resulting substrings?

I have a character vector like this: x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").

I'd like to split each string into two components: 1) the leading set of capital-letter codes (2 or 3 letters each, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".

The results should be

[[1]]
[1] "ABC"

[[2]]
[1] "ABC, EF"

[[3]]
[1] "ABC, DEF"      "2 stems"

[[4]]
[1] "DE"  "other comments, and stuff"

I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned

[[1]]
[1] "ABC"

[[2]]
[1] "ABC, EF"

[[3]]
[1] "ABC, D"      " stems"

[[4]]
[1] ""                        "ther comments, and stuff"

Where to split depends on both the end of the first substring and the beginning of the second, but because those characters are part of the match, they are removed from the result.

Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!

Upvotes: 1

Views: 31

Answers (1)

akrun

Reputation: 887158

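One option is str_split from stringr. The pattern matches ", " only when it is followed by a lowercase letter or digit; the lookahead (?=...) checks that character without consuming it, so it stays in the second piece, and n = 2 limits the result to at most two pieces per string.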

library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"

#[[2]]
#[1] "ABC, EF"

#[[3]]
#[1] "ABC, DEF" "2 stems" 

#[[4]]
#[1] "DE"                        "other comments, and stuff"
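If stringr is not available, roughly the same result can be obtained in base R. The following is only a sketch under the assumptions of the example data, not part of the original answer; the variable names (split_lookaround, first_hit, split_first) are made up for illustration. The first variant adds a lookbehind so that the two capital letters before the comma are checked but not consumed. The second mimics n = 2 by splitting only at the first match, since base strsplit has no argument to limit the number of pieces.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Variant 1: split on ", " only when it is preceded by two capital letters
# (lookbehind) and followed by a lowercase letter or digit (lookahead).
# Lookarounds are zero-width, so only ", " itself is removed.
# perl = TRUE is required for lookaround syntax in base R.
split_lookaround <- strsplit(x, "(?<=[A-Z][A-Z]), (?=[a-z0-9])", perl = TRUE)

# Variant 2: locate only the FIRST ", " followed by a lowercase letter or
# digit, then keep the text on either side of it (invert = TRUE).
# Strings without a match are returned unchanged.
first_hit <- regexpr(", (?=[a-z0-9])", x, perl = TRUE)
split_first <- regmatches(x, first_hit, invert = TRUE)

print(split_lookaround)
print(split_first)
```

Note that plain strsplit(x, ", (?=[a-z0-9])", perl = TRUE) without the lookbehind would split the fourth string at both commas, which is why one of the two variants above is needed when staying in base R.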

Upvotes: 1
