M. L

Reputation: 85

Extracting string after a specific pattern in R

I want to extract strings from a list that contains identifiers of different lengths. Essentially, I want to keep every character of each identifier up to the third occurrence of "-", drop any letters at the end of that part, and remove the rest. An example of the list is below:

mylist <- c("abc-nop-7a-2","abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")

I want the resulting list to look like:

result <- c("abc-nop-7","abc-nop-7", "abc-nop-18", "abc-xyz-198")

I tried splitting the strings and then taking the sections I want, but I was not sure how to select all sections up to a certain point. I tried:

mylist <- gsub("-", "_", mylist) # "-" was not acceptable as a character
mylist <- strsplit(mylist, "_")
sapply(mylist, `[`, 3)

But of course, the above only gives me the third section of each string, something like this:

"7a" "7b" "18a" "198"

Is there a way to extract sections 1 to 3 of the strings I split with the method above? Or, if there is a more efficient way to do this with stringr or something similar, I'd appreciate that as well.

Thanks in advance.

Upvotes: 1

Views: 925

Answers (1)

akrun

Reputation: 887118

We can capture the part to keep as a group and replace the whole string with the backreference (\\1):

sub("^(([^-]+-){2}[0-9]+).*", "\\1", mylist)
[1] "abc-nop-7"   "abc-nop-7"   "abc-nop-18"  "abc-xyz-198"

The pattern matches, from the start (^) of the string, two ({2}) instances of one or more characters that are not a - ([^-]+), each followed by a -, and then one or more digits ([0-9]+). That whole prefix is captured as a group ((...)), while the trailing .* matches the rest of the string. In the replacement, the backreference (\\1) to the captured group keeps only that prefix.
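For comparison, the split-based approach from the question can also be made to work. The following is a minimal base R sketch, not part of the original answer. The names parts and first_three are illustrative, and it assumes every identifier has at least three sections separated by "-" or "_".

```r
mylist <- c("abc-nop-7a-2", "abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")

# Split on either "-" or "_" in one step, so no prior gsub() is needed
parts <- strsplit(mylist, "[-_]")

# Keep the first three sections of each identifier and rejoin them with "-"
first_three <- vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1))

# Drop any letters left at the end of the third section (e.g. "7a" becomes "7")
result <- sub("[A-Za-z]+$", "", first_three)
result
# [1] "abc-nop-7"   "abc-nop-7"   "abc-nop-18"  "abc-xyz-198"
```

The single sub() call above is shorter. Note also that sub() returns a string unchanged when the pattern does not match, whereas this sketch would produce "NA" sections for identifiers with fewer than three parts.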

Upvotes: 1
