How to extract a string in R up to the first (and not the last) occurrence of a character?

Question

I have a string

"Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"

and I would like to extract

"SRP72"

I am trying to use str_extract(), but it matches up to the last space rather than the first one:

str_extract(string = "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3", 
        pattern = "(GN=).*( )")

Thus, the result I get is "GN=SRP72 PE=1 ". If possible, could you please give an answer that uses the str_extract() function?

Ronak Shah · Accepted Answer

Since you don't want 'GN=' in the final output, we can use a lookbehind regex and extract the first word (\w+) that follows "GN=". Note that the backslash must be doubled inside an R string literal.

string = "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"
stringr::str_extract(string, pattern = "(?<=GN=)\\w+")
#[1] "SRP72"
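The over-match in the question comes from `.*` being greedy: it consumes as much as it can and then backs off only as far as the last space. As a minimal sketch (the variable names `greedy`, `lazy`, and `negated` are illustrative, not part of the original answer), two other ways to stop at the first space are a lazy quantifier and a negated character class:

```r
library(stringr)

string <- "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"

# Greedy: .* runs to the LAST space in the string (the behaviour in the question)
greedy <- str_extract(string, "GN=.*( )")
# "GN=SRP72 PE=1 "

# Lazy: .*? matches as little as possible, so it stops before the FIRST space
lazy <- str_extract(string, "(?<=GN=).*?(?= )")
# "SRP72"

# Negated class: take every character after "GN=" that is not a space
negated <- str_extract(string, "(?<=GN=)[^ ]+")
# "SRP72"
```

The negated class is probably the more robust of the two: the lazy version needs a space after the value, so it would return NA if the GN= field were the last one in the string.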

In base R, we can use sub():

sub('.*GN=(\\w+).*', '\\1', string)
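One caveat with the sub() approach is that it returns the input unchanged when the pattern does not match. As a hedged alternative sketch (the helper name `extract_gn` is hypothetical), base R can also extract the match directly with regexpr() and regmatches(), using perl = TRUE so that the lookbehind is supported:

```r
# Hypothetical helper: return the value after "GN=" up to the first space,
# or character(0) when the string has no "GN=" field.
extract_gn <- function(x) {
  m <- regexpr("(?<=GN=)[^ ]+", x, perl = TRUE)  # perl = TRUE enables lookbehind
  regmatches(x, m)
}

string <- "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"

found <- extract_gn(string)
# "SRP72"

missing <- extract_gn("no gene name here")
# character(0)

# For comparison, sub() hands back the whole input when nothing matches
unchanged <- sub(".*GN=(\\w+).*", "\\1", "no gene name here")
# "no gene name here"
```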
