Pavel Shliaha

Reputation: 935

How to extract a string in R up to the first (and not the last) occurrence of a character?

I have a string

"Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"

and I would like to extract

"SRP72"

I am trying to use str_extract(), but it matches up to the last space rather than the first:

str_extract(string = "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3", 
        pattern = "(GN=).*( )")

Thus, the match I get is "GN=SRP72 PE=1 ". If possible, could you please give an answer that uses the str_extract() function?

Upvotes: 0

Views: 221

Answers (2)

akrun

Reputation: 887961

We can use regmatches/regexpr in base R:

regmatches(string, regexpr("(?<=GN=)\\w+", string, perl = TRUE))
#[1] "SRP72"

data

string <- "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"
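As a rough illustration of why the pattern in the question overshoots, the sketch below contrasts a greedy `.*` with a lazy `.*?` using the same base R functions. The variable names `greedy` and `lazy` are only illustrative and are not part of the original answer.

```r
string <- "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"

# Greedy: .* consumes as much as possible, so the match ends at the LAST space
greedy <- regmatches(string, regexpr("GN=.* ", string, perl = TRUE))
greedy
#[1] "GN=SRP72 PE=1 "

# Lazy: .*? consumes as little as possible, so the match ends at the FIRST space
lazy <- regmatches(string, regexpr("GN=.*? ", string, perl = TRUE))
lazy
#[1] "GN=SRP72 "
```

The lazy version still includes "GN=" and the trailing space, which is why the lookbehind with `\\w+` is the tidier choice here.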

Upvotes: 0

Ronak Shah

Reputation: 389325

Since you don't want 'GN=' in the final output, we can use a lookbehind regex and extract the first word (\\w+) that follows "GN=".

string = "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 GN=SRP72 PE=1 SV=3"
stringr::str_extract(string, pattern = "(?<=GN=)\\w+")
#[1] "SRP72"

In base R, we can use sub:

sub('.*GN=(\\w+).*', '\\1', string)
#[1] "SRP72"
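One difference worth keeping in mind, sketched below with a hypothetical input `no_gene` that has no "GN=" field: sub() returns the input unchanged when the pattern does not match, whereas regmatches()/regexpr() returns a zero-length vector (and str_extract() returns NA in that case).

```r
# Hypothetical input without a "GN=" field, for illustration only
no_gene <- "Signal recognition particle subunit SRP72 OS=Homo sapiens OX=9606 PE=1 SV=3"

# sub() finds nothing to replace, so the whole input comes back unchanged
unchanged <- sub('.*GN=(\\w+).*', '\\1', no_gene)
identical(unchanged, no_gene)
#[1] TRUE

# regmatches()/regexpr() returns a zero-length character vector instead
missing_match <- regmatches(no_gene, regexpr("(?<=GN=)\\w+", no_gene, perl = TRUE))
missing_match
#character(0)
```

If some strings may lack the field, the extraction-style functions make the missing case easier to detect than sub().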

Upvotes: 1
