Reputation: 85

R: How to extract the exact characters between two strings for a vector

dat1 <- c('human(display_long)|uniprotkb:ESR1(gene name)')
dat2 <- c('human(display_long)|uniprotkb:TP53(gene name)')
dat3 <- c('human(display_long)|uniprotkb:GPX4(gene name)')
dat4 <- c('human(display_long)|uniprotkb:ALOX15(gene name)')
dat5 <- c('human(display_long)|uniprotkb:PGR(gene name)')
dat <- c(dat1,dat2,dat3,dat4,dat5)

How can I extract the gene name between 'human(display_long)|uniprotkb:' and '(gene name)' for each element of the vector dat? Thanks!

Upvotes: 2

Answers (4)

akrun

Reputation: 887028

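We can use str_remove_all from stringr to remove everything up to uniprotkb: and everything from the opening parenthesis onward.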

library(stringr)
str_remove_all(dat, ".*uniprotkb:|\\(.*")
[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

Or use trimws from base R (the whitespace argument requires R 3.6.0 or later)

trimws(dat, whitespace = ".*uniprotkb:|\\(.*")
[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"
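
For comparison, the same two-sided removal can probably be written with base R's gsub, without loading stringr. This is a minimal sketch that reuses the pattern above on the question's data; the variable name genes is only illustrative.

```r
# Sample input copied from the question
dat <- c("human(display_long)|uniprotkb:ESR1(gene name)",
         "human(display_long)|uniprotkb:TP53(gene name)",
         "human(display_long)|uniprotkb:GPX4(gene name)",
         "human(display_long)|uniprotkb:ALOX15(gene name)",
         "human(display_long)|uniprotkb:PGR(gene name)")

# Delete everything up to and including "uniprotkb:",
# then everything from the next "(" to the end of the string
genes <- gsub(".*uniprotkb:|\\(.*", "", dat)
genes
```

Like str_remove_all, gsub replaces every match of the pattern, so both the prefix and the suffix are removed in one call.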

Upvotes: 0

GKi

Reputation: 39657

You can use regexpr and regmatches to extract the text between human(display_long)|uniprotkb: and (gene name).

regmatches(dat
 , regexpr("(?<=human\\(display_long\\)\\|uniprotkb:).*(?=\\(gene name\\))"
 , dat, perl=TRUE))
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

Here (?<=human\\(display_long\\)\\|uniprotkb:) is a positive lookbehind for human(display_long)|uniprotkb:, (?=\\(gene name\\)) is a positive lookahead for (gene name), and .* is the text in between.

Another way is to use sub, but note that sub returns the input string unchanged when there is no match.

sub(".*human\\(display_long\\)\\|uniprotkb:(.*)\\(gene name\\).*", "\\1", dat)
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"
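
To illustrate the no-match caveat, here is a small sketch. The second element of dat_bad is a made-up string that does not follow the expected format, and the names dat_bad, pat, res_sub, res_safe and res_regmatches are assumptions introduced only for this example.

```r
# One well-formed entry and one hypothetical malformed entry
dat_bad <- c("human(display_long)|uniprotkb:ESR1(gene name)",
             "no gene here")

pat <- ".*human\\(display_long\\)\\|uniprotkb:(.*)\\(gene name\\).*"

# sub() leaves a non-matching element untouched
res_sub <- sub(pat, "\\1", dat_bad)

# One possible guard: return NA where the pattern does not match
res_safe <- ifelse(grepl(pat, dat_bad), sub(pat, "\\1", dat_bad), NA_character_)

# regmatches() with regexpr() silently drops non-matching elements instead
res_regmatches <- regmatches(dat_bad,
                             regexpr("(?<=:)[^(]*", dat_bad, perl = TRUE))
```

With sub the result keeps the same length as the input, while regmatches returns a shorter vector, so which behaviour is preferable depends on whether positions need to stay aligned with dat.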

Other ways not searching for the full pattern might be:

regmatches(dat, regexpr("(?<=:)[^(]*", dat, perl=TRUE))
sub(".*:([^(]*).*", "\\1", dat)
sub(".*:(.*)\\(.*", "\\1", dat)

Upvotes: 1

Peter

Reputation: 12699

Using stringr and look behind you could try this:

library(stringr)
str_extract(dat, "(?<=:)[A-Za-z0-9]+")
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

This assumes that there is only one colon and that it immediately precedes the gene name.
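
A quick sketch of why the single-colon assumption matters. The input x below is a hypothetical string with a second colon, not data from the question, and base R's regmatches is used here so that the example runs without stringr; str_extract likewise returns only the first match.

```r
# Hypothetical input with an extra colon before the gene name
x <- "taxid:9606(human)|uniprotkb:ESR1(gene name)"

# Lookbehind on any colon: the first match follows "taxid:"
first_colon <- regmatches(x, regexpr("(?<=:)[A-Za-z0-9]+", x, perl = TRUE))

# Lookbehind anchored on "uniprotkb:" picks out the gene name
anchored <- regmatches(x, regexpr("(?<=uniprotkb:)[A-Za-z0-9]+", x, perl = TRUE))
```

Anchoring the lookbehind on the full prefix is slightly more verbose but should be more robust if other fields containing colons can appear before the gene name.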

Upvotes: 0

Ronak Shah

Reputation: 388862

You can try this regex, which extracts the text between 'uniprotkb:' and the opening parenthesis (.

sub('.*uniprotkb:(\\w+)\\(.*', '\\1', dat)
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

Upvotes: 0
