JoshDG

Reputation: 3931

R: Regular Expressions in R - Multi string extraction

Say I have a string like this:

[1] "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"

I want to extract each of these gene IDs into a vector. strsplit would probably work in this case, but I want to do this with a regex because I will have more complex cases later. Say I want to extract every substring that matches '[A-Z0-9]{2,}' (that is, any run of at least two capital letters and digits).

Thoughts?

Upvotes: 0

Views: 797

Answers (2)

G. Grothendieck

Reputation: 269644

1) strapply

strapply in the gsubfn package can do that:

library(gsubfn)
x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
strapply(x, "[A-Z0-9]{2,}", c)

2) strapplyc

There is also a high-speed version, strapplyc, in the development repo. It is specialized to the case where the function applied to each match is c.

library(gsubfn)
# download and read in strapplyc
source("http://gsubfn.googlecode.com/svn/trunk/R/strapplyc.R")
strapplyc(x, "[A-Z0-9]{2,}")

Also see this example of extracting all the words from James Joyce's Ulysses here.

Choosing

strapply has many variations, so it may be a good choice if flexibility matters most. strapplyc, on the other hand, may be particularly useful when your strings are very long, speed matters, and you only need to extract the matches.
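As a rough illustration of that flexibility, here is a minimal sketch that applies a function other than c to each match. It assumes the gsubfn package is installed; the choice of nchar and the variable name lens are only for illustration, not part of the original answer.

```r
library(gsubfn)  # assumption: gsubfn is installed

x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"

# Apply nchar() to every match instead of collecting the matches with c().
# simplify = c flattens the per-string list into a single vector.
lens <- strapply(x, "[A-Z0-9]{2,}", nchar, simplify = c)
print(lens)
```

Swapping nchar for another function (for example tolower or as.numeric) follows the same shape, which is where strapply differs from a pure extraction helper.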

Upvotes: 2

Fojtasek

Reputation: 3591

The stringr package makes this kind of thing pretty easy.

> library(stringr)
> x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
> str_extract_all(x, '[A-Z0-9]{2,}')
[[1]]
[1] "AGL"  "PGM1" "PGM2" "PGM3" "PYGL" "PYGM"
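For comparison, the same extraction can be sketched in base R with no extra packages. This is a different technique from the stringr call above: gregexpr locates every match and regmatches pulls out the matched substrings. The variable names m and ids are only illustrative.

```r
x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"

# gregexpr() returns the positions of all matches in each input string.
m <- gregexpr("[A-Z0-9]{2,}", x)

# regmatches() returns a list with one character vector per input string;
# [[1]] takes the vector for our single string.
ids <- regmatches(x, m)[[1]]
print(ids)
```

Note that the pattern also accepts runs made only of digits (such as "12"), so a stricter pattern may be needed for messier input.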

Upvotes: 3
