Reputation: 3931
Say I have a string like this:
[1] "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
I want to extract each of these gene IDs into a vector. I could probably use strsplit in this case, but I want to do this with a regex, as I will have more complex cases later. Say I want to extract all substrings that match '[A-Z0-9]{2,}' (that is, any run of at least two capital letters or digits).
Thoughts?
Upvotes: 0
Views: 797
Reputation: 269644
1) strapply
strapply in the gsubfn package can do that:
library(gsubfn)
x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
strapply(x, "[A-Z0-9]{2,}", c)
2) strapplyc
There is also a high-speed version, specialized to use c, in the development repo.
library(gsubfn)
# download and read in strapplyc
source("http://gsubfn.googlecode.com/svn/trunk/R/strapplyc.R")
strapplyc(x, "[A-Z0-9]{2,}")
Also see this example of extracting all the words from James Joyce's Ulysses here.
Choosing
strapply has many variations, so it might be a good choice if flexibility matters most. On the other hand, strapplyc might be particularly useful if your strings are very long, so that speed is important, and you only need to extract strings.
Upvotes: 2
Reputation: 3591
The stringr package makes this kind of thing pretty easy.
> library(stringr)
> x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
> str_extract_all(x, '[A-Z0-9]{2,}')
[[1]]
[1] "AGL" "PGM1" "PGM2" "PGM3" "PYGL" "PYGM"
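For comparison, base R can probably get you the same result without an extra package. This is only a minimal sketch using gregexpr and regmatches, not something from the answers here; the variable names pattern and ids are just illustrative. Note that regmatches returns a list with one element per input string, so [[1]] is used to pull out the plain character vector the question asks for.

```r
# Same input string and pattern as in the question
x <- "<u>Degradation:</u> AGL, PGM1, PGM2, PGM3, PYGL, PYGM.<br>\n"
pattern <- "[A-Z0-9]{2,}"

# gregexpr finds the positions of all matches;
# regmatches extracts the matching substrings as a list (one element per input)
ids <- regmatches(x, gregexpr(pattern, x))[[1]]

print(ids)
```

Keep in mind that this pattern would also match a run of digits alone (such as "12"), so more complex inputs may need a stricter expression.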
Upvotes: 3