Reputation: 165
In R I want to extract a substring from within a pattern in a character
vector. The first few entries (400 in total) of my character vector x
look like this:
x <- c(
">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)
I want to extract the 4 digits following FPrate: and eventually also the letters following OMEGA: and the last 3 digits.
I'm new to using regular expressions and have spent hours trying to figure this out and searching the web for a solution, but with no luck.
The desired output would be:
[1] "0.000"
[2] "0.006"
[3] "0.018"
[4] "0.026"
[5] "0.154"
[6] "1.000"
So far I've come up with this line of code:
gsub("^[^(FPrate:)]*(FPrate:)|(\\s\\|\\sOMEGA:)[^(\\s\\|\\sOMEGA:)]*$", "", x)
which works for some of my entries, but not all.
What is the best way to achieve this?
Upvotes: 2
Views: 2191
Reputation: 269471
Here are some base R solutions:
1) If you just need the FPrate field (which is all that the question seems to ask for) then this sub will do. No packages are needed.
as.numeric(sub(".*FPrate:(\\S+) .*", "\\1", x))
## [1] 0.000 0.006 0.018 0.026 0.154 1.000
2) If you want to parse out all name:value fields then, again using only base R, replace the leading non-spaces with a newline and then replace each occurrence of space-character-space with a newline as well. The result is in dcf format, so read it in using read.dcf, giving the character matrix m. That may be good enough, but if you want a data frame with each column appropriately type converted then convert m to the data frame d and apply type.convert. This solution is quite general since it does not hard code FPrate and OMEGA.
s <- gsub(" . ", "\n", sub("\\S+", "\n", x))
m <- read.dcf(textConnection(s))
d <- as.data.frame(m, stringsAsFactors = FALSE)
d[] <- lapply(d, type.convert, as.is = TRUE)  # as.is = TRUE keeps OMEGA as character
giving:
> m
FPrate OMEGA
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"
> d
FPrate OMEGA
1 0.000 D-904
2 0.006 S-349
3 0.018 S-337
4 0.026 S-552
5 0.154 S-549
6 1.000 S-551
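To see why read.dcf can parse the result in 2), it may help to look at the intermediate string for a single entry. This is only an illustrative sketch using the first element of x from the question; the names x1, step1 and s1 are made up for the example.

```r
x1 <- ">104K_THEPA | FPrate:0.000 | OMEGA:D-904"

# Step 1: replace the leading run of non-spaces (the ID) with a newline.
step1 <- sub("\\S+", "\n", x1)

# Step 2: replace every space-character-space (here " | ") with a newline.
s1 <- gsub(" . ", "\n", step1)

# The record now begins with blank lines (the dcf record separator),
# followed by one name:value pair per line.
cat(s1, "\n")
```

Because the ID is thrown away in step 1, only the name:value fields end up in m.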
3) This one uses strcapture and produces a data frame with types converted according to proto:
proto <- data.frame(FPrate = numeric(0), OMEGA = character(0))
strcapture(".*FPrate:(\\S+) . OMEGA:(\\S+)", x, proto)
giving:
FPrate OMEGA
1 0.000 D-904
2 0.006 S-349
3 0.018 S-337
4 0.026 S-552
5 0.154 S-549
6 1.000 S-551
4) In this one we replace the colons with spaces, read in what is left with read.table, extract the columns we want and then set the column names. No regular expressions are used.
d <- read.table(text = chartr(":", " ", x), as.is = TRUE)[c(4, 7)]
names(d) <- c("FPrate", "OMEGA")
giving this data frame:
FPrate OMEGA
1 0.000 D-904
2 0.006 S-349
3 0.018 S-337
4 0.026 S-552
5 0.154 S-549
6 1.000 S-551
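The column positions 4 and 7 in 4) follow from how the line splits on whitespace once the colons are gone. A small sketch for one entry (x1 and fields are names made up for the illustration; strsplit is used here only to mimic the whitespace splitting that read.table does):

```r
x1 <- ">104K_THEPA | FPrate:0.000 | OMEGA:D-904"

# Turn each ":" into a space, then split on runs of spaces.
fields <- strsplit(chartr(":", " ", x1), " +")[[1]]

# Seven whitespace-separated fields: the ID, "|", "FPrate", the rate,
# "|", "OMEGA" and the OMEGA value.
fields

# The values of interest sit in positions 4 and 7.
fields[c(4, 7)]
```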
5) This solution uses the gsubfn package.
library(gsubfn)
pat <- ".*FPrate:(\\S+).*OMEGA:(\\S+)"
nms <- c("FPrate", "OMEGA")
read.pattern(text = x, pattern = pat, as.is = TRUE, col.names = nms)
giving:
FPrate OMEGA
1 0.000 D-904
2 0.006 S-349
3 0.018 S-337
4 0.026 S-552
5 0.154 S-549
6 1.000 S-551
Upvotes: 2
Reputation: 9865
Solution using only base R
xx <- strsplit(x, " \\| ")
first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
letters <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
last.digits <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))
Explanation
If you want to stick with base R: I realized that gsub is very versatile in R. You can even capture groups with it.
In this example, to keep things simple, I would first strsplit the strings by " | ":
xx <- strsplit(x, " \\| ", perl=TRUE)
If you look at xx now:
> xx
[[1]]
[1] ">104K_THEPA" "FPrate:0.000" "OMEGA:D-904"
[[2]]
[1] ">2MMP_ARATH" "FPrate:0.006" "OMEGA:S-349"
[[3]]
[1] ">5MMP_ARATH" "FPrate:0.018" "OMEGA:S-337"
[[4]]
[1] ">5NTD_DIPOM" "FPrate:0.026" "OMEGA:S-552"
[[5]]
[1] ">5NTD_HUMAN" "FPrate:0.154" "OMEGA:S-549"
[[6]]
[1] ">5NTD_MOUSE" "FPrate:1.000" "OMEGA:S-551"
So you could select only the second or third elements and iterate over the list using sapply (which in this case is equivalent to unlist(lapply(...))), which returns a vector at the end.
To capture the first numbers, I would then do:
first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
first.numbers
## [1] "0.000" "0.006" "0.018" "0.026" "0.154" "1.000"
Here, I just removed "FPrate:". I could also catch the numbers through grouping. I will do it for the next captures:
letters <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
letters
## [1] "D" "S" "S" "S" "S" "S"
Note that here I match the entire third element with "OMEGA:(.?)-\\d+" but capture, with the grouping (), only one position (zero or one character, but it will take one due to greediness).
The interesting part is what I give as the replacement for the entire expression: "\\1", i.e. whatever was captured by the first group. So in gsub you can use references to groups, \\1, \\2, etc., depending on how many groups you added.
So we can capture the last digits:
last.digits <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))
last.digits
## [1] "904" "349" "337" "552" "549" "551"
gsub() isn't so bad after all, is it?
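If you prefer to get all the groups in one pass instead of one gsub call per group, base R's regexec together with regmatches returns the full match plus every captured group. This is an alternative technique, not the approach above; the pattern and the column names below are my own choices for this data.

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

# Three groups: the FPrate value, the OMEGA letter, the OMEGA digits.
pat <- "FPrate:(\\S+) \\| OMEGA:(.)-(\\d+)"

# regexec() records the positions of the match and of each group;
# regmatches() turns them into one character vector per element of x.
m <- regmatches(x, regexec(pat, x))

# Stack into a matrix and drop column 1 (the whole match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("FPrate", "letter", "digits")
parts
```

Entries that do not match yield character(0) and are silently dropped by rbind, so it is worth checking nrow(parts) against length(x) on the full data.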
Upvotes: 1
Reputation: 12155
Using the str_match function from stringr to extract only a specific part of the match (a capturing group) will make your problem much easier:
stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([^ ]*)')[,c(2,3)]
[,1] [,2]
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"
str_match matches the regex and returns a character matrix: the first column is the whole match, while each following column is the contents of the parentheses in your regex, in sequential order. So by taking the 2nd and 3rd columns, we get just the non-space sequences following 'FPrate:' and 'OMEGA:'.
You can add as many capturing groups as you want. For example, if you want to split the OMEGA field into a letter and a number, just use more groups:
stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([[:alnum:]])-(\\d*)')[,c(2:4)]
[,1] [,2] [,3]
[1,] "0.000" "D" "904"
[2,] "0.006" "S" "349"
[3,] "0.018" "S" "337"
[4,] "0.026" "S" "552"
[5,] "0.154" "S" "549"
[6,] "1.000" "S" "551"
Upvotes: 2
Reputation: 78792
This uses the un-crutched stringi ops that stringr masks from you, along with a readable/documented regex:
library(stringi)
library(tidyverse)
Your data:
c(
">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
) -> xdat
The extraction:
stri_match_first_regex(
xdat,
"
FPrate:([[:digit:]]\\.[[:digit:]]+) # this grabs the FPrate amount
.* # this skips a bit generically just in case it ever differs
OMEGA:([[:alnum:]]-[[:digit:]]+) # this grabs the OMEGA info
",
opts_regex = stri_opts_regex(comments = TRUE)
)[,2:3] %>%
as_data_frame() %>%
mutate(V1 = as.numeric(V1), V2 = stri_replace_first_fixed(V2, "-", ""))
## # A tibble: 6 x 2
## V1 V2
## <dbl> <chr>
## 1 0 D904
## 2 0.006 S349
## 3 0.018 S337
## 4 0.026 S552
## 5 0.154 S549
## 6 1 S551
Also: really good attempt at the regex in the question. Regexes are not pretty and often don't make a lot of sense until you've used them for a while.
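One likely reason the attempt only works for some entries: inside square brackets, (FPrate:) is not a group but a set of single characters, so [^(FPrate:)]* stops at the first "(", "F", "P", "r", "a", "t", "e", ":" or ")" it meets. A rough sketch that checks just the first half of the question's pattern against the sample data (first_alt is a name made up for the illustration):

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
  ">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

# Does the "strip everything up to FPrate:" half of the pattern match?
# It cannot get past a "P" in the ID, because "P" is in the excluded set.
first_alt <- grepl("^[^(FPrate:)]*FPrate:", x)

data.frame(id = sub(" .*", "", x), prefix_matched = first_alt)
```

On this sample only the IDs without a "P" (5NTD_HUMAN and 5NTD_MOUSE) match, which is consistent with "works for some of my entries, but not all".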
Upvotes: 2