LatteMaster

Reputation: 165

How to use regular expressions to extract a string from a pattern in R

In R I want to extract a substring from within a pattern in a character vector. The first few entries (of 400 in total) of my character vector x look like this:

x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
  ">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

I want to extract the 4 digits following FPrate: and eventually also the letters following OMEGA: and the last 3 digits. I'm new to regular expressions and have spent hours trying to figure this out and searching the web for a solution, but with no luck.

The desired output would be:

[1] "0.000"  
[2] "0.006"  
[3] "0.018"  
[4] "0.026"  
[5] "0.154"  
[6] "1.000"    

So far I've come up with this line of code:

gsub("^[^(FPrate:)]*(FPrate:)|(\\s\\|\\sOMEGA:)[^(\\s\\|\\sOMEGA:)]*$", "", x)

which works for some of my entries, but not all.

What is the best way to achieve this?

Upvotes: 2

Views: 2191

Answers (4)

G. Grothendieck

Reputation: 269471

Base R

Here are some base R solutions:

1) If you just need the FPrate field (which is all that the question seems to ask for) then this sub will do. No packages are needed.

as.numeric(sub(".*FPrate:(\\S+) .*", "\\1", x))
## [1] 0.000 0.006 0.018 0.026 0.154 1.000
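One thing to keep in mind with 1): sub returns its input unchanged when the pattern does not match, so a malformed entry would come through whole and as.numeric would turn it into NA with a warning. A minimal sketch that makes the missing case explicit, assuming a made-up third entry with no FPrate field (it is not part of the question's data):

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551",
  ">BAD_ENTRY | OMEGA:S-000"   # hypothetical malformed entry, for illustration only
)

pat <- ".*FPrate:(\\S+) .*"

# Keep the captured group only where the pattern matched; otherwise use NA
fp <- ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)
fp_rate <- as.numeric(fp)
fp_rate
```

Whether you want NA or an error for such entries depends on how clean the 400 lines really are.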

2) If you want to parse out all name:value fields then, again using only base R, replace the leading run of non-spaces with a newline and then replace each occurrence of space-character-space (here " | ") with a newline as well. The text is now in dcf format, so read it in using read.dcf, giving the character matrix m. That may be good enough, but if you want a data frame with each column appropriately type converted then convert m to the data frame d and apply type.convert. This solution is quite general since it does not hard code FPrate and OMEGA.

s <- gsub(" . ", "\n", sub("\\S+", "\n", x))
m <- read.dcf(textConnection(s))
d <- as.data.frame(m, stringsAsFactors = FALSE)
d[] <- lapply(d, type.convert, as.is = TRUE)

giving:

> m
     FPrate  OMEGA  
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"

> d
  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551
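To see why read.dcf accepts the result in 2), it can help to look at the intermediate string. This is only a sketch on the first two entries; the name con is introduced here for illustration:

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# Step 1: the leading ID becomes a newline; step 2: each " | " becomes a newline
s <- gsub(" . ", "\n", sub("\\S+", "\n", x))

# Each element is now a blank line followed by one "name:value" pair per line
cat(s[1], "\n")

con <- textConnection(s)
m <- read.dcf(con)   # blank lines separate the records
close(con)
m
```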

3) This one uses strcapture and produces a data frame with types converted according to proto:

proto <- data.frame(FPrate = numeric(0), OMEGA = character(0))
strcapture(".*FPrate:(\\S+) . OMEGA:(\\S+)", x, proto)

giving:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551
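The question also mentions eventually wanting the letter and the digits of OMEGA separately. Under the assumption that OMEGA always has the form letters-dash-digits, 3) might be extended with a third capture group; the column names letter and number are made up for this sketch:

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# One prototype column per capture group, in the same order as the groups
proto3 <- data.frame(FPrate = numeric(0),
                     letter = character(0),
                     number = integer(0))

res <- strcapture(".*FPrate:(\\S+) . OMEGA:([A-Z]+)-(\\d+)", x, proto3)
res
```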

4) In this one we replace the colons with spaces, read in what is left with read.table, extract the columns we want and then set the column names. No regular expressions are used.

d <- read.table(text = chartr(":", " ", x), as.is = TRUE)[c(4, 7)]
names(d) <- c("FPrate", "OMEGA")

giving this data frame:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551
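The column indexes 4 and 7 in 4) are easier to follow once you look at what read.table sees after the colons are turned into spaces. A small sketch on two entries (spaced and raw are names chosen here):

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

spaced <- chartr(":", " ", x)
spaced[1]   # every field is now separated by whitespace

# Columns: V1 = ID, V2 = "|", V3 = "FPrate", V4 = value,
#          V5 = "|", V6 = "OMEGA", V7 = value
raw <- read.table(text = spaced, as.is = TRUE)
str(raw)
```

This relies on every line having exactly the same layout; a line with an extra or missing field would shift the columns.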

gsubfn

5) This solution uses the gsubfn package.

library(gsubfn)

pat <- ".*FPrate:(\\S+).*OMEGA:(\\S+)"
nms <- c("FPrate", "OMEGA")
read.pattern(text = x, pattern = pat, as.is = TRUE, col.names = nms)

giving:

  FPrate OMEGA
1  0.000 D-904
2  0.006 S-349
3  0.018 S-337
4  0.026 S-552
5  0.154 S-549
6  1.000 S-551

Upvotes: 2

Gwang-Jin Kim

Reputation: 9865

Solution using pure r-base

xx            <- strsplit(x, " \\| ")
first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
letters       <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
last.digits   <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))

Explanation

If you want to stick with r-base, I realized that gsub is very versatile in R. You can even capture groups with it.

In this example, to keep things simple, I would first strsplit the things by " | ":

xx <- strsplit(x, " \\| ", perl=TRUE)

If you look at xx now:

> xx
[[1]]
[1] ">104K_THEPA"  "FPrate:0.000" "OMEGA:D-904" 

[[2]]
[1] ">2MMP_ARATH"  "FPrate:0.006" "OMEGA:S-349" 

[[3]]
[1] ">5MMP_ARATH"  "FPrate:0.018" "OMEGA:S-337" 

[[4]]
[1] ">5NTD_DIPOM"  "FPrate:0.026" "OMEGA:S-552" 

[[5]]
[1] ">5NTD_HUMAN"  "FPrate:0.154" "OMEGA:S-549" 

[[6]]
[1] ">5NTD_MOUSE"  "FPrate:1.000" "OMEGA:S-551" 

So you can select only the second or third element of each list entry and iterate over the list using sapply (which in this case is equivalent to unlist(lapply(...))), which returns a vector at the end.

To capture the first numbers, I would do then:

first.numbers <- sapply(xx, function(x) gsub("FPrate:", "", x[2]))
first.numbers
## [1] "0.000" "0.006" "0.018" "0.026" "0.154" "1.000"

Here, I just removed "FPrate:". I could also catch the numbers through grouping. I will do it for the next captures:

letters <- sapply(xx, function(x) gsub("OMEGA:(.?)-\\d+", "\\1", x[[3]]))
letters
## [1] "D" "S" "S" "S" "S" "S"

Note that here I match the entire third element with "OMEGA:(.?)-\\d+" but capture with the grouping () only one position (zero or one character, but it will take one due to greediness). The interesting part is the replacement I give for the entire expression: "\\1", that is, whatever the first group captured. So in gsub you can use references to groups \\1, \\2, etc., depending on how many groups you added.

So we can capture the last digits:

last.digits <- sapply(xx, function(x) gsub("OMEGA:.?-(\\d+)", "\\1", x[[3]]))
last.digits
## [1] "904" "349" "337" "552" "549" "551"
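As a variation on the same idea, both pieces can be captured by one pattern and picked out with \\1 and \\2. Since sub is vectorised, no sapply is needed once the third elements are pulled out. This is just a sketch; third, letter and digits are names chosen here (letter rather than letters, so that R's built-in letters constant is not masked):

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

# Third field of every entry as a plain character vector
third <- vapply(strsplit(x, " \\| "), `[`, character(1), 3)

pat    <- "OMEGA:(.)-(\\d+)"
letter <- sub(pat, "\\1", third)   # contents of the first group
digits <- sub(pat, "\\2", third)   # contents of the second group
```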

gsub() isn't so bad after all, is it?

Upvotes: 1

divibisan

Reputation: 12155

Using the str_match function from stringr to extract only a specific part of the match (a matching group) will make your problem much easier:

stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([^ ]*)')[,c(2,3)]
     [,1]    [,2]   
[1,] "0.000" "D-904"
[2,] "0.006" "S-349"
[3,] "0.018" "S-337"
[4,] "0.026" "S-552"
[5,] "0.154" "S-549"
[6,] "1.000" "S-551"

str_match matches the regex and returns a character matrix: the first column is the whole match, while each following column is the contents of the parentheses in your regex in sequential order. So by taking the 2nd and 3rd columns, we get just the non-whitespace sequence following 'FPrate:' and following 'OMEGA:'.

You can add as many capturing groups as you want. For example, if you want to split the OMEGA into a letter and number, just use more groups:

stringr::str_match(x, 'FPrate:([^ ]*).*OMEGA:([[:alnum:]])-(\\d*)')[,c(2:4)]
     [,1]    [,2] [,3] 
[1,] "0.000" "D"  "904"
[2,] "0.006" "S"  "349"
[3,] "0.018" "S"  "337"
[4,] "0.026" "S"  "552"
[5,] "0.154" "S"  "549"
[6,] "1.000" "S"  "551"
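If you cannot install stringr, a roughly equivalent matrix can be built in base R with regexec and regmatches. This is a sketch of an alternative, not how str_match is implemented, and it assumes every entry matches (a non-matching entry yields character(0) and would silently drop out of the rbind):

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349"
)

pat <- "FPrate:([^ ]*).*OMEGA:([[:alnum:]])-(\\d*)"

# One character vector per entry: whole match first, then each group
hits <- regmatches(x, regexec(pat, x))

# Stack into a matrix and drop the whole-match column
mat <- do.call(rbind, hits)
mat[, 2:4]
```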

Upvotes: 2

hrbrmstr

Reputation: 78792

This uses the un-crutched stringi ops that stringr masks from you along with a readable/documented regex:

library(stringi)
library(tidyverse)

Your data:

c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5MMP_ARATH | FPrate:0.018 | OMEGA:S-337",
  ">5NTD_DIPOM | FPrate:0.026 | OMEGA:S-552",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
) -> xdat

The extraction:

stri_match_first_regex(
  xdat,
  "
  FPrate:([[:digit:]]\\.[[:digit:]]+) # this grabs the FPrate amount
  .*                                  # this skips a bit generically just in case it ever differs
  OMEGA:([[:alnum:]]-[[:digit:]]+)    # this grabs the OMEGA info
  ",
  opts_regex = stri_opts_regex(comments = TRUE)
)[,2:3] %>% 
  as_data_frame() %>% 
  mutate(V1 = as.numeric(V1), V2 = stri_replace_first_fixed(V2, "-", ""))
## # A tibble: 6 x 2
##      V1 V2   
##   <dbl> <chr>
## 1 0     D904 
## 2 0.006 S349 
## 3 0.018 S337 
## 4 0.026 S552 
## 5 0.154 S549 
## 6 1     S551 
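If you would rather stay in base R, a similar commented layout can be approximated with PCRE's (?x) free-spacing flag and perl = TRUE. This is a sketch of an alternative, not what the stringi call above does internally; pat and hits are names chosen for illustration:

```r
xdat <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

# (?x) makes PCRE ignore unescaped whitespace and treat "#" to end of line as a comment
pat <- "(?x)
  FPrate:(\\d\\.\\d+)        # the FPrate amount
  .*                         # skip whatever lies in between
  OMEGA:([[:alnum:]]-\\d+)   # the OMEGA info
"

hits <- regmatches(xdat, regexec(pat, xdat, perl = TRUE))

# Element 1 of each hit is the whole match; 2 and 3 are the capture groups
fp    <- as.numeric(vapply(hits, `[`, character(1), 2))
omega <- vapply(hits, `[`, character(1), 3)
```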

Also: really good attempt at the regex in the question. Regexes are not pretty and often don't make a lot of sense until you have used them for a while.
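One likely reason that attempt works for only some entries: square brackets define a character class, so [^(FPrate:)] means "any single character other than (, F, P, r, a, t, e, : or )", not "anything other than the string FPrate:". A small sketch using just the first alternative of the question's pattern (with the redundant group dropped) shows which IDs get through:

```r
x <- c(
  ">104K_THEPA | FPrate:0.000 | OMEGA:D-904",
  ">2MMP_ARATH | FPrate:0.006 | OMEGA:S-349",
  ">5NTD_HUMAN | FPrate:0.154 | OMEGA:S-549",
  ">5NTD_MOUSE | FPrate:1.000 | OMEGA:S-551"
)

# The prefix before "FPrate:" may not contain any of the characters ( F P r a t e : )
prefix_ok <- grepl("^[^(FPrate:)]*FPrate:", x)
prefix_ok
# IDs containing an uppercase P (THEPA, 2MMP) fail; HUMAN and MOUSE pass
```

Capturing the part you want with a group, as in the answers here, avoids having to describe everything you do not want.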

Upvotes: 2
