userX

Reputation: 315

Regex: extracting a decimal number preceded by a pattern in R

Not sure what I am doing wrong here. I have lines in a text file, and the target lines look like this:

  • Nsource.Inhibitor 3 81.63 27.21 1.84 0.008
  • Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001

I want to extract the 0.008 and <0.001 from the end.

However, the file also contains other lines like the ones below, so the first part of the line has to be part of the pattern:

  • Nsource 1 1238.10 1238.10 40.29 <.001
  • Inhibitor 3 1484.41 494.80 16.10 <.001

My attempt

reline <- "+ Nsource.Inhibitor   3   81.63   27.21   1.84    0.008"
decnum <- "[[:digit:]]+\\.*[[:digit:]]*"
chk <- paste0("+ Nsource.Inhibitor[:blank:]+", decnum, "[:blank:]+", decnum, "[:blank:]+", decnum, "[:blank:]+", decnum,
       "[:blank:]+", "([[:digit:]]+\\.*[[:digit:]]*)")
gsub(chk, "\\1",reline)

returns:

"+ Nsource.Inhibitor\t 3\t 81.63\t 27.21\t 1.84\t 0.008"

Thanks for your help.

Matt

Upvotes: 1

Views: 864

Answers (4)

Otto Kässi

Reputation: 3093

strings <- c("Nsource.Inhibitor 3 81.63 27.21 1.84 0.008", "Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001",  "Nsource 1 1238.10 1238.10 40.29 <.001", "Inhibitor 3 1484.41 494.80 16.10 <.001")

The expression below uses grep to pick out the strings that contain the substring 'Nsource.Inhibitor', splits each of them on ' ', and returns the 6th element of each split string.

sapply(strsplit(strings[grep('Nsource.Inhibitor', strings)], ' '), '[[',6)
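If the real lines are separated by runs of spaces or tabs (as the `reline` in the question appears to be), splitting on a single ' ' leaves empty fields and the 6th element is no longer the last number. A minimal sketch of one way around that, assuming the wanted value is always the last whitespace-separated field; the `lines` vector is made-up sample input:

```r
# Hypothetical input with irregular spacing, similar to the question's `reline`
lines <- c("Nsource.Inhibitor   3   81.63   27.21   1.84    0.008",
           "Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001",
           "Nsource 1 1238.10 1238.10 40.29 <.001")

# fixed = TRUE treats the "." literally instead of as a regex wildcard
target <- lines[grepl("Nsource.Inhibitor", lines, fixed = TRUE)]

# Split on runs of whitespace and keep the last field of each line
fields <- strsplit(trimws(target), "[[:space:]]+")
pvals <- vapply(fields, function(x) x[length(x)], character(1))
pvals
```

Taking the last field rather than the 6th also keeps working if a line carries a leading marker such as "+ ".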

Upvotes: 1

Roland

Reputation: 132999

There is no reason for using regex here. Simply read the file as a data.frame and do simple subsetting:

DF <- read.table(text = "Nsource.Inhibitor 3 81.63 27.21 1.84 0.008
           Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001
           Nsource 1 1238.10 1238.10 40.29 <.001
           Inhibitor 3 1484.41 494.80 16.10 <.001", stringsAsFactors = FALSE) #you can read from file directly

DF[DF$V1 == "Nsource.Inhibitor", ncol(DF)]
#[1] "0.008"  "<0.001"
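Reading from a file directly could look like the sketch below. The temporary file only stands in for the real text file, and it assumes every line has the same number of whitespace-separated fields; if the real file has other kinds of lines, read.table would need extra arguments or the lines would need filtering first.

```r
# Hypothetical file standing in for the real text file
path <- tempfile(fileext = ".txt")
writeLines(c("Nsource.Inhibitor 3 81.63 27.21 1.84 0.008",
             "Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001",
             "Nsource 1 1238.10 1238.10 40.29 <.001",
             "Inhibitor 3 1484.41 494.80 16.10 <.001"), path)

# The last column is read as character because of the "<" entries
DF <- read.table(path, stringsAsFactors = FALSE)

# Keep the last column of the rows whose first field is the target label
res <- DF[DF$V1 == "Nsource.Inhibitor", ncol(DF)]
res
```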

Upvotes: 1

Jan

Reputation: 43199

Something like this?

library(stringr)
strings <- c("Nsource.Inhibitor 3 81.63 27.21 1.84 0.008", "Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001", 
             "Nsource 1 1238.10 1238.10 40.29 <.001", "Inhibitor 3 1484.41 494.80 16.10 <.001")

str_match(strings, "(?=^Nsource.Inhibitor).*?(<?\\d+\\.\\d+)$")[,2]

This yields

[1] "0.008"  "<0.001" NA       NA      

The lookahead ensures that the string starts with Nsource.Inhibitor, and only then does the pattern match the last \d+\.\d+ number on that line (optionally preceded by <).
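The same idea can be sketched in base R with regexec and regmatches, for anyone who prefers not to load stringr. This is a variation rather than the code above: the dot in the label is escaped so it only matches a literal ".", and the label is anchored with ^ instead of a lookahead.

```r
strings <- c("Nsource.Inhibitor 3 81.63 27.21 1.84 0.008",
             "Nsource.Inhibitor 3 90.31 17.21 0.84 <0.001",
             "Nsource 1 1238.10 1238.10 40.29 <.001",
             "Inhibitor 3 1484.41 494.80 16.10 <.001")

# Label at the start, then the shortest run that leaves a trailing number
pat <- "^Nsource\\.Inhibitor.*?(<?\\d+\\.\\d+)$"
m <- regmatches(strings, regexec(pat, strings, perl = TRUE))

# Each match holds the full match plus the capture group; non-matches are empty
res <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
              character(1))
res
```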

Upvotes: 1

R. Schifini

Reputation: 9313

If your target lines contain "Nsource.Inhibitor" and the last character is a number, and you want to extract all the characters after the last space, then try:

gsub(".*Nsource\\.Inhibitor.*\\s(.*[0-9])$", "\\1", reline)

You could add ignore.case = T if Nsource or Inhibitor appear without caps.

Examples:

> reline <- "+ Nsource.Inhibitor   3   81.63   27.21   1.84    <0.008"
> output <- gsub(".*Nsource\\.Inhibitor.*\\s(.*[0-9])$", "\\1", reline, ignore.case = T)
> output
[1] "<0.008"

> reline <- "+ Nsource.Inhibitor   3   81.11  27  1232   23  123111  55.5555  0.38"
> output <- gsub(".*Nsource\\.inhibitor.*\\s(.*[0-9])$", "\\1", reline, ignore.case = T)
> output
[1] "0.38"
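One thing to keep in mind: when the pattern does not match, gsub returns the input string unchanged, so non-target lines come back whole rather than as missing values. A minimal sketch that guards against this with grepl; the helper name `extract_last` is made up for illustration:

```r
# Hypothetical helper: returns the captured value, or NA when the line
# does not match the pattern at all
extract_last <- function(x, pattern = ".*Nsource\\.Inhibitor.*\\s(.*[0-9])$") {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

lines <- c("+ Nsource.Inhibitor   3   81.63   27.21   1.84    <0.008",
           "Nsource 1 1238.10 1238.10 40.29 <.001")
out <- extract_last(lines)
out
```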

Upvotes: 1
