LeMarque

Reputation: 783

error while extracting text using regex in R

I have a text string as shown below:

txt = "(2) 1G–1G (0)"

And a data frame:

DF <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)'))

I am trying to extract the numbers inside the brackets, and I want the result in this format:

  2 - 0

What I am using is this:

gsub('.+\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$', '\\1 \\2', txt)

But what I get from the above is the unchanged input:

 "(2) 1G–1G (0)"

I am not sure where the mistake is. Can someone please explain why this code does not work the way I want it to?

Upvotes: 2

Views: 92

Answers (3)

Onyambu

Reputation: 79188

I do not understand why you would say it does not work. A simpler pattern gives the expected result:

sub(".*\\((\\d+).*\\((\\d+).*","\\1-\\2",DF$txt)
 [1] "2-0" "1-4" "2-0"

or even:

 transform(DF,extracted=sub(".*\\((\\d+).*\\((\\d+).*","\\1 - \\2",txt))
            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
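
As for why the pattern in the question returns the input unchanged: it starts with `.+`, which needs at least one character before the first `(`, but the strings begin with `(`. With no match, gsub() has nothing to replace. A minimal sketch of the difference (the names pat_orig and pat_fixed are only for illustration, not from the question):

```r
txt <- "(2) 1G–1G (0)"

# Pattern from the question: the leading `.+` must consume at least one
# character before the first "(", which is impossible here.
pat_orig <- '.+\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$'
grepl(pat_orig, txt)
# [1] FALSE

# Same pattern with `.*` (zero or more characters) instead of `.+`.
pat_fixed <- '.*\\(([0-9]+)\\) 1G–1G \\(([0-9]+)\\).*$'
sub(pat_fixed, '\\1 - \\2', txt)
# [1] "2 - 0"
```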

Upvotes: 1

Jan

Reputation: 43169

You could extract them using base R with regexec and regmatches like so:

(df <- data.frame(txt = c('(2) 1G–1G (0)','(1) 1G–1G (4)','(2) 1G–1G (0)', 'somejunkhere')))

getNumbers <- function(col) {
  (result <- sapply(col, function(x) {
      m <- regexec("\\((\\d+)\\)[^()]*\\((\\d+)\\)", x, perl = TRUE)
      groups <- regmatches(x, m)
      (out <- ifelse(identical(groups[[1]], character(0)),
                    NA,
                    sprintf("%s - %s", groups[[1]][2], groups[[1]][3])))
    }))
}
df$extracted <- getNumbers(df$txt)
df

This yields

            txt extracted
1 (2) 1G–1G (0)     2 - 0
2 (1) 1G–1G (4)     1 - 4
3 (2) 1G–1G (0)     2 - 0
4  somejunkhere      <NA>
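
Since regexec and regmatches are already vectorised, the same idea can probably be written without the outer sapply. A rough sketch, assuming R 3.3.0 or later for the `perl` argument of regexec (the names `m` and `extracted` are illustrative):

```r
txt <- c("(2) 1G–1G (0)", "(1) 1G–1G (4)", "somejunkhere")

# One list element per input string: c(full match, group 1, group 2),
# or character(0) when the pattern does not match.
m <- regmatches(txt, regexec("\\((\\d+)\\)[^()]*\\((\\d+)\\)", txt, perl = TRUE))

# Build "a - b" from the two groups, or NA for non-matching strings.
extracted <- vapply(
  m,
  function(g) if (length(g) == 3) paste(g[2], "-", g[3]) else NA_character_,
  character(1)
)
extracted
# [1] "2 - 0" "1 - 4" NA
```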

Upvotes: 1

Wiktor Stribiżew

Reputation: 626689

You may use

DF$txt <- trimws(gsub("[^()–]*\\(([0-9]+)\\)[^()–]*"," \\1 ",DF$txt))
## => [1] "2 – 0" "1 – 4" "2 – 0"

See the regex demo and the R demo online.

Details

  • [^()–]* - any 0+ chars other than (, ) and – (an en dash, not a hyphen)
  • \\( - a (
  • ([0-9]+) - Group 1: one or more digits
  • \\) - a ) char
  • [^()–]* - any 0+ chars other than (, ) and – (an en dash, not a hyphen)
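
To see the pieces above in action, here is a small sketch on a single string (the names `pat`, `chunks` and `step` are just for illustration). Each bracketed number is matched together with the text around it, up to but not including the en dash, so the dash is the only character left between the two replacements:

```r
txt <- "(2) 1G–1G (0)"
pat <- "[^()–]*\\(([0-9]+)\\)[^()–]*"

# The two chunks consumed by the pattern; the en dash belongs to neither.
chunks <- regmatches(txt, gregexpr(pat, txt))[[1]]
chunks
# [1] "(2) 1G" "1G (0)"

# Each chunk becomes " <digits> ", which leaves stray outer spaces ...
step <- gsub(pat, " \\1 ", txt)

# ... that trimws() removes.
trimws(step)
# [1] "2 – 0"
```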

Upvotes: 1
