user4284784

R regex to extract information from string

Someone told me R is a good tool for data processing, so I am trying to figure out whether it is possible (and easy) to do regex data extraction with R.

Below is an example in Python that extracts two key pieces of information:

import re

str = "oh, 100.0 dollar is 621.5 yuan"
m = re.search(r"([\d+\.\d+]+).*?([\d+\.\d+]+)", str)
if m:
    print(m.group(1), "->", m.group(2))

Output of Python is:

100.0 -> 621.5

That is a really cool result from Python, but how can I do the same thing efficiently in R?

Upvotes: 2

Views: 439

Answers (4)

hwnd

Reputation: 70732

Well, your regular expression is incorrect, even though it happens to match what you expect on this input. A character class defines a set of characters; it says "match one character specified by the class".

Therefore, it matches the following:

[\d+\.\d+]+   # any character of: digits (0-9), '+', '\.', digits (0-9), '+' 
              # (1 or more times)
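
A quick way to see the problem is to feed the class some strings that are not numbers. This is a minimal sketch in Python, the question's language; the name loose is only for illustration:

```python
import re

# The character class from the question: any run of digits, '+' or '.'
loose = re.compile(r"[\d+\.\d+]+")

# It accepts a real decimal number...
print(loose.fullmatch("100.0") is not None)  # True

# ...but also strings that are clearly not numbers.
print(loose.fullmatch("+.+") is not None)    # True
print(loose.fullmatch("1.2.3") is not None)  # True
```

The question's input only works because it contains no stray '+' or '.' characters next to the numbers.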

In base R you could use regmatches and gregexpr with the pattern below:

x <- 'oh, 100.0 dollar is 621.5 yuan'
m <- regmatches(x, gregexpr('\\d+(?:\\.\\d+)?', x, perl = TRUE))[[1]]
paste(m[1], '->', m[2])
# [1] "100.0 -> 621.5"

Regular expression explained:

\d+           # digits (0-9) (1 or more times)
(?:           # group, but do not capture (optional):
  \.          #   '.'
  \d+         #   digits (0-9) (1 or more times)
)?            # end of grouping
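
For comparison, the same pattern can be tried in Python, the question's language. This is only a sketch; the names number and sample are illustrative, and the sample string simply mixes integers and decimals:

```python
import re

# Same pattern as above: digits, optionally followed by '.' and more digits.
number = re.compile(r"\d+(?:\.\d+)?")

sample = "oh, 100.0 dollar is 621.5 yuan 700 to 888.78"
found = number.findall(sample)
print(found)  # ['100.0', '621.5', '700', '888.78']

# Unlike the character class, it rejects strings that are not numbers.
print(number.fullmatch("1.2.3"))  # None
```

Because the group is optional, plain integers such as 700 are picked up as well as decimals.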

Upvotes: 6

G. Grothendieck

Reputation: 269860

Here are some approaches. Others are possible too with a variety of other packages.

1) It can be done in one line with strapply (although we will break it into two for readability). strapply applies the pattern pat to the string str, passes the captured strings to the function (expressed here in formula notation), and returns the result:

library(gsubfn)

# test data
str <- "oh, 100.0 dollar is 621.5 yuan"

pat <- "([\\d+\\.\\d+]+).*?([\\d+\\.\\d+]+)"   
strapply(str, pat, ~ paste(x, "->", y), simplify = TRUE)

giving:

[1] "100.0 -> 621.5"

Note that we used the same regex as in the question to show that the Python regex works in R too (although we need to double the backslashes when writing it out, since "\\" in an R string literal represents a single backslash). However, we could simplify the regex slightly by using this instead:

pat <- "(\\d+\\.\\d+).*?(\\d+\\.\\d+)"   

or possibly this would be sufficient:

pat <- "([\\d.]+).*?([\\d.]+)"

In the subsequent points we use even simpler regular expressions.

2) We could also simplify the pattern as follows and use strapplyc from the same package:

s <- strapplyc(str, "\\d+\\.\\d+")[[1]]
paste(s[1], "->", s[2])

giving the same answer.

3) A different approach is to split the input into words and then keep only the words that contain a number. This one does not use any packages:

g <- grep("\\d+\\.\\d+", strsplit(str, " ")[[1]], value = TRUE)
paste(g[1], "->", g[2])

giving the same answer.
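
The split-and-filter idea in 3) is language independent. As a rough sketch in Python, the question's language (the names text, words and kept are illustrative only):

```python
import re

text = "oh, 100.0 dollar is 621.5 yuan"

# Split on spaces, then keep only the words that contain a decimal number.
words = text.split(" ")
kept = [w for w in words if re.search(r"\d+\.\d+", w)]
print(" -> ".join(kept[:2]))  # 100.0 -> 621.5

# Caveat: the whole word is kept, so attached punctuation survives.
with_comma = [w for w in "pay 100.0, now".split(" ") if re.search(r"\d+\.\d+", w)]
print(with_comma)  # ['100.0,']
```

The same caveat applies to grep with value = TRUE, which returns the whole matching element rather than just the matched part.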

Upvotes: 3

Avinash Raj

Reputation: 174786

Here is one approach that uses a series of gsub calls.

> str = "oh, 100.0 dollar is 621.5 yuan"
> sub("[[:space:]]+", " -> ", gsub("^[[:space:]]+|[[:space:]]+$", "", gsub("(\\d+(?:\\.\\d+)?)|\\S", '\\1', str, perl=T)))
[1] "100.0 -> 621.5"

Try this if the input contains more than two numbers. I just replaced the outer sub call above with gsub.

> str = "oh, 100.0 dollar is 621.5 yuan 700 to 888.78"
> gsub("[[:space:]]+", " -> ", gsub("^[[:space:]]+|[[:space:]]+$", "", gsub("(\\d+(?:\\.\\d+)?)|\\S", '\\1', str, perl=T)))
[1] "100.0 -> 621.5 -> 700 -> 888.78"

[[:space:]]+ uses the POSIX character class [:space:] and matches one or more whitespace characters.
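
The nested calls are easier to follow when the steps are separated. Here is a sketch of the same idea in Python, the question's language; the names text, stripped and result are illustrative, and a replacement function stands in for the '\\1' backreference:

```python
import re

text = "oh, 100.0 dollar is 621.5 yuan 700 to 888.78"

# Step 1: keep numbers (group 1) and delete every other non-space character.
# When the \S branch matches, group 1 is None, so we substitute "".
stripped = re.sub(r"(\d+(?:\.\d+)?)|\S", lambda m: m.group(1) or "", text)

# Steps 2 and 3: drop surrounding whitespace, then join what is left with " -> ".
result = " -> ".join(stripped.split())
print(result)  # 100.0 -> 621.5 -> 700 -> 888.78
```

After step 1 only the numbers and the original spaces remain, which is why the remaining steps only have to deal with whitespace.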

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1

Reputation: 193637

Sure. Something like this is also easily possible either with base R or with one of its many packages. Here's an example with the "stringi" package.

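library(stringi)
# input string from the question
str <- "oh, 100.0 dollar is 621.5 yuan"
# \\d+ after the dot keeps all decimal digits, not just the first
m <- stri_extract_all_regex(str, "\\d+\\.\\d+")[[1]]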
sprintf("%s -> %s", m[1], m[2])
# [1] "100.0 -> 621.5"

A base R equivalent of the above might be to use gregexpr and regmatches:

regmatches(str, gregexpr("\\d+\\.\\d+", str))[[1]]
# [1] "100.0" "621.5"

Upvotes: 6
