Reputation:
Someone told me R is a good tool for data processing, so I am trying to figure out whether it's possible (and easy) to do regex data extraction with R.
Below is an example in Python that extracts two key pieces of information:
import re
str = "oh, 100.0 dollar is 621.5 yuan"
# capture the first two number-like runs in the string
m = re.search(r"([\d+\.\d+]+).*?([\d+\.\d+]+)", str)
if m:
    print(m.group(1), "->", m.group(2))
Output of Python is:
100.0 -> 621.5
This works nicely in Python, but how can I do it efficiently in R?
Upvotes: 2
Views: 439
Reputation: 70732
Well, your regular expression is incorrect and does not do what you expect, even though it happens to produce the right result on this input. A character class defines a set of characters; it says "match one character from this set".
Therefore, it matches the following:
[\d+\.\d+]+ # any character of: digits (0-9), '+', '\.', digits (0-9), '+'
# (1 or more times)
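To make the point concrete, here is a minimal sketch in Python (the language of the question) comparing the character class with a stricter pattern. The sample strings and the variable names `loose` and `strict` are made up for illustration, and the stricter pattern is only an assumption about what was intended.

```python
import re

# The character class from the question: a set containing digits, '+' and '.'
loose = re.compile(r"[\d+\.\d+]+")

# A stricter pattern (assumed intent): digits, then an optional fractional part
strict = re.compile(r"\d+(?:\.\d+)?")

# Illustrative inputs; only the first one is a well-formed number
samples = ["100.0", "+.+", "1..2", "..."]
for s in samples:
    # fullmatch succeeds only if the whole string fits the pattern
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
```

The character class accepts all four strings, while the stricter pattern accepts only "100.0".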
Using base R, you could use regmatches and gregexpr with the pattern below:
x <- 'oh, 100.0 dollar is 621.5 yuan'
m <- regmatches(x, gregexpr('\\d+(?:\\.\\d+)?', x, perl=TRUE))[[1]]
paste(m[1], '->', m[2])
# [1] "100.0 -> 621.5"
Regular expression explained:
\d+ # digits (0-9) (1 or more times)
(?: # group, but do not capture (optional):
\. # '.'
\d+ # digits (0-9) (1 or more times)
)? # end of grouping
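For readers coming from the question's Python, the same pattern can be tried there as well. This is only a sketch for comparison, not part of the R solution; the variable names `text`, `numbers` and `result` are chosen here for illustration.

```python
import re

text = "oh, 100.0 dollar is 621.5 yuan"

# Because the group is non-capturing, findall returns the full matches
numbers = re.findall(r"\d+(?:\.\d+)?", text)

# Join the first two matches in the same way as the R paste() call
result = " -> ".join(numbers[:2])
print(result)
```

Since the fractional part is optional, this pattern would also pick up integers such as 700.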
Upvotes: 6
Reputation: 269860
Here are some approaches. Others are possible too with a variety of other packages.
1) It can be done in one line with strapply (although we will break it into two for readability). strapply applies the pattern pat to the string str, passes the captured strings to the function (expressed here in formula notation), and returns the result:
library(gsubfn)
# test data
str <- "oh, 100.0 dollar is 621.5 yuan"
pat <- "([\\d+\\.\\d+]+).*?([\\d+\\.\\d+]+)"
strapply(str, pat, ~ paste(x, "->", y), simplify = TRUE)
giving:
[1] "100.0 -> 621.5"
Note that we used the same regex as in the question to show that the Python regex works in R too (although we need to double the backslashes when writing it out, since "\\" in an R string literal represents one backslash). However, we could simplify the regex slightly by using this instead:
pat <- "(\\d+\\.\\d+).*?(\\d+\\.\\d+)"
or possibly this would be sufficient:
pat <- "([\\d.]+).*?([\\d.]+)"
In the subsequent points we use even simpler regular expressions.
2) We could also simplify the pattern like this, in which case strapplyc from the same package works:
s <- strapplyc(str, "\\d+\\.\\d+")[[1]]
paste(s[1], "->", s[2])
giving the same answer.
3) A different approach is to split the input into words then keep only the words that represent numbers. This one does not use any packages:
g <- grep("\\d+\\.\\d+", strsplit(str, " ")[[1]], value = TRUE)
paste(g[1], "->", g[2])
giving the same answer.
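One caveat with the split-and-filter idea is that it keeps whole words, so any punctuation attached to a number is kept too. The sketch below shows this in Python rather than R, to match the question's language; the input string is a made-up example, and the names `kept` and `extracted` are hypothetical.

```python
import re

# Made-up input where a comma is attached to the first number
text = "pay 100.0, get 621.5 yuan"

# Split on spaces and keep the words that contain a decimal number
words = text.split(" ")
kept = [w for w in words if re.search(r"\d+\.\d+", w)]
print(kept)  # the first element still carries its trailing comma

# Extracting the matches directly avoids the attached punctuation
extracted = re.findall(r"\d+\.\d+", text)
print(extracted)
```

On the sample string in the question this makes no difference, because neither number has punctuation attached.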
Upvotes: 3
Reputation: 174786
Here's one that uses a series of gsub calls.
> str = "oh, 100.0 dollar is 621.5 yuan"
> sub("[[:space:]]+", " -> ", gsub("^[[:space:]]+|[[:space:]]+$", "", gsub("(\\d+(?:\\.\\d+)?)|\\S", '\\1', str, perl=T)))
[1] "100.0 -> 621.5"
Try this if the input contains more than two numbers. I just replaced the outer sub call in the above with gsub:
> str = "oh, 100.0 dollar is 621.5 yuan 700 to 888.78"
> gsub("[[:space:]]+", " -> ", gsub("^[[:space:]]+|[[:space:]]+$", "", gsub("(\\d+(?:\\.\\d+)?)|\\S", '\\1', str, perl=T)))
[1] "100.0 -> 621.5 -> 700 -> 888.78"
[[:space:]]+ is a POSIX character class followed by a quantifier; it matches one or more whitespace characters.
Upvotes: 0
Reputation: 193637
Sure. Something like this is also easily possible either with base R or with one of its many packages. Here's an example with the "stringi" package.
library(stringi)
m <- stri_extract_all_regex(str, "\\d+\\.\\d+")[[1]]
sprintf("%s -> %s", m[1], m[2])
# [1] "100.0 -> 621.5"
A base R equivalent of the above might be to use gregexpr and regmatches:
regmatches(str, gregexpr("\\d+\\.\\d+", str))[[1]]
# [1] "100.0" "621.5"
Upvotes: 6