screechOwl

Reputation: 28159

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to flag whether each one contains a value within a given range. This is how I'm trying to do it, but I'm not getting the results I'd like:

library(stringi)

df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')

df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')


  id string1 pctOff10_20
1  1 15% off          NA
2  2 25% off          NA
3  3 35% off          NA
4  4 45% off          NA
5  5 55% off          NA

I'd like something like this:

  id string1 pctOff10_20
1  1 15% off          1
2  2 25% off          0
3  3 35% off          0
4  4 45% off          0
5  5 55% off          0

Upvotes: 1

Views: 286

Answers (2)

G. Grothendieck

Reputation: 269644

1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternatively, we could make use of between from data.table (or a similar function from a number of other packages). This extracts the matches to the pattern, applies the function to each match and returns the result, simplified into a vector (rather than a list):

library(gsubfn)

# Returns 1 if a <= x <= b (after converting x to numeric), otherwise 0
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)

transform(df1, pctOff10_20 = 
   strapply(
      X = string1, 
      pattern = "\\d+", 
      FUN = ~ btwn(x, 10, 20),
      simplify = TRUE
   )
)

2) A base solution using the same btwn function defined above is:

transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))
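
Note that gsub("\\D", "", string1) keeps every digit in the string, so a subject such as "Save 20% on 3 items" would become "203". As a rough sketch of one way around that, assuming the value of interest is always the run of digits directly before the "%" sign, the number can be extracted with a lookahead and then compared numerically. The helper name pct_in_range is hypothetical and not part of any package:

```r
# Hypothetical helper (not from any package): flag strings whose percentage,
# taken as the digits directly before '%', lies in [lo, hi].
pct_in_range <- function(x, lo, hi) {
  # Position of the first run of digits followed by '%'; -1 where none exists
  hit <- regexpr("\\d+(?=%)", x, perl = TRUE)
  pct <- rep(NA_real_, length(x))
  # regmatches() returns only the matched pieces, one per element with a match
  pct[hit > 0] <- as.numeric(regmatches(x, hit))
  as.integer(!is.na(pct) & pct >= lo & pct <= hi)
}

df1 <- data.frame(id = 1:5,
                  string1 = c('15% off', '25% off', '35% off', '45% off', '55% off'))
df1$pctOff10_20 <- pct_in_range(df1$string1, 10, 20)
```

Strings with no percentage at all come back as 0 rather than NA; whether that is the right choice depends on how missing values should be treated downstream.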

Upvotes: 1

Cath

Reputation: 24074

Here is one way to do it, given that the percentage appears at the start of each string:

df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')

Explanation:

^                        the beginning of the string
(                        group and capture to \1:
  1                        '1'
  \d                       digits (0-9)
 |                        OR
  20                       '20'
)                        end of \1
%                        '%'
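
For comparison, here is a small base R sketch of the same pattern, together with a check of what the original '[10-20]%' actually matches. It assumes the percentage sits at the start of each string, as in the sample data; the variable names are only for illustration:

```r
subjects <- c("15% off", "25% off", "35% off", "45% off", "55% off")

# Base R counterpart of the stringi call: 1 when the string starts with 10%-20%
in_range <- as.integer(grepl("^(1\\d|20)%", subjects, perl = TRUE))

# '[10-20]' is a character class, not a numeric range: it contains the
# literals 1 and 0 plus the range 0-2, i.e. the single characters 0, 1 and 2.
# '[10-20]%' therefore matches any one of those characters directly before '%'.
class_hits <- grepl("[10-20]%", c("15% off", "12% off", "30% off"), perl = TRUE)
```

Here "15% off" is not matched by the character-class version because the character before "%" is 5, while "12% off" and "30% off" are matched because they end in 2 and 0.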

Upvotes: 3
