Jianan He

Reputation: 31

Regular expression pattern: get the number before specific words with gsub

I just started learning regular expressions and am stuck on one problem. I have a dataset with one column containing movie award information.

**Award** 
    Won 2 Oscars. Another 7 wins & 37 nominations.
    6 wins& 30 nominations
    5 wins
    Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.

I would like to pull out the numbers before "wins" and "nominations" and add a column for each. For example, for the first row, it would be 7 for the win column and 37 for the nomination column.

The pattern I use is

df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards)

It is not working well. I'm not sure how to write the pattern for "wins". :( Can anyone please help?

Thanks a lot!

Upvotes: 0

Views: 556

Answers (2)

Hardik Gupta

Reputation: 4790

We can use str_extract with a lookahead in the regular expression to get the values

library(stringr)
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
          "6 wins& 30 nominations",
          "5 wins",
          "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)

df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)")
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)")

> df
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
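
For anyone without stringr, roughly the same lookahead can be run in base R with `regexpr(..., perl = TRUE)`. This is only a minimal sketch: the helper name `number_before` is made up for illustration, and it returns integers rather than the character values that `str_extract` produces.

```r
# Sample strings copied from the question
awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

# Hypothetical helper: first run of digits that is followed by optional
# whitespace and then `word`; NA where there is no such match.
number_before <- function(x, word) {
  m <- regexpr(paste0("\\d+(?=\\s*", word, ")"), x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  # regmatches() returns only the elements that matched, in order
  out[m != -1] <- as.integer(regmatches(x, m))
  out
}

wins <- number_before(awards, "win")
nominations <- number_before(awards, "nomination")
```

The lookahead `(?=...)` checks that the keyword follows without consuming it, so only the digits end up in the match.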

Upvotes: 2

akrun

Reputation: 886978

We can extract the numbers into a list and then rbind them after padding with NAs for cases where there is only a single element

lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)", 
               df2$Award, perl = TRUE))
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`, 
                             max(lengths(lst))), as.numeric))
df2
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
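
One thing to keep in mind with the padding approach is that the columns are filled by position, so a row that contains only a nomination count would put that number in `new1`. A possible variation, sketched below, matches each keyword separately with a capture group. The helper `count_for`, the column names `wins` and `nominations`, and the third sample row are assumptions added for illustration and are not part of the original data.

```r
awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "5 wins",
            "3 nominations")  # made-up row with nominations only

# Hypothetical helper: capture the digits that directly precede `word`
count_for <- function(x, word) {
  pattern <- paste0("([0-9]+)[[:space:]]+", word)
  # regexec() keeps capture groups; a non-match yields character(0)
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits,
         function(h) if (length(h) == 2) as.integer(h[2]) else NA_integer_,
         integer(1), USE.NAMES = FALSE)
}

df2 <- data.frame(Award = awards, stringsAsFactors = FALSE)
df2$wins <- count_for(df2$Award, "win")
df2$nominations <- count_for(df2$Award, "nomination")
```

Because each column is tied to its own keyword, a missing "wins" entry stays `NA` in the `wins` column instead of shifting the nomination count into it.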

Upvotes: 0
