Reputation: 31
I just started learning regular expressions and am stuck on one problem. I have a dataset with one column containing movie award information.
**Award**
Won 2 Oscars. Another 7 wins & 37 nominations.
6 wins& 30 nominations
5 wins
Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.
I would like to pull out the numbers before "wins" and "nominations" and add a column for each. For example, for the first row, it would be 7 for the win column and 37 for the nomination column.
The pattern I use is
df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards)
It is not working well. I'm not sure how to write the pattern for "wins". :( Can anyone please help?
Thanks a lot!
Upvotes: 0
Views: 556
Reputation: 4790
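We can use str_extract from the stringr package to get the values with a regular expression. The lookahead (?=\\swin) matches the digits only when they are followed by whitespace and "win", without including that text in the result: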
library(stringr)
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
"6 wins& 30 nominations",
"5 wins",
"Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)
df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)")
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)")
> df
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
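Note that str_extract returns character vectors, so the new columns hold strings rather than numbers. A minimal sketch of converting them to integers, assuming the stringr package is installed (the column names wins and nominations are only illustrative, not from the question):

```r
library(stringr)

text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
          "6 wins& 30 nominations",
          "5 wins",
          "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)

# Digits followed by whitespace and "win" / "nomination" (lookahead keeps
# only the digits); as.integer() turns the strings into numbers.
# Rows without a match stay NA.
df$wins        <- as.integer(str_extract(df$text, "\\d+(?=\\swin)"))
df$nominations <- as.integer(str_extract(df$text, "\\d+(?=\\snomination)"))
```

Whether a missing value should stay NA or be treated as 0 depends on how the data will be used, so that choice is left open here.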
Upvotes: 2
Reputation: 886978
We can extract the numbers into a list, pad the shorter elements with NA for cases where only a single number is found, and then rbind the result:
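# Extract every number that directly precedes "win(s)" or "nomination(s)"
lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations?)\\b)",
                  df2$Award, perl = TRUE))
# Pad each element to the longest length with NA, convert to numeric, bind row-wise
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`,
                  max(lengths(lst))), as.numeric))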
df2
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
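One caveat with padding by position: a row that mentions only nominations would put its number in the first column. A minimal base R sketch that extracts each label separately avoids this. The helper extract_count, the column names, and the last sample row are hypothetical and only for illustration:

```r
# Hypothetical helper: return the integer that precedes `label`
# (singular or plural) in each string, or NA when there is no match.
extract_count <- function(x, label) {
  pattern <- paste0("\\d+(?=\\s+", label, "s?\\b)")
  m <- regexpr(pattern, x, perl = TRUE)   # first match per string, -1 if none
  hit <- !is.na(m) & m > -1
  out <- rep(NA_integer_, length(x))
  out[hit] <- as.integer(regmatches(x, m)) # regmatches drops non-matches
  out
}

award <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
           "6 wins& 30 nominations",
           "5 wins",
           "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.",
           "12 nominations")               # made-up row with no wins

df2 <- data.frame(Award = award)
df2$wins        <- extract_count(df2$Award, "win")
df2$nominations <- extract_count(df2$Award, "nomination")
```

Because each column is filled from its own pattern, the made-up last row gets NA for wins and 12 for nominations instead of shifting the value into the wrong column.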
Upvotes: 0