Reputation: 3
I am fairly new to R and am trying to extract specific numerical values from sentences. Each sentence is stored in its own row of a data frame and is a football play description of a punt. The descriptions follow a nearly uniform structure and look something like this: "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."
I want to extract the return yards, which in this example is the "5". I'm sure there is a way to extract the value following "for": it is the only standalone "for" in every description, and, as in the example above, the number I want comes right after it. I can't find anything online for this, though.
Thanks for any and all help and please let me know if anything needs explaining.
Upvotes: 0
Views: 127
Reputation: 389055
In base R, we can use sub to extract the number after "for".
string <- "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."
sub('.*for (\\d+).*', '\\1', string)
#[1] "5"
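Since the question mentions that the descriptions live in a data frame, here is a minimal sketch of applying the same pattern to a whole column. The data frame name plays, the column names desc and return_yards, and the second (fair catch) description are made up for illustration. Note that sub returns its input unchanged when the pattern does not match, so the sketch checks for a match first and leaves NA otherwise.

```r
# Hypothetical data frame; `plays`, `desc` and the second row are assumptions.
plays <- data.frame(
  desc = c(
    "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison).",
    "(10:12) (Punt formation) D.Sepulveda punts 50 yards to TEN 20, Center-G.Warren, fair catch by C.Finnegan."
  ),
  stringsAsFactors = FALSE
)

pattern <- ".*for (\\d+).*"

# sub() leaves non-matching strings untouched, so only convert rows that match.
has_return <- grepl(pattern, plays$desc)
plays$return_yards <- NA_integer_
plays$return_yards[has_return] <- as.integer(sub(pattern, "\\1", plays$desc[has_return]))
```

Rows without a "for <number>" part keep NA in return_yards instead of carrying the full description text.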
Upvotes: 1
Reputation: 24818
We can use the stringr package's str_extract_all function. This example extracts all numbers that immediately precede the string " yards". The (?= yards) part is called a lookahead: it requires " yards" to follow the number without including it in the match.
library(stringr)
string <- "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."
str_extract_all(string = string, pattern = "[0-9]+(?= yards)")
#[[1]]
#[1] "45" "5"
If we only wanted the number that follows "for ", we could also use lookbehind.
str_extract_all(string = string, pattern = "(?<=for )[0-9]+(?= yards)")
#[[1]]
#[1] "5"
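For a column of descriptions with at most one return per play, str_extract (without _all) may be more convenient, because it returns a plain character vector with NA where nothing matches. A minimal sketch, assuming a character vector called descs whose second element is a made-up play with no return:

```r
library(stringr)

# `descs` and its second element are hypothetical, for illustration only.
descs <- c(
  "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison).",
  "(10:12) (Punt formation) D.Sepulveda punts 50 yards to TEN 20, Center-G.Warren, fair catch by C.Finnegan."
)

# First match per string; NA when the pattern is absent. Convert to integer.
return_yards <- as.integer(str_extract(descs, "(?<=for )[0-9]+(?= yards)"))
```

The result has one element per description, so it can be assigned directly to a new data frame column.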
Upvotes: 4