Calum

Reputation: 3

How to extract specific data values from sentences in R?

I am fairly new to R and am trying to extract specific numerical values from sentences. The sentences are stored in a data frame and are football play descriptions for punt plays. The descriptions are almost uniformly structured and look something like this: "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."

I want to extract the return yards, which in this example is the "5". I'm sure there is a way to extract the value following "for", since it is the only "for" in all of the descriptions and, as in the example above, the "5" comes right after it, but I can't find anything online for this.

Thanks for any and all help and please let me know if anything needs explaining.

Upvotes: 0

Views: 127

Answers (2)

Ronak Shah

Reputation: 389055

In base R, we can use sub with a capture group to extract the number after "for".

string <- "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."
sub('.*for (\\d+).*', '\\1', string)
#[1] "5"
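
Since the descriptions live in a data frame, the same idea can be applied to a whole column. The following is only a sketch under some assumptions: the column name `desc` is hypothetical, the second description is made up to show a play without a return, and the optional minus sign is there in case negative returns appear in the data. Note that sub returns the input unchanged when the pattern does not match, so the rows are filtered with grepl first and the rest are left as NA.

```r
# Hypothetical data frame; the column name `desc` is an assumption.
# The second description is invented to illustrate a play with no return.
plays <- data.frame(
  desc = c(
    "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison).",
    "(10:12) (Punt formation) D.Sepulveda punts 50 yards to end zone, Center-G.Warren, Touchback."
  ),
  stringsAsFactors = FALSE
)

# "for " followed by an (optionally negative) number and " yards".
# The space after "for" keeps "formation" from matching.
pattern <- ".*for (-?\\d+) yards.*"

# sub() leaves non-matching strings untouched, so find the matching rows first
has_return <- grepl(pattern, plays$desc)

# Start with NA everywhere, then fill in the rows that have a return
plays$return_yards <- NA_integer_
plays$return_yards[has_return] <- as.integer(sub(pattern, "\\1", plays$desc[has_return]))

plays$return_yards
```

Converting with as.integer means the new column can be summed or averaged directly instead of being left as text.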

Upvotes: 1

Ian Campbell

Reputation: 24818

We can use the stringr package's str_extract_all function. This example extracts all numbers that immediately precede the string " yards". This is called a lookahead: the " yards" part must be present but is not included in the match.

library(stringr)
string <- "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison)."
str_extract_all(string = string, pattern = "[0-9]+(?= yards)")
#[[1]]
#[1] "45" "5"

If we only want the number that follows "for ", we can also add a lookbehind.

str_extract_all(string = string, pattern = "(?<=for )[0-9]+(?= yards)")
#[[1]]
#[1] "5"
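
For a whole column of descriptions, str_extract (without the _all) may be more convenient, because it returns one value per input string rather than a list. A minimal sketch, assuming each description has at most one return; the vector name `descriptions` is hypothetical and the second entry is made up to show what happens when there is no return:

```r
library(stringr)

# Hypothetical input; the second description is invented for illustration
descriptions <- c(
  "(15:00) (Punt formation) D.Sepulveda punts 45 yards to TEN 32, Center-G.Warren. C.Finnegan to TEN 37 for 5 yards (A.Harrison).",
  "(10:12) (Punt formation) D.Sepulveda punts 50 yards to end zone, Center-G.Warren, Touchback."
)

# str_extract() returns the first match per string, or NA when nothing matches
matched <- str_extract(descriptions, "(?<=for )[0-9]+(?= yards)")

# Convert the character matches to integers; NA stays NA
return_yards <- as.integer(matched)
return_yards
```

The result has the same length as the input, so it can be assigned straight to a new data frame column.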

Upvotes: 4
