Josh

Reputation: 1309

Strip HTML Formatting from R Strings

I'm trying to scrape information from this URL: http://www.sports-reference.com/cbb/boxscores/index.cgi?month=2&day=3&year=2017 and have gotten far enough that I have a string for each game that looks like this:

str <-"Yale\n\t\t\t87\n\t\t\t\n\t\t\t\tFinal\n\t\t\t\t\n\t\t\t\n\t\tColumbia\n\t\t\t78\n\t\t\t \n\t\t\t\n\t\t"

Ideally I'd like to get to a vector or dataframe that looks something like:

str_vec <- c('Yale',87,'Columbia',78)

I've tried a few things that didn't work, like:

without_n <- gsub(x = str, pattern = '\n')
without_Final <- gsub(x = without_n, pattern = 'Final')
str_vec <- strslpit(x = without_Final, split = '\t')

Thanks in advance for any helpful tips/answers!

Upvotes: 0

Views: 581

Answers (1)

Vamsi Prabhala

Reputation: 49260

You can use gsub to first strip every non-alphanumeric character, along with the word "Final", from the string. Then insert a space between each name and its score. After that, split the string on spaces to get the pieces you need.

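library(stringr)  # only needed for str_trim()

# Remove every non-alphanumeric character and the literal "Final"
step_1 <- gsub('([^[:alnum:]]|(Final))', "", str)
# "Yale87Columbia78"

# Put a space between each run of letters and the run of digits that follows
step_2 <- gsub("([[:alpha:]]+)([[:digit:]]+)", "\\1 \\2 ", step_1)
# strsplit() returns a list, so take its first element to get a character vector
strsplit(str_trim(step_2), " ")[[1]]
# "Yale" "87" "Columbia" "78"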

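This only works reliably if the string pattern is consistent. Note that the first step also strips spaces and punctuation inside team names, so a name like "North Carolina" would come out as "NorthCarolina".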
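
If team names can contain spaces, one possible alternative is to split on runs of newlines and tabs instead of stripping characters. This is only a rough sketch in base R: it assumes each string still alternates team, score, team, score once the "Final" marker is dropped, and the names parts and games are made up for illustration.

```r
str <- "Yale\n\t\t\t87\n\t\t\t\n\t\t\t\tFinal\n\t\t\t\t\n\t\t\t\n\t\tColumbia\n\t\t\t78\n\t\t\t \n\t\t\t\n\t\t"

# Split on runs of newlines/tabs only, so spaces inside a team name survive
parts <- trimws(strsplit(str, "[\n\t]+")[[1]])

# Drop empty pieces and the "Final" status marker (assumed to be the only extra token)
parts <- parts[parts != "" & parts != "Final"]

# Assumption: the remaining pieces alternate team, score, team, score
games <- data.frame(
  team  = parts[c(TRUE, FALSE)],
  score = as.integer(parts[c(FALSE, TRUE)]),
  stringsAsFactors = FALSE
)
```

Here parts is the flat character vector asked for in the question, and games holds the same values as one row per team with an integer score column.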

Upvotes: 2
