Reputation: 430
I'd like to get the information in three tables from a website. I tried to apply the code below, but the table is in a confusing format.
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>% read_html() %>% html_table(fill = TRUE)
Note: the tidyverse and rvest packages have been loaded.
Upvotes: 0
Views: 67
Reputation: 16862
The table you're working with is tricky because there are table cells (<td> elements in HTML) that span two rows in order to repeat information. When html_table() strips the markup out, those individual rows get concatenated and you get long strings of blank spaces and newlines.
library(dplyr)
library(rvest)
ufc <- read_html("http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9") %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  filter(!is.na(Fighter)) # could instead use janitor::remove_empty or rowSums for number of NAs
ufc$Fighter[1]
#> [1] "Tom Aaron\n \n \n\n \n \n Matt Ricehouse"
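The concatenation can be reproduced offline on a small hand-written table, which makes it easier to see where the whitespace comes from. This is only a sketch: the HTML below is a made-up miniature and an assumption about how the page's cells are laid out (two paragraphs per cell), not a copy of the site's markup.

```r
library(rvest)

# Hypothetical miniature of the fight table: each <td> holds two <p> blocks,
# one per fighter, with newlines and indentation around the text.
page <- read_html("<table>
  <tr><th>Fighter</th><th>Str</th></tr>
  <tr>
    <td>
      <p>
        Tom Aaron
      </p>
      <p>
        Matt Ricehouse
      </p>
    </td>
    <td>
      <p>
        0
      </p>
      <p>
        0
      </p>
    </td>
  </tr>
</table>")

# html_text() keeps the whitespace from the markup, so both values
# end up in one string separated by newlines and spaces.
raw_cells <- page %>% html_nodes("td") %>% html_text()
print(raw_cells)

# Trimming the ends and splitting on runs of 2+ whitespace characters
# recovers the two values per cell.
parts <- strsplit(trimws(raw_cells), "\\s{2,}")
print(parts)
```

Single spaces inside a value (as in a fighter's name) are left alone, because only runs of two or more whitespace characters are treated as a separator.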
With some regex, you can turn those runs of blanks into delimiters and split the cells on them. Information that applies to both rows (such as time) gets repeated. Originally I did this with mutate_all, but realized Event shouldn't be split; for that column, just collapse the extra spaces instead. Adjust as needed for other columns.
ufc %>%
  mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%
  mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%
  tidyr::separate_rows(everything(), sep = "\\|")
#> W/L Fighter Str Td Sub Pass
#> 1 loss Tom Aaron 0 0 0 0
#> 2 loss Matt Ricehouse 0 0 0 0
#> 3 win Tom Aaron 0 0 0 0
#> 4 win Eric Steenberg 0 0 0 0
#> Event Method Round
#> 1 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 2 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 3 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> 4 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> Time
#> 1 5:00
#> 2 5:00
#> 3 0:56
#> 4 0:56
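In dplyr 1.0.0 and later, mutate_at() and mutate_all() are superseded by across(). The same idea might be written as below. This is a sketch on a toy one-row data frame, not on the scraped table: the toy values and the names toy and tidy are made up for illustration, shaped like the ufc$Fighter[1] string shown earlier.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy stand-in for one scraped row (illustrative values, not real output).
toy <- tibble(
  Fighter = "Tom Aaron\n \n \n\n \n \n Matt Ricehouse",
  Str     = "0\n \n 0",
  Event   = "Strikeforce - Heavy Artillery\n \n May. 15, 2010",
  Round   = "1"
)

tidy <- toy %>%
  # Columns that hold one value per fighter: mark the whitespace run as a delimiter.
  mutate(across(Fighter:Str, ~ str_replace_all(.x, "\\s{2,}", "|"))) %>%
  # Every column: collapse any remaining whitespace runs to single spaces.
  mutate(across(everything(), str_squish)) %>%
  # Split only the per-fighter columns; the other columns are repeated on each row.
  separate_rows(Fighter, Str, sep = "\\|")

print(tidy)
```

Naming the columns to split in separate_rows(), rather than passing everything(), keeps columns such as Event from being split if they ever happen to contain the delimiter.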
Upvotes: 1
Reputation: 389145
You need to do some cleaning of the table.
library(rvest)
library(dplyr)
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>%
  read_html() %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  .[complete.cases(.), ] %>%
  mutate_all(~gsub('\n|\\s{2,}', '', .))
# W/L Fighter Str Td Sub Pass
#1 loss Tom AaronMatt Ricehouse 00 00 00 00
#2 win Tom AaronEric Steenberg 00 00 00 00
# Event Method Round Time
#1 Strikeforce - Henderson vs. BabaluDec. 04, 2010 U-DEC 3 5:00
#2 Strikeforce - Heavy ArtilleryMay. 15, 2010 SUBGuillotine Choke 1 0:56
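Because the whitespace is replaced with an empty string, the two values in each cell are glued together in the output above ("Tom AaronMatt Ricehouse", "00"). If one row per fight is what you want, a visible separator keeps the cells readable. A minimal base R sketch on a made-up character vector (the vector x and the " / " separator are illustrative choices, not part of the original code):

```r
# Illustrative cell contents shaped like the scraped strings.
x <- c("Tom Aaron\n \n \n\n \n \n Matt Ricehouse", "0\n \n 0", "U-DEC")

# Trim the ends, then replace each run of 2+ whitespace characters
# with a readable separator instead of deleting it.
joined <- gsub("\\s{2,}", " / ", trimws(x))
print(joined)
```

Cells without a whitespace run, such as "U-DEC", pass through unchanged.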
Upvotes: 4