Wagner Jorge

Reputation: 430

Table from url using rvest package

I'd like to extract the three tables from a web page. I tried the code below, but the table comes back in a confusing format.

url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>% read_html() %>% html_table(fill = TRUE)

Note: the tidyverse and rvest packages are loaded.

Upvotes: 0

Views: 67

Answers (2)

camille

Reputation: 16862

The table you're working with is tricky because there are table cells (<td> elements in HTML) that span two rows in order to repeat information. When html_table() strips out the markup, the values from those rows are concatenated, separated by long runs of spaces and newlines.

library(dplyr)
library(rvest)

ufc <- read_html("http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9") %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  filter(!is.na(Fighter)) # could instead use janitor::remove_empty or rowSums for number of NAs

ufc$Fighter[1]
#> [1] "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse"

With some regex, you can turn those runs of blanks into delimiters and split the cells on them. Information that applies to both rows (such as Time) gets repeated. Originally I did this with mutate_all, but realized Event shouldn't be split; for that column, just collapse the extra spaces instead. Adjust as needed for other columns.

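ufc %>%
  # in the columns to split, turn each run of 2+ whitespace characters into a "|" delimiter
  mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%
  # everywhere else (e.g. Event), collapse the remaining runs into a single space
  mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%
  # split on the delimiter, giving one row per fighter
  tidyr::separate_rows(everything(), sep = "\\|")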
#>    W/L        Fighter Str Td Sub Pass
#> 1 loss      Tom Aaron   0  0   0    0
#> 2 loss Matt Ricehouse   0  0   0    0
#> 3  win      Tom Aaron   0  0   0    0
#> 4  win Eric Steenberg   0  0   0    0
#>                                              Event               Method Round
#> 1 Strikeforce - Henderson vs. Babalu Dec. 04, 2010                U-DEC     3
#> 2 Strikeforce - Henderson vs. Babalu Dec. 04, 2010                U-DEC     3
#> 3      Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke     1
#> 4      Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke     1
#>   Time
#> 1 5:00
#> 2 5:00
#> 3 0:56
#> 4 0:56
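
If you're on a recent dplyr, the same idea can be sketched with across() in place of mutate_at/mutate_all. This is only a minimal sketch under assumptions: `toy` is a hypothetical one-row stand-in for the scraped table so that it runs without hitting the site. The Fighter string is the one printed above; the whitespace in the other cells is made up for illustration, and I name the columns to split explicitly rather than using everything().

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for one row of the scraped table.
# Only the Fighter value is copied from the real output; the rest is illustrative.
toy <- tibble(
  Fighter = "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse",
  Str     = "0\n          \n        0",
  Event   = "Strikeforce - Heavy Artillery\n          \n   May. 15, 2010",
  Time    = "0:56"
)

cleaned <- toy %>%
  # columns to split: each run of 2+ whitespace characters becomes a "|" delimiter
  mutate(across(Fighter:Str, ~ str_replace_all(.x, "\\s{2,}", "|"))) %>%
  # remaining columns: collapse leftover runs into a single space
  mutate(across(everything(), ~ str_replace_all(.x, "\\s{2,}", " "))) %>%
  # one row per fighter; Event and Time are repeated on both rows
  separate_rows(Fighter, Str, sep = "\\|")

cleaned
```

Naming the split columns in separate_rows() keeps Event and Time out of the splitting step altogether, so a stray "|" in those columns can't create extra rows.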

Upvotes: 1

Ronak Shah

Reputation: 389145

You need to clean the table after reading it: drop the incomplete rows, then strip the newlines and runs of whitespace.

library(rvest)
library(dplyr)

url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'

url %>% 
  read_html %>% 
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  .[complete.cases(.),] %>%
  mutate_all(~gsub('\n|\\s{2,}', '', .))

#   W/L                 Fighter Str Td Sub Pass
#1 loss Tom AaronMatt Ricehouse  00 00  00   00
#2  win Tom AaronEric Steenberg  00 00  00   00

#                                            Event              Method Round Time
#1 Strikeforce - Henderson vs. BabaluDec. 04, 2010               U-DEC     3 5:00
#2      Strikeforce - Heavy ArtilleryMay. 15, 2010 SUBGuillotine Choke     1 0:56
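
Note that removing the whitespace outright fuses the two values in each cell (for example, Tom AaronMatt Ricehouse). If that matters, one option is to substitute a visible separator, or to split the cell, instead. Here is a minimal base R sketch on a single raw Fighter value of the kind this page returns; the " / " separator is an arbitrary choice for illustration.

```r
# Raw cell text: two names separated by a long run of spaces and newlines
raw <- "Tom Aaron\n          \n        \n\n        \n          \n            Matt Ricehouse"

# Removing the whitespace outright glues the two names together
fused <- gsub("\n|\\s{2,}", "", raw)
fused
#> [1] "Tom AaronMatt Ricehouse"

# Replacing each run with a separator keeps the names distinguishable
separated <- gsub("\\s{2,}", " / ", raw)
separated
#> [1] "Tom Aaron / Matt Ricehouse"

# Or split the cell into a character vector, one element per name
parts <- strsplit(raw, "\\s{2,}")[[1]]
```

The single spaces inside each name survive because the pattern only matches runs of two or more whitespace characters.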

Upvotes: 4
