Greg

Reputation: 3660

Information lost by html_table

I'm trying to scrape the third table from this website and store it as a data frame. Below is a reproducible example.

The third table is the one with "Isiah YOUNG" in the first row, third column.

library(rvest)
library(dplyr)

target_url <-
  "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- target_url %>%
  read_html(options = c("DTDLOAD")) %>%
  html_nodes("[id^=splitevents]") # this is the correct node

So far so good. Printing table[[1]] shows the contents I want.

table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
 [1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
 [2] <td>1</td>\n
 [3] <td>6</td>\n
 [4] <td></td>\n
 [5] <td>Isiah YOUNG</td>\n
 [6] <td></td>\n
 [7] <td>NIKE</td>\n
 [8] <td>20.28 Q</td>\n
 [9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...

However, passing this to html_table results in an empty data frame.

table[[1]] %>%
  html_table(fill = TRUE)
[1] Pl          Ln                      Athlete                 Affiliation Time                   
<0 rows> (or 0-length row.names)

How can I get the contents of table[[1]] (which clearly do exist) as a data frame?

Upvotes: 1

Views: 55

Answers (1)

QHarr

Reputation: 84465

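The HTML is malformed and trips up the parser. As the printed node in the question shows, the <td> cells sit directly under <table> rather than inside <tr> rows, which would explain why html_table picks up the header but returns zero rows. I haven't found an easy way to repair the markup itself.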

An alternative, in this particular scenario, is to rebuild the table by hand: take the number of th cells as the column count, divide the total number of td cells by that count to get the row count, then reshape the cell text into a matrix (filled row by row) and convert it to a data frame.

library(rvest)
library(dplyr)

target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"

table <- read_html(target_url) %>%
  html_node("#splitevents")

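# Extract the text of every data cell and header cell, in document order
tds <- table %>% html_nodes("td") %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()

# One column per header; the cells must divide evenly into rows
num_col <- length(ths)
stopifnot(length(tds) %% num_col == 0)
num_row <- length(tds) / num_col

# Fill a matrix row by row, then convert it to a data frame
df <- tds %>%
  matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
  data.frame() %>%
  setNames(ths)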
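
To see the reshaping step in isolation, without hitting the website, here is a minimal base-R sketch. The cell values are copied from the first two rows printed in the question; the header vector is an assumption (in particular, which headers are blank and where), and the "blank" placeholder name is purely illustrative. It only shows how a flat vector of cells becomes rows when the column count is known.

```r
# Cell texts in document order (first two rows from the question's output).
tds <- c(
  "1", "6", "", "Isiah YOUNG", "", "NIKE", "20.28 Q", "0.184",
  "2", "7", "", "Elijah HALL-THOMPSON", "", "Houston", "20.50 Q", "0.200"
)

# Header texts. ASSUMPTION: the positions of the blank headers are guessed
# from the printed output; check them against the real page.
ths <- c("Pl", "Ln", "", "Athlete", "", "Affiliation", "Time", "")

# One column per header; refuse to continue if the cells do not divide evenly.
num_col <- length(ths)
stopifnot(length(tds) %% num_col == 0)
num_row <- length(tds) %/% num_col

# byrow = TRUE fills the matrix left to right, one table row at a time.
cells <- matrix(tds, nrow = num_row, ncol = num_col, byrow = TRUE)
df <- as.data.frame(cells, stringsAsFactors = FALSE)

# Blank header cells would give empty column names, so substitute a
# placeholder ("blank" is a hypothetical name) and make the names unique.
names(df) <- make.unique(ifelse(ths == "", "blank", ths))

print(df)
```

Replacing the blank names is optional, but it makes the columns addressable by name, for example df$Athlete or df$Time. The divisibility check is worth keeping when applying this to the real page: if the cell count is not a multiple of the header count, the fixed-width assumption does not hold and the rows would be misaligned.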

Upvotes: 2
