Reputation: 3660
I'm trying to scrape the third table from this website and store it as a data frame. A reproducible example is below.
The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)
target_url <-
  "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- target_url %>%
  read_html(options = c("DTDLOAD")) %>%
  html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
[1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
[2] <td>1</td>\n
[3] <td>6</td>\n
[4] <td></td>\n
[5] <td>Isiah YOUNG</td>\n
[6] <td></td>\n
[7] <td>NIKE</td>\n
[8] <td>20.28 Q</td>\n
[9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table() results in an empty data frame.
table[[1]] %>%
  html_table(fill = TRUE)
[1] Pl Ln Athlete Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly exist) as a data frame?
Upvotes: 1
Views: 55
Reputation: 84465
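The HTML is malformed, which trips up the parser: as the printed node in the question shows, the td cells sit directly under the table element rather than inside tr rows, so html_table() finds a header row but no data rows. I haven't found an easy way to repair the markup itself.
An alternative, in this particular case, is to take the number of th header cells as the column count, derive the row count by dividing the total number of td cells by the column count, and use both to reshape the cell text into a matrix and then a data frame.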
library(rvest)
library(dplyr)
target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
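# Select the table by id; html_node() returns a single node
table <- read_html(target_url) %>%
  html_node("#splitevents")

# Text of every data cell and every header cell, in document order
tds <- table %>% html_nodes("td") %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()

# Column count from the headers; row count from the total cell count
num_col <- length(ths)
num_row <- length(tds) / num_col

# Fill the matrix row by row, then convert and name the columns
df <- tds %>%
  matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
  data.frame() %>%
  setNames(ths)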
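To see the reshaping step in isolation, here is a minimal base R sketch that runs without the website. The `ths` and `tds` vectors are hypothetical stand-ins for the html_text() results (a simplified three-column layout, not the real page), and the divisibility check is my own addition rather than part of the code above.

```r
# Hypothetical stand-ins for the html_text() output (not the real page data)
ths <- c("Pl", "Ln", "Athlete")
tds <- c("1", "6", "Isiah YOUNG",
         "2", "7", "Elijah HALL-THOMPSON")

num_col <- length(ths)

# Assumption: every row has exactly one cell per header.
# Stop early if the cells do not fill whole rows.
stopifnot(length(tds) %% num_col == 0)
num_row <- length(tds) %/% num_col

# byrow = TRUE because the cells arrive in row-major (document) order
m <- matrix(tds, nrow = num_row, ncol = num_col, byrow = TRUE)
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- ths

# Everything comes back as character; convert numeric columns explicitly
df$Pl <- as.integer(df$Pl)
df$Ln <- as.integer(df$Ln)

print(df)
```

If the check fails on a real page, the header count probably does not match the number of cells per row (for example because of colspan or empty header cells), and the column count would need to be set by hand.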
Upvotes: 2