tfr950
tfr950

Reputation: 422

Web scraping tables on college basketball stats

I am new to webscraping and working on a test project in which I am trying to scrape every table of data on the following website for this particular team. There should be 15 tables but when I run my code, it only seems to pull the first 6 of the 15. How do I go about getting the rest of the tables?

Here is the code:

library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")

iowa_stats %>% html_table()

Edit: So I decided to dig a little bit deeper into the problem and see if I could get any more insights. So I decided to start with the first table that doesn't appear when you call the html_table command which is the 'Totals' Table. I did the following to follow the path of the html all the way down to the table to see if I could figure out what's wrong. To do so, I used the following code.

iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")

This is as far as I can get prior to getting an error. At the next step, there should be the following: div#div_totals.table_container.is_setup in which the table is stored but if I were to add that to the above code, it doesn't exist. When I type the following, it doesn't exist as well.

iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")

Does someone who is better with html/css have any idea why this is the case?

Upvotes: 1

Views: 381

Answers (1)

Dave2e
Dave2e

Reputation: 24089

It looks like this webpage is storing some of the tables as comments. To solve this read and save the web page. Remove the comment tags and then process normally.

library(rvest)
library(dplyr)

iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
#Only save and work with the body
body<-html_node(iowa_stats,"body")
write_xml(body, "temp.xml")

#Find and remove comments
lines<-readLines("temp.xml")
lines<-lines[-grep("<!--", lines)]
lines<-lines[-grep("-->", lines)]
writeLines(lines, "temp2.xml")

#Read the file back in and process normally
body<-read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()

Upvotes: 1

Related Questions