Reputation: 21
I want to load the table at the bottom of the following webpage into R, either as a dataframe or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function in the XML package
library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table
However, this returns an empty list and gives me the following warning:
Warning message:XML content does not seem to be XML: ''
I also tried adapting code I found here Scraping html tables into R data frames using the XML package. This worked for 5 of the 6 tables on the page, but just returned the header row and one row with values from the header row for the 6th table, which is the one I am interested in. Code below:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those the header row
head(applicanttable)
Any insights would be greatly appreciated! For reference, I have also consulted the following posts that appear to have similar goals, but could not find a solution there:
Scraping html tables into R data frames using the XML package Extracting html table from a website in R
Upvotes: 1
Views: 328
Reputation: 84475
The data is dynamically pulled from a nested JavaScript array, within a script
tag when JavaScript runs in the browser. This doesn't happen when you use rvest
to retrieve the non-rendered content (as seen in view-source).
You can regex out the appropriate nested array and then re-construct the table by splitting out the rows, adding the appropriate headers and performing some data manipulations on various columns e.g. some columns contain html which needs to be parsed to obtain the desired value.
As some columns e.g. Name
contain values which could be interpreted as file paths , when using read_html
, I use htmltidy
to ensure handling as valid html.
N.B. If you use RSelenium then the page will render and you can just grab the table direct without reconstructing it.
TODO:
Name
is returned in Name
column. Take the case of df$Name[10]
, this returns "Character and fitness issues"
instead of Anxiousboy
, due to the required value actually sitting in element.nextSibling.nextSibling
of the p
tag which is actually selected. These, infrequent, edge cases, need some additional logic built in. In this case, you might test for a particular string being returned then resort to re-parsing with an xpath expression.R:
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
get_value <- function(input) {
value <- tidy_html(input) %>%
read_html() %>%
html_node("a, p, span") %>%
html_text(trim = T)
result <- ifelse(is.na(value), input, value)
return(result)
}
tidy_result <- function(result) {
return(gsub("<.*", "", result))
}
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")
s <- page %>% toString()
headers <- page %>%
html_nodes("#applicants-table th") %>%
html_text(trim = T)
s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = T)) %>%
gsub("\n", "", .)
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = T))[[1]] %>% as.list()
df <- sapply(rows, function(x) {
stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
t() %>%
as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
names(df) <- headers
df <- df %>%
rowwise() %>%
mutate(across(c("Name", "GRE", "URM", "$$$$"), .f = get_value)) %>%
mutate_at(c("Result"), tidy_result)
write.csv(df, "Yale Applications.csv")
Created on 2021-06-23 by the reprex package (v0.3.0)
Sample output:
Upvotes: 1