nattydeacs

Reputation: 21

Issue loading HTML Table into R

I want to load the table at the bottom of the following webpage into R, either as a dataframe or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function from the XML package:

library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table

However, this returns an empty list and gives me the following warning:

Warning message:
XML content does not seem to be XML: ''

I also tried adapting code I found in Scraping html tables into R data frames using the XML package. This worked for 5 of the 6 tables on the page, but for the 6th table, which is the one I am interested in, it returned only the header row plus one row repeating the header values. Code below:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those in the header row
head(applicanttable)

Any insights would be greatly appreciated! For reference, I have also consulted the following posts that appear to have similar goals, but could not find a solution there:

Scraping html tables into R data frames using the XML package
Extracting html table from a website in R

Upvotes: 1

Views: 328

Answers (1)

QHarr

Reputation: 84475

The data is dynamically generated from a nested JavaScript array inside a script tag when JavaScript runs in the browser. That rendering never happens when you retrieve the page with rvest, which returns only the non-rendered content (what you see in view-source), so the table body is empty.
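
You can verify this with a quick check (a minimal sketch, using the same #applicants-table selector as the code further down):

library(rvest)

page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")
# the header cells exist in the raw html, but no body rows do,
# because DataTables injects them at render time
length(html_nodes(page, "#applicants-table tbody tr"))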

You can regex out the appropriate nested array and then reconstruct the table by splitting out the rows, adding the appropriate headers, and performing some data manipulation on various columns, e.g. parsing the html that some columns contain to obtain the desired value.

Because some columns, e.g. Name, contain values that could be interpreted as file paths when passed to read_html, I run them through htmltidy first to ensure they are handled as valid html.

N.B. If you use RSelenium, the page will render and you can grab the table directly without reconstructing it. A minimal sketch follows.
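
Here is one possible RSelenium version (a sketch only, untested here; it assumes a working chromedriver, and the fixed sleep is a crude stand-in for an explicit wait):

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome", verbose = FALSE)
remote <- driver$client
remote$navigate("https://www.lawschooldata.org/school/Yale%20University/18")
Sys.sleep(5)  # give the JavaScript time to populate the table

# once rendered, the table can be lifted straight into a data frame
applicants <- remote$getPageSource()[[1]] %>%
  read_html() %>%
  html_node("#applicants-table") %>%
  html_table()

remote$close()
driver$server$stop()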

TODO:

  1. There are still some data type manipulations you could choose to apply to a few columns.
  2. There is some more logic to be applied to ensure only the name is returned in the Name column. Take the case of df$Name[10]: this returns "Character and fitness issues" instead of Anxiousboy, because the required value actually sits in element.nextSibling.nextSibling of the p tag that gets selected. These infrequent edge cases need some additional logic built in. In such a case, you might test for a particular string being returned and then fall back to re-parsing with an xpath expression, as sketched below.
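
One possible shape for that fallback (a sketch only: the test string and the xpath are assumptions about this page's markup, and it builds on the get_value helper defined in the code below):

get_name <- function(input) {
  value <- get_value(input)
  # known non-name string -> re-parse and take the first non-empty text node
  # following the p tag (element.nextSibling.nextSibling in DOM terms)
  if (!is.na(value) && value == "Character and fitness issues") {
    value <- tidy_html(input) %>%
      read_html() %>%
      html_node(xpath = "//p/following-sibling::text()[normalize-space()][1]") %>%
      html_text(trim = TRUE)
  }
  value
}

You would then map get_name over the Name column in place of get_value.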

R:

library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

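# get_value: some cells hold small html fragments; tidy first so read_html
# treats the string as markup rather than a file path, then take the text of
# the first a/p/span node, passing plain-text cells through unchanged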
get_value <- function(input) {
  value <- tidy_html(input) %>%
    read_html() %>%
    html_node("a, p, span") %>%
    html_text(trim = T)
  result <- ifelse(is.na(value), input, value)
  return(result)
}

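# tidy_result: drop everything from the first '<' onward, stripping
# trailing html from the Result values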
tidy_result <- function(result) {
  return(gsub("<.*", "", result))
}

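# retrieve the non-rendered page; the headers and the embedded
# JavaScript data are both parsed from it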
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")

s <- page %>% toString()

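# the column headers exist in the static html even though the body rows do not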
headers <- page %>%
  html_nodes("#applicants-table th") %>%
  html_text(trim = T)

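# regex out the nested data array passed to DataTable() inside the script tag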
s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = T)) %>%
  gsub("\n", "", .)

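# split the extracted array into one string per row (each row is a [...] block)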
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = T))[[1]] %>% as.list()

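# pull the single-quoted values out of each row and assemble them into a tibble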
df <- sapply(rows, function(x) {
  stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
  t() %>%
  as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...

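# apply the scraped headers, then parse the html-bearing columns to plain text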
names(df) <- headers

df <- df %>%
  rowwise() %>%
  mutate(across(c("Name", "GRE", "URM", "$$$$"), .f = get_value)) %>%
  mutate_at(c("Result"), tidy_result)

write.csv(df, "Yale Applications.csv")

Created on 2021-06-23 by the reprex package (v0.3.0)


Sample output:

[screenshot of the first few rows of the reconstructed applicants table]

Upvotes: 1
