bigbodyhoop

Reputation: 31

Is there a way to web scrape HTML table data that keeps showing up as "" when using rvest tools?

        <td headers="apcl1" data-dyn="1" class="text-center">1<span class="hidden"> authorized course</span></td>
        <td headers="apcl2" data-dyn="2" class="text-center">1<span class="hidden"> authorized course</span></td>
        <td headers="apcl3" data-dyn="3" class="text-center">1<span class="hidden"> authorized course</span></td>
        <td headers="apcl4" data-dyn="4" class="text-center">--<span class="hidden"> no authorized courses</span></td>

For the HTML above, I am trying to scrape the text in each td tag that comes before the span (i.e., 1, 1, 1, --).

I am using R and the rvest package and my code is below:

library(rvest)

individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="

read_html(individual_temp_url) %>%
  html_nodes('td') %>%
  html_text()

However, when I run this, I get "" for every td tag. How can I extract the value from each td tag?

Upvotes: 0

Views: 58

Answers (1)

Allan Cameron

Reputation: 173858

The td elements are empty in the HTML you download. In the browser, they are populated by JavaScript after the page loads, from JSON embedded in one of the page's script tags. You can extract that script text and parse the JSON to get a nice data frame:

library(rvest)
#> Loading required package: xml2
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="

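df <- read_html(individual_temp_url) %>%
  html_nodes('script') %>%
  html_text() %>%
  `[`(4) %>%                          # the 4th script tag holds the data
  strsplit("dataSet = |\r\n|;") %>%   # split around the `dataSet = ...;` assignment
  unlist() %>%
  `[`(3) %>%                          # the 3rd piece is the JSON text
  jsonlite::fromJSON()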

df
#>       data    data    data    data    data    data    data    data    data
#> 1  2007-08 2008-09 2009-10 2010-11 2011-12 2012-13 2013-14 2014-15 2015-16
#> 2        0       0       0       0       0       1       1       1       1
#> 3        2       2       2       2       2       2       2       2       2
#> 4        3       3       3       3       3       2       2       4       3
#> 5        1       1       1       1       1       1       1       1       2
#> 6        2       3       2       2       2       2       2       2       2
#> 7        1       1       1       1       1       1       1       1       1
#> 8        0       0       0       0       0       0       0       0       0
#> 9        1       1       1       1       1       1       1       1       1
#> 10       1       1       1       1       1       1       1       1       1
#> 11       1       1       1       1       1       2       2       3       1
#> 12       0       0       2       2       2       2       2       2       1
#> 13       0       0       1       1       1       1       1       1       1
#> 14       0       0       0       0       0       1       1       1       0
#> 15       0       0       0       0       1       1       1       1       1
#> 16       0       0       0       0       0       0       0       2       2
#> 17       0       0       0       0       0       0       0       0       1
#> 18       0       0       0       0       0       2       2       0       0
#> 19       0       0       0       0       0       0       0       0       0
#> 20       1       1       1       1       1       1       2       2       2
#> 21       1       1       1       1       1       1       1       1       1
#> 22       1       1       1       1       1       1       1       1       1
#> 23       1       1       1       1       1       2       2       2       2
#> 24       1       2       2       1       1       1       1       1       1
#> 25       2       3       4       2       1       1       1       1       2
#> 26       2       3       3       2       1       2       1       1       2
#>       data    data    data    data
#> 1  2016-17 2017-18 2018-19 2019-20
#> 2        1       1       1       0
#> 3        2       2       2       1
#> 4        0       0       1       2
#> 5        0       0       0       2
#> 6        2       2       2       1
#> 7        1       1       1       1
#> 8        1       1       1       1
#> 9        1       1       1       1
#> 10       1       2       2       1
#> 11       1       1       1       1
#> 12       2       2       2       2
#> 13       1       1       1       1
#> 14       0       0       0       0
#> 15       1       1       1       1
#> 16       2       2       2       1
#> 17       0       1       1       0
#> 18       0       0       0       0
#> 19       0       0       1       1
#> 20       0       0       1       1
#> 21       1       1       1       1
#> 22       0       0       1       0
#> 23       2       2       2       2
#> 24       1       1       0       1
#> 25       2       2       3       3
#> 26       0       0       1       1

Created on 2020-03-07 by the reprex package (v0.3.0)
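
The hard-coded positions in the pipeline (4th script tag, 3rd split piece) depend on the page's current layout. As a hedged variation, the sketch below selects the script by its content and pulls the JSON out with a regular expression. It runs against a small made-up HTML string rather than the live site, so `page_source`, the shape of the JSON, and the column names `year` and `courses` are all assumptions for illustration, not the real page's structure.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the downloaded page: the td is empty and the
# values sit in a script variable. The real page's JSON may look different.
page_source <- '
<html><body>
  <table><tr><td headers="apcl1" data-dyn="1" class="text-center"></td></tr></table>
  <script>var other = 1;</script>
  <script>
    var dataSet = [{"year": "2018-19", "courses": 1}, {"year": "2019-20", "courses": 0}];
    var unrelated = true;
  </script>
</body></html>'

page <- read_html(page_source)

# The static HTML has no cell text, which is why html_text() gives "".
cell_text <- html_text(html_nodes(page, "td"))

# Pick the script by what it contains instead of by its position.
scripts <- html_text(html_nodes(page, "script"))
target <- scripts[grepl("dataSet\\s*=", scripts)][1]

# Capture the array assigned to dataSet, up to the first "]" followed by ";".
# Assumes the JSON itself never contains the sequence "];" inside a string.
pattern <- "(?s)dataSet\\s*=\\s*(\\[.*?\\])\\s*;"
json_text <- regmatches(target, regexec(pattern, target, perl = TRUE))[[1]][2]

df <- fromJSON(json_text)
df
```

If the site reorders its script tags, matching on `dataSet` should keep working where a fixed index would not, though the regular expression would still need adjusting if the variable is renamed or the assignment is formatted differently.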

Upvotes: 2
