Reputation: 31
<td headers="apcl1" data-dyn="1" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl2" data-dyn="2" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl3" data-dyn="3" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl4" data-dyn="4" class="text-center">--<span class="hidden"> no authorized courses</span></td>
For the HTML above, I am trying to scrape the text in each td tag that sits between > and <span (i.e., 1, 1, 1, --).
I am using R and the rvest package and my code is below:
library(rvest)
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
read_html(individual_temp_url) %>%
  html_nodes('td') %>%
  html_text()
However, when I do this, all I get is "" for each of the td tags. How can I extract the value from each td tag?
Upvotes: 0
Views: 58
Reputation: 173858
The td elements are blank in the HTML you download. In the browser, they are populated by JavaScript after the page loads, from JSON embedded in one of the page's script tags. You can extract this and parse the JSON to get a nice data frame:
library(rvest)
#> Loading required package: xml2
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
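df <- read_html(individual_temp_url) %>%
  html_nodes('script') %>%            # every script tag on the page
  html_text() %>%
  `[`(4) %>%                          # the script that contains the data
  strsplit("dataSet = |\r\n|;") %>%   # split on the assignment, line breaks and semicolons
  unlist() %>%
  `[`(3) %>%                          # the piece that holds the JSON text
  jsonlite::fromJSON()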
df
#> data data data data data data data data data
#> 1 2007-08 2008-09 2009-10 2010-11 2011-12 2012-13 2013-14 2014-15 2015-16
#> 2 0 0 0 0 0 1 1 1 1
#> 3 2 2 2 2 2 2 2 2 2
#> 4 3 3 3 3 3 2 2 4 3
#> 5 1 1 1 1 1 1 1 1 2
#> 6 2 3 2 2 2 2 2 2 2
#> 7 1 1 1 1 1 1 1 1 1
#> 8 0 0 0 0 0 0 0 0 0
#> 9 1 1 1 1 1 1 1 1 1
#> 10 1 1 1 1 1 1 1 1 1
#> 11 1 1 1 1 1 2 2 3 1
#> 12 0 0 2 2 2 2 2 2 1
#> 13 0 0 1 1 1 1 1 1 1
#> 14 0 0 0 0 0 1 1 1 0
#> 15 0 0 0 0 1 1 1 1 1
#> 16 0 0 0 0 0 0 0 2 2
#> 17 0 0 0 0 0 0 0 0 1
#> 18 0 0 0 0 0 2 2 0 0
#> 19 0 0 0 0 0 0 0 0 0
#> 20 1 1 1 1 1 1 2 2 2
#> 21 1 1 1 1 1 1 1 1 1
#> 22 1 1 1 1 1 1 1 1 1
#> 23 1 1 1 1 1 2 2 2 2
#> 24 1 2 2 1 1 1 1 1 1
#> 25 2 3 4 2 1 1 1 1 2
#> 26 2 3 3 2 1 2 1 1 2
#> data data data data
#> 1 2016-17 2017-18 2018-19 2019-20
#> 2 1 1 1 0
#> 3 2 2 2 1
#> 4 0 0 1 2
#> 5 0 0 0 2
#> 6 2 2 2 1
#> 7 1 1 1 1
#> 8 1 1 1 1
#> 9 1 1 1 1
#> 10 1 2 2 1
#> 11 1 1 1 1
#> 12 2 2 2 2
#> 13 1 1 1 1
#> 14 0 0 0 0
#> 15 1 1 1 1
#> 16 2 2 2 1
#> 17 0 1 1 0
#> 18 0 0 0 0
#> 19 0 0 1 1
#> 20 0 0 1 1
#> 21 1 1 1 1
#> 22 0 0 1 0
#> 23 2 2 2 2
#> 24 1 1 0 1
#> 25 2 2 3 3
#> 26 0 0 1 1
Created on 2020-03-07 by the reprex package (v0.3.0)
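The positional indexes above (`[`(4) and `[`(3)) depend on the page layout not changing. As a rough sketch of a less position-dependent variant, you could pick the script by its content and pull the array out with a regular expression. The inline HTML here is a made-up stand-in so the example runs offline; the real page's script contents are not shown in this thread, and only the variable name dataSet is taken from the code above. The sample values and the array-of-arrays shape are assumptions.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the downloaded page (assumed layout, made-up values).
page <- read_html('<html><body>
<table><tr><td headers="apcl1" class="text-center"></td></tr></table>
<script>var other = 1;</script>
<script>var dataSet = [["2007-08","2008-09"],["1","2"]];
var ready = true;</script>
</body></html>')

# The td cells are empty in the raw HTML, as in the question.
cells <- page %>% html_nodes("td") %>% html_text()

# Keep the first script whose text mentions dataSet, instead of indexing by position.
scripts <- page %>% html_nodes("script") %>% html_text()
target <- scripts[grepl("dataSet\\s*=", scripts)][1]

# Capture the text from the opening "[" up to the first "];" after "dataSet =".
# (?s) lets "." match line breaks so the rest of the script is discarded too.
json_txt <- sub("(?s).*?dataSet\\s*=\\s*(\\[.*?\\]);.*", "\\1", target, perl = TRUE)

# Parse the captured JSON text; an array of equal-length arrays becomes a matrix.
parsed <- fromJSON(json_txt)
```

If the real script assigns dataSet differently (for example, with nested objects or a "];" sequence inside a string), the pattern would need adjusting, so check the script text before relying on it.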
Upvotes: 2