Reputation: 10441
I am reasonably comfortable using R's rvest package to scrape websites, but I am stuck on something new. I am trying to scrape the main table of colleges from this page: http://www.naia.org/ViewArticle.dbml?ATCLID=205323044
Here is what my code looks like currently:
library(rvest)

NAIA_url <- "http://www.naia.org/ViewArticle.dbml?ATCLID=205323044"
NAIA_page <- read_html(NAIA_url)

# Returns a length-2 list, but neither element is the table I want.
tables <- html_table(html_nodes(NAIA_page, "table"))

# Grab the iframe node that should contain the table (the third one on the page).
iframe <- html_nodes(NAIA_page, "iframe")[3]
I am stuck beyond this point. (1) For some reason, calling html_nodes on the page does not return the table I want. (2) I am not sure whether I should instead grab the iframe and then extract the table from within it.
Any help appreciated!
Upvotes: 2
Views: 2783
Reputation: 5008
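The table is not part of the page's own HTML. It lives in the separate document that the iframe loads, which is why html_nodes on the parent page does not find it. If the embedded iframe is HTML, you can read the iframe's src URL and extract the table from there.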
library(rvest)
#> Loading required package: xml2
library(magrittr)
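"http://www.naia.org/ViewArticle.dbml?ATCLID=205323044" %>%
  read_html() %>%                        # parse the parent page
  html_nodes("iframe") %>%               # all iframes on that page
  extract(3) %>%                         # keep the third one (magrittr's alias for `[`)
  html_attr("src") %>%                   # URL of the document the iframe loads
  read_html() %>%                        # parse that document instead
  html_node("#searchResultsTable") %>%   # the table of colleges
  html_table() %>%
  head()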
#> College or University City, State
#> 1 Central Christian College ATHLETICS McPherson, KS
#> 2 + Crowley's Ridge College ATHLETICS Paragould, AR
#> 3 Edward Waters College ATHLETICS Jacksonville, Fl
#> 4 Fisher College ADMISSIONS | ATHLETICS Boston, MA
#> 5 Georgia Gwinnett College ADMISSIONS | ATHLETICS Lawrenceville, GA
#> 6 Lincoln Christian University ADMISSIONS | ATHLETICS Lincoln, IL
#> Conference Enrollment
#> 1 A.I.I. 259
#> 2 A.I.I. 0
#> 3 A.I.I. 805
#> 4 A.I.I. 600
#> 5 A.I.I. 9,720
#> 6 A.I.I. 1,060
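If you want to try the same idea without hitting the live site, here is a minimal offline sketch. It assumes a made-up two-file "site" in a temp directory: the file names outer.html and inner.html, the single table row, and the variable names are all hypothetical and only there for illustration. The point is just that the parent document does not contain the table, while the document referenced by the iframe's src does.

```r
library(rvest)

# Build a tiny two-file site in a temp directory (hypothetical file names).
site <- tempfile("iframe_demo")
dir.create(site)

# The document the iframe loads: this is where the real table lives.
writeLines(
  '<html><body><table id="searchResultsTable">
     <tr><th>College</th><th>Enrollment</th></tr>
     <tr><td>Fisher College</td><td>600</td></tr>
   </table></body></html>',
  file.path(site, "inner.html")
)

# The parent page: only a layout table plus the iframe.
writeLines(
  '<html><body><table><tr><td>layout only</td></tr></table>
   <iframe src="inner.html"></iframe></body></html>',
  file.path(site, "outer.html")
)

outer_path <- file.path(site, "outer.html")
outer <- read_html(outer_path)

# Searching the parent page finds no node with that id.
n_in_outer <- length(html_nodes(outer, "#searchResultsTable"))

# Read the iframe's src, resolve it next to the parent file, and parse it.
src <- html_attr(html_node(outer, "iframe"), "src")
inner <- read_html(file.path(dirname(outer_path), src))
colleges <- html_table(html_node(inner, "#searchResultsTable"))
```

On a real site the src may be a relative URL rather than a full one. In that case xml2::url_absolute(src, page_url) should resolve it against the parent page's URL before you pass it to read_html().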
Upvotes: 6