wjang4

Reputation: 127

Web-scraping in R

I am practicing web scraping in R, and I get stuck at the same step no matter which website I try.

For example,

https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music

My goal is to extract the names of all 77 schools (Oxford to London Metropolitan).

So I tried...

library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()

Using the browser's developer tools (F12), I found that all the school names sit in elements with the class 'league-table-institution-name', which is why I passed '.league-table-institution-name' to html_nodes().

What have I done wrong?

Upvotes: 0

Views: 210

Answers (1)

neilfws

Reputation: 33782

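You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then on info, which is already the xml_nodeset of matching elements. The second call searches inside those elements for descendants with the same class, finds none, and so returns nothing.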
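To see what the second html_nodes() call does, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption for illustration only, not the real page. The first call returns a node set; applying the same selector to that node set looks only among the descendants of the already-matched elements, so with current rvest it should come back empty.

```r
library(rvest)

# Hypothetical markup that only borrows the class name from the question
page <- read_html(paste0(
  '<div>',
  '<a class="league-table-institution-name">Oxford</a>',
  '<a class="league-table-institution-name">Cambridge</a>',
  '</div>'
))

# First call: selects the two <a> elements from the document
first_pass <- html_nodes(page, ".league-table-institution-name")

# Second call: searches *inside* those <a> elements for the same class
second_pass <- html_nodes(first_pass, ".league-table-institution-name")

length(first_pass)     # number of elements matched in the document
length(second_pass)    # number of matches nested inside those elements
html_text(first_pass)  # the text is available after the first call
```

Calling html_text() directly on the result of the first call is therefore enough.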

Try this instead:

url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text()

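You'll then need an additional step to clean up the school names. One suggestion is to append this to the pipeline above (str_replace_all() comes from the stringr package, so load it with library(stringr) first):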

%>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
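As a rough illustration of what that pattern does, the sketch below applies it to a few invented strings. The raw values are hypothetical stand-ins for scraped text with stray whitespace and rank numbers, not output from the real page. The pattern removes any run of non-letters at the very start or very end of each string and leaves the interior untouched.

```r
library(stringr)

# Hypothetical raw values, imitating scraped text with padding and rank numbers
raw_names <- c("\n  1 Oxford \n", "  2  Cambridge", "77 London Metropolitan\n")

# Strip leading and trailing runs of anything that is not an ASCII letter
clean_names <- str_replace_all(raw_names, "(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")

clean_names
```

Because the character class only covers a-z and A-Z, a name that begins or ends with an accented letter, a digit, or a closing parenthesis would lose that character too, so it is worth checking the cleaned names against the page.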

Upvotes: 3
