Aaron Soderstrom

Reputation: 629

R Web scrape - Error

Okay, so I am stuck on what seems like it would be a simple web scrape. My goal is to scrape Morningstar.com to retrieve a fund name from a given URL. Here is an example of my code:

library(rvest)
url <- html("http://www.morningstar.com/funds/xnas/fbalx/quote.html")

url %>%
  read_html() %>%
  html_node('r_title') 

I would expect it to return the name Fidelity Balanced Fund, but instead I get the following error: {xml_missing}

Suggestions?

Aaron

edit:

I also tried scraping via an XHR request, but I think my issue is not knowing which CSS selector or XPath to use to find the appropriate data.

XHR code:

get.morningstar.Table1 <- function(Symbol.i, htmlnode) {
  # Requires library(httr) and library(rvest)
  try(res <- GET(url = "http://quotes.morningstar.com/fundq/c-header",
                 query = list(
                   t = Symbol.i,
                   region = "usa",
                   culture = "en-US",
                   version = "RET",
                   test = "QuoteiFrame"
                 )))

  # Assign the tryCatch result to x; assigning inside the error handler
  # (error = function(e) x <- NA) only sets a local x that is lost
  x <- tryCatch(content(res) %>%
                  html_nodes(htmlnode) %>%
                  html_text() %>%
                  trimws(),
                error = function(e) NA)
  return(x)
} # htmlnode in this case is a vkey

Still, the question remains: am I using the correct CSS selector/XPath to look this up? The XHR code works great for requests that have a clear CSS selector.
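To make the selector question concrete, here is a self-contained sketch of how an attribute selector picks out a vkey span. The markup is a made-up stand-in for what the c-header response might contain; the actual vkey values need to be confirmed by inspecting the real response:

```r
library(rvest)

# Made-up fragment imitating the c-header response; real vkey values may differ
page <- read_html('<div><span vkey="NAV">13.50</span><span vkey="DayChange">0.05</span></div>')

page %>%
  html_nodes("span[vkey='NAV']") %>%
  html_text()
# [1] "13.50"
```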

Upvotes: 0

Views: 142

Answers (1)

IanK

Reputation: 386

OK, so it looks like the page dynamically loads the section you are targeting, so it doesn't actually get pulled in by read_html(). Interestingly, this part of the page also doesn't load using an RSelenium headless browser.

I was able to get this to work by scraping the page title (which is actually hidden on the page) and doing some regex to get rid of the junk:

library(rvest)

url <- 'http://www.morningstar.com/funds/xnas/fbalx/quote.html'

page <- read_html(url)

title <- page %>%
  html_node('title') %>%
  html_text()

symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")

cleanTitle <- gsub(regex, '\\1', title)
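To show what the regex does, here it is run on a made-up title string (the real page title will differ, but the shape is the same: the fund name sandwiched between two copies of the ticker):

```r
symbol <- 'FBALX'
regex <- paste0(symbol, " (.*) ", symbol, ".*")

# Made-up title for illustration only
exampleTitle <- 'FBALX Fidelity Balanced Fund FBALX Quote Price News'
gsub(regex, '\\1', exampleTitle)
# [1] "Fidelity Balanced Fund"
```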

As a side note, and for your future use, your first call to html_node() should include a "." before the class name you are targeting:

mypage %>%
  html_node('.myClass')

Again, this doesn't help in this specific case, since the page is failing to load the section we are trying to scrape.

A final note: other sites contain the same info and are easier to scrape (Yahoo Finance, for example).

Upvotes: 1
