wscheib

Reputation: 70

Error encountered when reading html code for a website in R

I'm trying to read the html code from a website in order to scrape some data, but I'm getting a weird error.

Here is an example link: www.boxofficemojo.com/movies/?id=avatar.htm

Here's the code:

library(RCurl)
library(XML)
library(rvest)

url <- paste0("www.boxofficemojo.com",movies_table[1,1])

webpage <- read_html(url)

gross_data_html <- html_nodes(webpage,".mp_box_content b")

And results:

library(RCurl)
library(XML)
library(rvest)

url <- paste0("www.boxofficemojo.com",movies_table[1,1])

webpage <- read_html(url)
> Error: 'www.boxofficemojo.com/movies/?id=avatar.htm' does not exist in current working directory ('C:/Users/Will/Documents').

gross_data_html <- html_nodes(webpage,".mp_box_content b")
> Error in html_nodes(webpage, ".mp_box_content b") : object 'webpage' not found

Why is this happening? Does it have something to do with the file type being .htm instead of .html?

Upvotes: 0

Views: 440

Answers (1)

jdobres

Reputation: 11957

If you pass a URL to read_html, it must start with a scheme such as "http://". Otherwise the function treats the input as a local file path, which does not exist here, hence the "does not exist in current working directory" error. The .htm extension is not the cause.

Wrong:

read_html('www.boxofficemojo.com/movies/?id=avatar.htm')

Right:

read_html('http://www.boxofficemojo.com/movies/?id=avatar.htm')
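Applied to the code in the question, one way to guard against this is to add the scheme whenever it is missing before calling read_html. The following is only a sketch: ensure_scheme is a hypothetical helper written for this example (it is not part of rvest), and the path value is assumed from the error message to be what movies_table[1,1] holds.

```r
# Hypothetical helper (not an rvest function): prepend a scheme
# to any URL that does not already start with one.
ensure_scheme <- function(url, scheme = "http://") {
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(has_scheme, url, paste0(scheme, url))
}

# Assumed stand-in for movies_table[1,1], taken from the error message.
path <- "/movies/?id=avatar.htm"

url <- ensure_scheme(paste0("www.boxofficemojo.com", path))
print(url)

# With the scheme in place, the original calls can be used unchanged
# (commented out here because they need network access):
# library(rvest)
# webpage <- read_html(url)
# gross_data_html <- html_nodes(webpage, ".mp_box_content b")
```

URLs that already carry a scheme (for example "https://...") pass through unchanged, so the helper can be applied to every row of a table of links without checking each one first.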

Upvotes: 1
