Reputation: 70
I'm trying to read the HTML from a website in order to scrape some data, but I'm getting an error I don't understand.
Here is an example link: www.boxofficemojo.com/movies/?id=avatar.htm
Here's the code:
library(RCurl)
library(XML)
library(rvest)
url <- paste0("www.boxofficemojo.com",movies_table[1,1])
webpage <- read_html(url)
gross_data_html <- html_nodes(webpage,".mp_box_content b")
And the results:
library(RCurl)
library(XML)
library(rvest)
url <- paste0("www.boxofficemojo.com",movies_table[1,1])
webpage <- read_html(url)
> Error: 'www.boxofficemojo.com/movies/?id=avatar.htm' does not exist in current working directory ('C:/Users/Will/Documents').
gross_data_html <- html_nodes(webpage,".mp_box_content b")
> Error in html_nodes(webpage, ".mp_box_content b") : object 'webpage' not found
Why is this happening? Does it have something to do with the file type being .htm instead of .html?
Upvotes: 0
Views: 440
Reputation: 11957
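If you are passing a URL to read_html, it needs to start with "http://" (or "https://"); otherwise the function assumes the input is a local file path, which does not exist here. That is why the error mentions the current working directory. The .htm extension has nothing to do with it.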
Wrong:
read_html('www.boxofficemojo.com/movies/?id=avatar.htm')
Right:
read_html('http://www.boxofficemojo.com/movies/?id=avatar.htm')
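Applied to the code in the question, one way to guard against this is to add the scheme before calling read_html. The sketch below is only an illustration: ensure_scheme is a hypothetical helper (not part of rvest), and path is a stand-in for movies_table[1,1], which I am assuming holds something like "/movies/?id=avatar.htm".

```r
# Hypothetical helper (not part of rvest): prepend a scheme when none is present.
ensure_scheme <- function(url, scheme = "http://") {
  # TRUE for strings that already start with something like "http://" or "https://"
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  ifelse(has_scheme, url, paste0(scheme, url))
}

# Assumed stand-in for movies_table[1,1] from the question.
path <- "/movies/?id=avatar.htm"
url <- ensure_scheme(paste0("www.boxofficemojo.com", path))
print(url)

# With the scheme in place, read_html() treats the string as a URL
# rather than a local file path (left commented out: needs network access):
# library(rvest)
# webpage <- read_html(url)
# gross_data_html <- html_nodes(webpage, ".mp_box_content b")
```

Simply writing paste0("http://www.boxofficemojo.com", movies_table[1,1]) works just as well; the helper is only useful if some of your inputs already include a scheme.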
Upvotes: 1