Reputation: 43
I am trying to scrape the gross and budget values from IMDb using the rvest
package, but I can't get it to work. My code is:
library(rvest)
movie <- html("http://www.imdb.com/title/tt1490017/")
movie %>%
html_node("#budget .itemprop") %>%
html_text() %>%
as.numeric()
and I get
numeric(0)
Upvotes: 1
Views: 1178
Reputation: 13580
Sam Firke provided a very neat solution. I am posting mine just to show an alternative way to extract the numeric value. Like Sam Firke, I used SelectorGadget. The html function seems to work fine. Instead of tidyr (I didn't know it had that handy function), I used gsub:
library(rvest)
movie <- html("http://www.imdb.com/title/tt1490017/")
movie %>%
html_node(".txt-block:nth-child(11)") %>%
html_text() %>%
gsub("\\D", "", .) %>%
as.numeric()
Output:
[1] 6e+07
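To see what the gsub step does without depending on the live page, here is a minimal sketch on a literal string. The value of budget_text is an assumption about roughly what html_text() returns for the budget block; the real text may differ.

```r
# Hypothetical text, standing in for the html_text() result of the budget block
budget_text <- "\n    Budget: $60,000,000\n    (estimated)\n  "

# "\\D" matches every non-digit character, so only the digits survive
digits <- gsub("\\D", "", budget_text)

# Convert the remaining digit string to a number
budget <- as.numeric(digits)
budget
```

Note that this keeps every digit in the text, so it only works when the selected node contains a single number.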
Upvotes: 0
Reputation: 23024
You can get the budget value like this:
library(tidyr) # for extract_numeric
library(rvest)
movie <- read_html("http://www.imdb.com/title/tt1490017/")
movie %>%
html_nodes("#titleDetails :nth-child(11)") %>%
html_text() %>%
extract_numeric()
[1] 6e+07
Your example looks similar to an example in the rvest package vignette. That vignette suggests using SelectorGadget, which I used to find the CSS selector that would return only the Budget element. To see that element, run all but the last piped line of this series, and you'll see why I chose to parse it with extract_numeric from tidyr.
You'll need the latest version of rvest to run this, as I'm using the read_html() function, which has replaced the html() function used in your example.
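For anyone on a newer setup: extract_numeric() has since been deprecated in tidyr in favor of readr::parse_number(). The sketch below shows the same idea on an inline HTML fragment, so it runs without hitting IMDb. The fragment and its class names are assumptions for illustration, not the site's actual markup, and picking the block by its "Budget" label instead of its position is my own variation.

```r
library(rvest)
library(readr)  # for parse_number

# Hypothetical fragment; the real IMDb markup may differ and changes over time
page <- read_html('
<div id="titleDetails">
  <div class="txt-block"><h4>Country:</h4> USA</div>
  <div class="txt-block"><h4>Budget:</h4> $60,000,000 (estimated)</div>
</div>')

# Collect the text of every detail block
blocks <- page %>%
  html_nodes("#titleDetails .txt-block") %>%
  html_text()

# Keep the block that mentions "Budget", whatever its position on the page
budget_text <- blocks[grepl("Budget", blocks)]

# parse_number drops the surrounding text and the grouping commas
budget <- parse_number(budget_text)
budget
```

Matching on the label is less brittle than :nth-child(11), which breaks as soon as the page adds or removes a detail row.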
Upvotes: 1