thchar

Reputation: 43

How to scrape the budget value of a movie from IMDB using rvest

I have tried to scrape the gross and budget values from IMDB.com using the rvest package, but it doesn't work. My code is:

library(rvest)
movie <- html("http://www.imdb.com/title/tt1490017/")
movie %>%
  html_node("#budget .itemprop") %>%
  html_text() %>%
  as.numeric()

and I get

numeric(0)

Upvotes: 1

Views: 1178

Answers (2)

mpalanco

Reputation: 13580

Sam Firke provided a very neat solution. I'm posting mine only to show an alternative way to extract the numeric value. Like Sam Firke, I used SelectorGadget to find the selector. The html function still seems to work fine. I didn't know tidyr had that handy function, so I used gsub instead:

library(rvest)    
movie <- html("http://www.imdb.com/title/tt1490017/") 
movie %>% 
  html_node(".txt-block:nth-child(11)") %>%
  html_text() %>% 
  gsub("\\D", "", .) %>% 
  as.numeric()

Output:

[1] 6e+07
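
To see what the gsub step does without hitting the network, here is a minimal sketch on a made-up string. The sample text is only an assumption about roughly what the node's text looks like; it is not copied from the IMDB page.

```r
# Hypothetical sample of the text a budget node might contain
budget_text <- "\n  Budget: $60,000,000 (estimated)\n"

# "\\D" matches any non-digit character, so only the digits remain
digits_only <- gsub("\\D", "", budget_text)
digits_only
#> [1] "60000000"

# Convert the remaining digit string to a number
budget <- as.numeric(digits_only)
budget
#> [1] 6e+07
```

Note that this concatenates every digit in the text, so it would give a wrong value if the node contained a second number or a decimal point.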

Upvotes: 0

Sam Firke

Reputation: 23024

You can get the budget value like this:

library(tidyr) # for extract_numeric
library(rvest)
movie <- read_html("http://www.imdb.com/title/tt1490017/")
movie %>%
  html_nodes("#titleDetails :nth-child(11)") %>%
  html_text() %>%
  extract_numeric()

[1] 6e+07

Your example looks similar to one in the rvest package vignette. That vignette suggests using SelectorGadget, which I used to find a CSS selector that returns only the Budget element. To see that element, run everything except the last piped line; you'll see why I chose to parse it with extract_numeric from tidyr.

You'll need the latest version of rvest to run this, as I'm using the read_html() function, which has replaced the html() function used in your example.
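
A positional selector such as :nth-child(11) depends on the page layout staying the same. As a rough sketch of one alternative, you could collect every block and keep the one whose text mentions "Budget". The HTML fragment below is made up for illustration (the real page markup may differ), and the .txt-block class is borrowed from the selector in the other answer.

```r
library(rvest)

# Made-up fragment standing in for the details section of the page;
# the real markup may differ and can change over time
page <- read_html('
<div id="titleDetails">
  <div class="txt-block"><h4>Country:</h4> USA</div>
  <div class="txt-block"><h4>Budget:</h4> $60,000,000 (estimated)</div>
</div>')

# Collect the text of every block under #titleDetails
blocks <- page %>%
  html_nodes("#titleDetails .txt-block") %>%
  html_text()

# Keep only the block that mentions "Budget", regardless of its position
budget_text <- blocks[grepl("Budget", blocks)]
budget_text

# Strip everything that is not a digit and convert to a number
budget <- as.numeric(gsub("\\D", "", budget_text))
budget
#> [1] 6e+07
```

Printing budget_text before the last step shows the raw string, which makes it easier to decide how to parse it.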

Upvotes: 1
