Extract text from HTML node tree with R

Question

I'm currently trying to scrape text from an HTML tree that I've parsed as follows:

require(RCurl)
require(XML)

query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)

query.IMDB

query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")

My first attempt was just to use grep on the resulting vector, but this fails.

data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable

My next attempt was to use grep on the individual elements of the df.IMDB vector:

vect <- numeric(length(df.IMDB))

for (i in 1:length(df.IMDB)){

      vect[i] <- data[grep("Users rated this", "", df.IMDB)]

  }

but this also throws the closure not subsettable error.

Finally, trying the above loop without data[] around the grep throws:

Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero

I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm doing a simpler version first to get the thing working.

Can anyone advise what function I should be using to edit the text in each element of my df.IMDB vector?

agstudy · Accepted Answer

No need to use grep here (avoid regular expressions with HTML files). Use the handy function readHTMLTable from the XML package:

library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
                            Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire        8.2     2,694
2                   Bart the Genius        7.8     1,167
3                   Homer's Odyssey        7.5     1,005
4     There's No Disgrace Like Home        7.9     1,017
5                  Bart the General        8.0       992
6                      Moaning Lisa        7.4       988

This gives you the table of ratings. You may also want to convert UserVotes to numeric.
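A minimal sketch of that conversion, assuming the columns come back as factors or character strings with comma thousands separators, as in the output above. The `ratings` data frame here is a hypothetical stand-in built from a few of the rows shown, not a live call to readHTMLTable:

```r
# Hypothetical sample copied from the output above; in practice `ratings`
# would be the data frame returned by readHTMLTable().
ratings <- data.frame(
  Episode    = c("Simpsons Roasting on an Open Fire",
                 "Bart the Genius",
                 "Bart the General"),
  UserRating = c("8.2", "7.8", "8.0"),
  UserVotes  = c("2,694", "1,167", "992"),
  stringsAsFactors = TRUE  # assumption: the columns may arrive as factors
)

# Strip the thousands separator, then convert. gsub() coerces a factor to
# character first, so this works for factor and character columns alike.
ratings$UserVotes <- as.numeric(gsub(",", "", ratings$UserVotes, fixed = TRUE))

# Go through as.character() so a factor yields its labels, not its level codes.
ratings$UserRating <- as.numeric(as.character(ratings$UserRating))

str(ratings)
```

Calling as.numeric() directly on a factor would return the internal level codes rather than the displayed values, which is why the sketch converts via character first.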
