Reputation: 724
I'm currently trying to scrape text from an HTML tree that I've parsed as follows:-
require(RCurl)
require(XML)
query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)
query.IMDB
query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")
My first attempt was just to use grep on the resulting vector, but this fails.
data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable
My next attempt was to use grep on the individual points in the query.IMDB vector:-
vect <- numeric(length(df.IMDB))
for (i in 1:length(df.IMDB)){
vect[i] <- data[grep("Users rated this", "", df.IMDB)]
}
but this also throws the closure not subsettable error.
Finally trying the above function without data[]
around the grep
throws
Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero
I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9]
following the given text string with blank space, but I'm doing a simpler version first to get the thing working.
Can anyone advise what function I should be using to edit the text in each point on my query.IMDB vector
Upvotes: 0
Views: 1286
Reputation: 121568
No need to use grep
here (AVoid regular expression with HTML files). Use the handy function readHTMLTable
from XML
package:
library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire 8.2 2,694
2 Bart the Genius 7.8 1,167
3 Homer's Odyssey 7.5 1,005
4 There's No Disgrace Like Home 7.9 1,017
5 Bart the General 8.0 992
6 Moaning Lisa 7.4 988
This give you the table of ratings,... Maybe you should convert UserVotes to a numeric.
Upvotes: 1