Reputation: 907
I am trying to parse HTML in R, currently using the XML and RCurl packages to extract the information.
library(XML)
library(RCurl)
webpage <- getURL("http://www.imdb.com/title/tt0809504/fullcredits?ref_=tt_ov_wr#writers")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
head(x)
However, what I really want is to parse only a particular part of the HTML, starting with
<h4 class="dataHeaderWithBorder">Writing Credits
and ending with
<h4 name="cast" id="cast" class="dataHeaderWithBorder">
Any help would be appreciated immensely.
Upvotes: 0
Views: 180
Reputation: 269644
The question did not specify precisely what output is desired, but here is a self-contained example that returns only the node whose class is Z and whose text contains xyz.
library(XML)
Lines <- '<a>
<b class = "Z">abc - ABC</b>
<b class = "Z">xyz - XYZ</b>
<b>def - DEF</b>
</a>'
doc <- htmlTreeParse(Lines, asText = TRUE)
xpath <- "//b[@class = 'Z' and contains(., 'xyz')]"
getNodeSet(xmlRoot(doc), xpath)
giving:
[[1]]
<b class="Z">xyz - XYZ</b>
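The same idea can probably be stretched to the "between two headers" case from the question. What follows is only a sketch under assumptions: the inline HTML is a made-up stand-in for the real page (only the two h4 headers are taken from the question), and it assumes the headers and the tables between them are siblings under the same parent, which may not hold for the actual IMDb markup.

```r
library(XML)

# Hypothetical stand-in for the page; names and table contents are invented.
html <- '<html><body>
<h4 class="dataHeaderWithBorder">Directed by</h4>
<table><tr><td>Some Director</td></tr></table>
<h4 class="dataHeaderWithBorder">Writing Credits</h4>
<table><tr><td>Writer One</td></tr><tr><td>Writer Two</td></tr></table>
<h4 name="cast" id="cast" class="dataHeaderWithBorder">Cast</h4>
<table><tr><td>Actor One</td></tr></table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Sibling elements that come after the "Writing Credits" header
# and still have the cast header somewhere after them.
xpath <- paste0(
  "//h4[contains(., 'Writing Credits')]/following-sibling::*",
  "[following-sibling::h4[@id = 'cast']]"
)
nodes <- getNodeSet(doc, xpath)

# Pull the cell text out of whatever was found between the two headers.
writers <- unlist(lapply(nodes, function(n) xpathSApply(n, ".//td", xmlValue)))
writers <- trimws(writers)
print(writers)
```

If the real page wraps the headers or tables in extra div elements, the sibling axes above will not match and the XPath would need to be adjusted to the actual structure.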
Upvotes: 1