Reputation: 41
Hello R's XML package users,
I am encountering a weird bug while parsing XML. It occurs when the XML files contain HTML entities such as &mdash; and &ndash;.
This is the code I use:
InText = readLines(xmlFileName,n=-1)
Text = xmlValue(xmlRoot(xmlTreeParse(InText,trim=FALSE)))
I am currently eliminating entities like mdash and ndash with the following:
InText = gsub("\\&mdash"," ",InText);
InText = gsub("\\&ndash"," ",InText);
But this gets really tedious, given how long the list of HTML 4.0 entities is.
Any ideas on how I can eliminate these while parsing the XML files?
Thanks a lot for your help and ideas.
Shivani
Upvotes: 1
Views: 601
Reputation: 10215
Try the HTML readers in the XML package, such as htmlParse or readHTMLTable; they are robust and handle quite a few of these cases. See also "Scraping html tables into R data frames using the XML package".
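A minimal sketch of that suggestion, assuming htmlParse from the XML package is the kind of reader meant here; the input string is made up for illustration. The HTML parser knows the HTML named entities, so &mdash; is resolved instead of raising an error:

```r
library(XML)

# Hypothetical input containing an HTML-only named entity
in_text <- "<html><head><title>Test &mdash; Test again</title></head><body>Hello world</body></html>"

# Parse with the lenient HTML parser rather than the strict XML parser
doc <- htmlParse(in_text, asText = TRUE)

# Collect the text content of the whole document
text <- xmlValue(xmlRoot(doc))
```

The HTML parser is forgiving and may normalize the markup (for example by adding html and body wrappers), so this is better suited to pulling out text than to processing that depends on the exact XML structure.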
Upvotes: 1
Reputation: 10923
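If you simply want to remove all named HTML entities before parsing, use a regex. Note that this also strips the entities XML itself defines (&amp;, &lt;, &gt;, &quot;, &apos;).
library("XML")
# Sample input containing an HTML-only named entity (&mdash;)
InText <- "<html>
<head>
<title>Test &mdash; Test again</title>
</head>
<body>Hello world</body>
</html>"
# Remove every named entity of the form &name;
InText <- gsub("&[A-Za-z0-9]+;", "", InText)
# asText = TRUE tells the parser that InText is XML content, not a file name
Text <- xmlValue(xmlRoot(xmlTreeParse(InText, asText = TRUE, trim = FALSE)))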
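A blanket regex also deletes the five entities that XML itself defines (&amp;, &lt;, &gt;, &quot;, &apos;), which the XML parser handles fine. If you would rather keep those, here is a rough base-R sketch; strip_html_entities is a hypothetical helper name, not something from the XML package:

```r
# Hypothetical helper: replace HTML-only named entities, keep XML's own five
strip_html_entities <- function(x, replacement = " ") {
  # Entities predefined by XML; these must survive untouched
  keep <- c("amp", "lt", "gt", "quot", "apos")
  # Negative lookahead skips the predefined names, then match &name;
  pattern <- sprintf("&(?!(?:%s);)[A-Za-z][A-Za-z0-9]*;",
                     paste(keep, collapse = "|"))
  # perl = TRUE is needed for the lookahead; gsub is vectorized over x
  gsub(pattern, replacement, x, perl = TRUE)
}

cleaned <- strip_html_entities("a &amp; b&mdash;c", replacement = "-")
```

Because gsub works on character vectors, the same call can be applied directly to the output of readLines before handing the text to xmlTreeParse.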
Upvotes: 1