Shivani Rao

Reputation: 41

R XML package: dealing with HTML entities in an XML file

Hello R's XML package users,

I am encountering a weird bug while parsing XML. It has to do with encountering HTML entities like mdash and ndash while parsing XML files.

This is the code I use:

InText = readLines(xmlFileName,n=-1)
Text = xmlValue(xmlRoot(xmlTreeParse(InText,trim=FALSE)))

I am currently eliminating entities like mdash and ndash with the following:

InText = gsub("&mdash;", " ", InText, fixed = TRUE)
InText = gsub("&ndash;", " ", InText, fixed = TRUE)

But this can get really tedious, given how long the list of HTML 4.0 entities is.

Any ideas on how I can eliminate these while parsing the XML files?

Thanks a lot for your help and ideas,
Shivani

Upvotes: 1

Views: 601

Answers (2)

Dieter Menne

Reputation: 10215

Try the HTML parsers in the XML package, such as htmlParse or readHTMLTable; they are robust and can handle quite a few of these cases, including HTML named entities. See also "Scraping html tables into R data frames using the XML package".
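
A minimal sketch of the HTML-parser route, assuming the XML package is installed and that htmlParse is the entry point meant here. The sample string is made up for illustration. The HTML parser knows the HTML named entities, so nothing has to be stripped by hand:

```r
library(XML)

# Made-up sample containing HTML-only entities that a strict XML parser rejects
InText <- "<doc><p>Pages 10&ndash;12 &mdash; see notes</p></doc>"

# asText = TRUE tells the parser that InText is content, not a file name
doc <- htmlParse(InText, asText = TRUE)

# The HTML parser wraps the content in <html><body>; xmlValue collects all text
Text <- xmlValue(xmlRoot(doc))
```

With this approach the entities should come back as the corresponding characters rather than being dropped, so whether it fits depends on whether you want them kept or removed.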

Upvotes: 1

daedalus

Reputation: 10923

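If you simply want to remove all named HTML entities, use a regex. Note that the pattern below removes anything of the form &...;, so valid XML entities such as &amp; are stripped as well: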

library("XML")

InText <- "<html>\
<head>\
    <title>Test &amp; Test again</title>\
</head>\
    <body>Hello &nbsp; world</body>\
</html>"

InText <- gsub("\\&[^;]+;","",InText)

Text <-  xmlValue(xmlRoot(xmlTreeParse(InText,trim=FALSE)))
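
If the five entities that XML itself predefines (&amp;, &lt;, &gt;, &quot;, &apos;) should survive, a narrower pattern may work better. The following is only a sketch: strip_html_entities is a hypothetical helper name, the sample string is made up, and replacing with a space rather than an empty string is an assumption borrowed from the question.

```r
# Hypothetical helper: blank out named entities except the five XML predefines.
# perl = TRUE is needed for the negative lookahead (?! ... ).
strip_html_entities <- function(x) {
  pattern <- "&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;"
  gsub(pattern, " ", x, perl = TRUE)
}

# Made-up sample: &amp; is valid XML, &nbsp; and &mdash; are HTML-only
InText <- "<doc><title>Test &amp; Test again</title><body>Hello&nbsp;world &mdash; done</body></doc>"

Cleaned <- strip_html_entities(InText)
```

Numeric references such as &#8212; are left alone because the pattern requires a letter right after the ampersand. Cleaned can then be passed to xmlTreeParse(Cleaned, asText = TRUE) as before.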

Upvotes: 1
