Nicklas

Reputation: 435

Parsing XML with apostrophe

Taking the BBC News RSS feed for example, one of their news items is as follows:

<item><title>Pupils 'bullied on sports field'</title><description>bla bla..

I have some Java code that parses this. However, when a title contains an apostrophe (as above), parsing of the title stops there, so I end up with the following title: Pupils ' and then it moves on and parses the description (which is fine). How do I get it to parse the full title? The following segment of code is from inside the for loop where I parse the info:

NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
// getCharacterDataFromElement already returns a String, so no toString() is needed
tmp.setTitle(getCharacterDataFromElement(line));

The exact same code is used to parse the other elements, such as description and pubDate, and those all work fine.

This is the getCharacterDataFromElement method:

public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}

What am I doing wrong? I use the DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to work with the RSS Feed.

Upvotes: 2

Views: 1775

Answers (3)

prunge

Reputation: 23248

As davidfrancis suggested, you should iterate over all children in getCharacterDataFromElement().

Alternatively, if you can use DOM Level 3, you can call Node.getTextContent() instead. It returns the concatenated text of the element and all of its descendants, which is what you want here.

NodeList title = element.getElementsByTagName("title");
Element line = (Element)title.item(0);
tmp.setTitle(line.getTextContent());
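
To illustrate the difference, here is a minimal sketch that builds a title element by hand with its text split across two adjacent text nodes. That split is an assumption about what the parser produced for the feed; the class name TextContentDemo and the helper splitTitle() are made up for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TextContentDemo {

    // Hypothetical helper: a <title> element whose text is deliberately
    // split across two adjacent text nodes.
    static Element splitTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Pupils "));
            title.appendChild(doc.createTextNode("'bullied on sports field'"));
            return title;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Element title = splitTitle();
        // Looking only at the first child sees just the first text node.
        System.out.println(title.getFirstChild().getNodeValue());
        // getTextContent() concatenates the text of all descendants.
        System.out.println(title.getTextContent());
    }
}
```

Here the first line printed is only "Pupils ", while getTextContent() gives the whole title.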

Upvotes: 0

Puce

Reputation: 38132

Well, AFAIK, apostrophe is a reserved character in XML and thus should be encoded as &apos;.

This means the BBC News RSS feed doesn't provide well-formed XML.

The best thing would be to issue a bug report to the BBC News RSS feed provider so that they fix it.

Upvotes: -1

davidfrancis

Reputation: 3849

Your getCharacterDataFromElement only looks at the first child. The parser may have split the title text into several child nodes, so loop over all of the element's children and tack all the text together.
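
Something along these lines should do it. This is only a sketch: the class name AllChildrenText and the sample() helper are made up for the example, and it checks for Text rather than CharacterData so that comment nodes are skipped (CDATA sections are still included, since CDATASection extends Text). It only looks at direct children, not nested elements.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class AllChildrenText {

    // Concatenates the data of every direct text/CDATA child of e,
    // instead of stopping at the first child.
    public static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) {
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    // Hypothetical test data: a <title> whose text is split into two text
    // nodes with a comment in between.
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Pupils "));
            title.appendChild(doc.createComment("ignored"));
            title.appendChild(doc.createTextNode("'bullied on sports field'"));
            return title;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(getCharacterDataFromElement(sample()));
    }
}
```

If the method still returns a truncated title, it is worth printing the node types of the children to see what the parser actually produced.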

HTH - DF

Upvotes: 2
