Reputation: 435
Taking the BBC News RSS feed for example, one of their news items is as follows:
<item><title>Pupils 'bullied on sports field'</title><description>bla bla..
I have some Java code parsing this. However, when a title contains an apostrophe (as above), the parsing stops, so I end up with the title "Pupils '". It then continues on and parses the description (which is fine). How do I get it to parse the full title? The following is a segment of code from inside my for loop where I parse the info:
NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
tmp.setTitle(getCharacterDataFromElement(line).toString());
The exact same code is used to parse the other elements, such as description and pubDate, which are all fine.
This is the getCharacterDataFromElement method:
public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
What am I doing wrong? I use the DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to work with the RSS Feed.
Upvotes: 2
Views: 1775
Reputation: 23248
As davidfrancis suggested, you should iterate over all children in getCharacterDataFromElement().
Alternatively, if you can use DOM Level 3, you can use the Node.getTextContent() method instead, which does what you want.
NodeList title = element.getElementsByTagName("title");
Element line = (Element)title.item(0);
tmp.setTitle(line.getTextContent());
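If you want to try this outside your feed code, here is a minimal, self-contained sketch. The class name TextContentDemo, the helper firstTitle and the inline sample XML are made up for illustration, and I am assuming the real feed escapes the apostrophes as &apos; entities, which the question does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the asker's code.
final class TextContentDemo {

    // Parses the given XML string and returns the full text of the first <title>.
    static String firstTitle(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        // getTextContent() concatenates the text of all descendant nodes,
        // so it does not matter how many text nodes the parser produced.
        return title.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Sample item; the &apos; escaping is an assumption about the feed.
        String xml = "<item><title>Pupils &apos;bullied on sports field&apos;</title>"
                + "<description>bla bla..</description></item>";
        System.out.println(firstTitle(xml)); // prints: Pupils 'bullied on sports field'
    }
}
```

The point of getTextContent() here is that you no longer depend on the title being stored as a single child node.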
Upvotes: 0
Reputation: 38132
Well, AFAIK, the apostrophe is a reserved character in XML and thus should be encoded as &apos;.
This means the BBC News RSS feed doesn't provide well-formed XML.
The best thing would be to issue a bug report to the BBC News RSS feed provider so that they fix it.
Upvotes: -1
Reputation: 3849
Your getCharacterDataFromElement only looks at the first child. See if there are further child nodes too, and tack all the text together.
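Something along these lines might do it. This is only a rough sketch: the class name TitleText and the method name getAllCharacterData are invented here, and the demo builds the title element by hand with several text nodes to imitate what I assume the parser is handing you.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of the asker's code.
final class TitleText {

    // Walks every direct child of the element and joins the text and
    // CDATA content, instead of stopping after the first child.
    static String getAllCharacterData(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            // Comments are also CharacterData, so check the node type explicitly.
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        // Build a <title> whose text is split across several text nodes.
        Element title = doc.createElement("title");
        title.appendChild(doc.createTextNode("Pupils "));
        title.appendChild(doc.createTextNode("'"));
        title.appendChild(doc.createTextNode("bullied on sports field"));
        title.appendChild(doc.createTextNode("'"));
        System.out.println(getAllCharacterData(title)); // prints: Pupils 'bullied on sports field'
    }
}
```

In your loop you would then call it the same way as before, e.g. tmp.setTitle(getAllCharacterData(line)).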
HTH - DF
Upvotes: 2