Reputation: 391
I need to parse an XML file, whatever tags it contains, and read the text of all its leaves (text-only elements). I'm using StAX, but there seems to be no way to know in advance that an element is text-only (so getElementText throws an exception for non-leaf elements). So I decided to use a filter that keeps only element tags, and iterate through the document this way:
InputStream in = null;
try {
    in = new FileInputStream("file.xml");
    DatiEstratti de = DatiEstratti.getInstance();
    // Event-based processing
    XMLInputFactory factory = (XMLInputFactory) XMLInputFactory.newInstance();
    XMLEventReader eventReader = factory.createXMLEventReader(in);
    // use the filter to keep only element tags
    eventReader = factory.createFilteredReader(eventReader, new ElementOnlyFilter());
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        if (event.getEventType() == XMLStreamConstants.START_ELEMENT) {
            StartElement startElement = event.asStartElement();
            XMLEvent peekEvent = eventReader.peek();
            if (peekEvent.isEndElement()) {
                // this is the first time a pop is about to happen,
                // so this element is a leaf.
                // retrieve its value.
                String value = eventReader.getElementText();
                logger.info("dato : " + value);
            }
            String nome = startElement.getName().getLocalPart();
            String prefix = startElement.getName().getPrefix();
            if (prefix != null) {
                nome = prefix + ":" + nome;
            }
            de.push(nome);
            logger.info("push : " + de.stampaPercorso());
        } else if ((event.getEventType() == XMLStreamConstants.END_ELEMENT)) {
            de.pop();
            logger.info("pop : " + de.stampaPercorso());
            if (0 > de.nLivelliPercorso()) {
                break;
            }
        }
        //handle more event types here...
    }
... where the filter is:
public class ElementOnlyFilter implements EventFilter, StreamFilter {
    /* implementation of EventFilter interface */
    @Override
    public boolean accept(XMLEvent event) {
        return acceptInternal(event.getEventType());
    }

    /* implementation of StreamFilter interface */
    @Override
    public boolean accept(XMLStreamReader reader) {
        return acceptInternal(reader.getEventType());
    }

    /* internal utility method */
    private boolean acceptInternal(int eventType) {
        return eventType == XMLStreamConstants.START_ELEMENT
            || eventType == XMLStreamConstants.END_ELEMENT;
    }
}
The problem is that I get the following exception when a leaf is found:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,42]
Message: parser must be on START_ELEMENT to read next text
at com.sun.xml.internal.stream.XMLEventReaderImpl.getElementText(XMLEventReaderImpl.java:114)
at javax.xml.stream.util.EventReaderDelegate.getElementText(EventReaderDelegate.java:88)
at xmlparser.XmlParser.main(XmlParser.java:63)
I wonder why. Is there a fault in this code? I thought peek() does not change the reader, so getElementText() should be called on a start element. Is there another way to accomplish my goal?
Upvotes: 1
Views: 7176
Reputation: 122374
Firstly, if you filter to include only start and end element events then you won't see the text contained inside your leaf nodes at all. I would use a different approach, with an unfiltered stream, like this:
XMLEventReader eventReader = factory.createXMLEventReader(in);
StringBuilder content = null;
while (eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    if (event.isStartElement()) {
        // other start element processing here
        content = new StringBuilder();
    } else if (event.isEndElement()) {
        if (content != null) {
            // this was a leaf element
            String leafText = content.toString();
            // do something with the leaf node
        } else {
            // not a leaf
        }
        // in all cases, discard content
        content = null;
    } else if (event.isCharacters()) {
        if (content != null) {
            content.append(event.asCharacters().getData());
        }
    }
    // other event types here
}
The trick is the content = null at the end of the end element section - on entry to the if(event.isEndElement()) block, if content is non-null then you know there have been no intervening end element events between this one and its corresponding start tag, i.e. it's a leaf node.
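For anyone who wants to try this out, here is a minimal self-contained sketch of the same idea. The class name LeafTextExtractor, the method leafTexts and the sample XML are made up for illustration and are not from the question; it reads from a String instead of a file only to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, only for illustration.
class LeafTextExtractor {

    // Returns the text of every leaf element, in document order.
    static List<String> leafTexts(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Unfiltered reader: CHARACTERS events must be visible.
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        List<String> leaves = new ArrayList<>();
        StringBuilder content = null;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    // A new element starts: begin collecting its text.
                    content = new StringBuilder();
                } else if (event.isEndElement()) {
                    if (content != null) {
                        // No end tag seen since the last start tag: a leaf.
                        leaves.add(content.toString());
                    }
                    // In all cases, discard content.
                    content = null;
                } else if (event.isCharacters() && content != null) {
                    content.append(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return leaves;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document, made up for this example.
        String xml = "<root><a>one</a><b><c>two</c><d/></b></root>";
        System.out.println(leafTexts(xml));
    }
}
```

Note that with this approach an empty element such as <d/> is also reported as a leaf, with an empty string as its text. If that is not what you want, check the length of content before using it.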
Upvotes: 5