Fixpoint

Reputation: 9860

getCharacterOffset() returns incorrect value

I'm using StAX to parse an XML file and would like to know where each tag starts and ends. For that I'm trying to use getLocation().getCharacterOffset(), but it returns incorrect values for every tag after the first.

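import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

XMLInputFactory factory = XMLInputFactory.newInstance();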
XMLEventReader reader = factory.createXMLEventReader(
        new StringReader("<root>txt1<tag>txt2</tag></root>"));

XMLEvent e;
e = reader.nextEvent(); // START_DOCUMENT
System.out.println(e);
System.out.println(e.getLocation());
e = reader.nextEvent(); // START_ELEMENT "root"
System.out.println(e);
System.out.println(e.getLocation());
e = reader.nextEvent(); // CHARACTERS "txt1"
System.out.println(e);
System.out.println(e.getLocation());
e = reader.nextEvent(); // START_ELEMENT "tag"
System.out.println(e);
System.out.println(e.getLocation());

The code above prints this:

<?xml version="null" encoding='null' standalone='no'?>
Line number = 1
Column number = 1
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 0

<root>
Line number = 1
Column number = 7
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 6

txt1
Line number = 1
Column number = 12
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 11

<tag>
Line number = 1
Column number = 16
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 15

After <root> the CharacterOffset is 6, which is correct, but after txt1 it is 11, where I would expect 10. Which offset exactly does it return?

Upvotes: 2

Views: 782

Answers (1)

chris

Reputation: 3563

This is probably a bug/feature of Sun/Oracle's StAX implementation. With Woodstox, the same code reports the offsets 0, 0, 6, 10, which seems to be correct. Download Woodstox from http://wiki.fasterxml.com/WoodstoxHome and add the JARs (woodstox-core and stax2-api) to your classpath. XMLInputFactory will then pick up the Woodstox implementation automatically.
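To confirm which implementation you are actually running, it can help to print the class that XMLInputFactory.newInstance() resolved to next to the offsets it reports. The following is only a minimal sketch: the class name OffsetCheck and the helpers implementationName, describe and typeName are made up for illustration, and the numbers it prints depend on whichever StAX implementation is on your classpath, so no expected output is shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, not part of StAX or Woodstox.
class OffsetCheck {

    // Class name of the StAX implementation the factory lookup resolved to.
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    // Returns one "EVENT_TYPE offset" row per event, in document order.
    static List<String> describe(String xml) {
        List<String> rows = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                rows.add(typeName(e.getEventType()) + " "
                        + e.getLocation().getCharacterOffset());
            }
            reader.close();
        } catch (XMLStreamException ex) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("could not parse the XML", ex);
        }
        return rows;
    }

    // Readable names for the event types this example can produce.
    static String typeName(int type) {
        switch (type) {
            case XMLStreamConstants.START_DOCUMENT: return "START_DOCUMENT";
            case XMLStreamConstants.END_DOCUMENT: return "END_DOCUMENT";
            case XMLStreamConstants.START_ELEMENT: return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT: return "END_ELEMENT";
            case XMLStreamConstants.CHARACTERS: return "CHARACTERS";
            default: return "EVENT_" + type;
        }
    }

    public static void main(String[] args) {
        System.out.println("StAX implementation: " + implementationName());
        for (String row : describe("<root>txt1<tag>txt2</tag></root>")) {
            System.out.println(row);
        }
    }
}
```

Run it once with only the JDK on the classpath and once with the Woodstox JARs added. If the first line of output does not change, the factory lookup did not pick up Woodstox and the offsets will not change either.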

Upvotes: 2
