Reputation: 40333
I have the following invalid XML file:
<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="90, 754.639, 120.038, 12">
<Word box="90, 754.639, 22.6704, 12">This</Word>
</Line>
</Para>
</Flow>
</Page>
<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="90, 754.639, 120.038, 12">
<Word box="90, 754.639, 22.6704, 12">This</Word>
</Line>
</Para>
</Flow>
</Page>
While the file as a whole is not well-formed (it has two root elements and the XML declaration appears twice), its contents can still be parsed sensibly (i.e. the tags and the content are correct).
So the question is: is there a StAX (or any other streaming) XML parser in Java that would let me parse this file? I have checked all the options in XMLInputFactory, but none of them seems to make the parser accept this kind of malformed XML.
Upvotes: 0
Views: 508
Reputation: 1199
I have written a parse method that returns a list of Message objects (Message is my own class describing the RSS contents I need to filter out).
My method is as follows:
@Override
public List<Message> parse() {
    // Reused for every <item>; a copy is stored when the item ends.
    final Message currentMessage = new Message();
    RootElement root = new RootElement(RSS);
    final List<Message> message = new ArrayList<Message>();
    Element channel = root.getChild(CHANNEL);
    Element item = channel.getChild(ITEM);
    // When an <item> closes, save a snapshot of the fields collected so far.
    item.setEndElementListener(new EndElementListener() {
        @Override
        public void end() {
            message.add(currentMessage.copy());
        }
    });
    item.getChild(TITLE).setEndTextElementListener(new EndTextElementListener() {
        @Override
        public void end(String body) {
            currentMessage.setTitle(body);
        }
    });
    item.getChild(LINK).setEndTextElementListener(new EndTextElementListener() {
        @Override
        public void end(String body) {
            currentMessage.setLink(body);
        }
    });
    item.getChild(DESCRIPTION).setEndTextElementListener(new EndTextElementListener() {
        @Override
        public void end(String body) {
            currentMessage.setDescription(body);
        }
    });
    item.getChild(PUB_DATE).setEndTextElementListener(new EndTextElementListener() {
        @Override
        public void end(String body) {
            currentMessage.setDate(body);
        }
    });
    /*item.getChild(IMAGE).setEndTextElementListener(new EndTextElementListener() {
        public void end(String body) {
            currentMessage.setImage(body);
        }
    });*/
    try {
        Xml.parse(this.getInputStream(), Xml.Encoding.UTF_8, root.getContentHandler());
    } catch (IOException e) {
        // Reading the feed failed.
        e.printStackTrace();
    } catch (SAXException e) {
        // The feed is not well-formed XML.
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return message;
}
Hope this helps
Upvotes: 0
Reputation: 310957
Just write yourself a FilterReader or FilterInputStream derived class that returns EOF once when it sees a new XML header.
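A minimal sketch of that idea, under some stated assumptions: the class name XmlBoundaryReader and its nextDocument(), closeSource() and elementNames() methods are hypothetical, not part of any library. It extends Reader directly rather than FilterReader so that skip() and mark() cannot bypass the boundary check, and it treats the six characters "<?xml " (with the trailing space) as the header. It also assumes the input starts directly with a declaration and that the marker never appears inside a comment or CDATA section. close() is deliberately a no-op in case the parser closes its input when a document ends.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: reports end-of-stream just before each new XML declaration. */
class XmlBoundaryReader extends Reader {
    private static final char[] MARKER = "<?xml ".toCharArray();
    private final PushbackReader source;
    private boolean atBoundary; // stopped just before a declaration
    private long consumed;      // characters returned for the current document

    XmlBoundaryReader(Reader in) {
        this.source = new PushbackReader(in, MARKER.length);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = readOne();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    private int readOne() throws IOException {
        if (atBoundary) return -1;
        int c = source.read();
        // A declaration at the very start of a document is not a boundary.
        if (c == '<' && consumed > 0 && markerFollows()) {
            source.unread('<');
            atBoundary = true;
            return -1;
        }
        if (c != -1) consumed++;
        return c;
    }

    /** Peeks at the characters after '<' and pushes them back again. */
    private boolean markerFollows() throws IOException {
        char[] seen = new char[MARKER.length - 1];
        int n = 0;
        boolean match = true;
        while (match && n < seen.length) {
            int c = source.read();
            if (c == -1) { match = false; break; }
            seen[n++] = (char) c;
            match = (c == MARKER[n]);
        }
        source.unread(seen, 0, n);
        return match;
    }

    /** Skips the rest of the current document; returns false at the real end of input. */
    boolean nextDocument() throws IOException {
        while (readOne() != -1) { /* discard anything the parser left unread */ }
        if (!atBoundary) return false;
        atBoundary = false;
        consumed = 0;
        return true;
    }

    @Override
    public void close() { /* no-op so a parser cannot close the shared source */ }

    void closeSource() throws IOException {
        source.close();
    }

    /** Parses every document with StAX and collects the start-element names in order. */
    static List<String> elementNames(Reader concatenated) {
        List<String> names = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XmlBoundaryReader docs = new XmlBoundaryReader(concatenated);
        try {
            do {
                XMLStreamReader xml = factory.createXMLStreamReader(docs);
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT) names.add(xml.getLocalName());
                }
                xml.close();
            } while (docs.nextDocument());
            docs.closeSource();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("could not parse concatenated XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String one = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<Page num=\"1\"><Flow id=\"1\"/></Page>\n";
        System.out.println(elementNames(new StringReader(one + one)));
    }
}
```

Each StAX reader sees exactly one document, and nextDocument() moves on to the next one. The reader works one character at a time, so wrapping the underlying source in a BufferedReader would probably be sensible for large files.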
Upvotes: 1
Reputation: 53694
I seriously doubt you will be able to get any standard Java tool to parse the documents as is. However, you could find the boundaries yourself and parse the individual documents separately: just look for occurrences of "<?xml".
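A rough sketch of that approach, assuming the whole file fits in memory as a String. The class ConcatenatedXmlSplitter and its methods are hypothetical names. The marker includes a trailing space ("<?xml ") so that processing instructions such as <?xml-stylesheet are not mistaken for a boundary. Anything before the first declaration is dropped, and a marker inside a comment or CDATA section would split in the wrong place.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: splits concatenated XML documents at each declaration. */
class ConcatenatedXmlSplitter {
    private static final String DECL = "<?xml ";

    /** Returns one string per document; each starts at an XML declaration. */
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int start = text.indexOf(DECL);
        if (start < 0) {
            // No declaration at all: treat the input as a single document.
            if (!text.trim().isEmpty()) docs.add(text.trim());
            return docs;
        }
        while (start >= 0) {
            int next = text.indexOf(DECL, start + DECL.length());
            String doc = (next < 0) ? text.substring(start) : text.substring(start, next);
            docs.add(doc.trim());
            start = next;
        }
        return docs;
    }

    /** Parses one document with StAX and returns its start-element names in order. */
    static List<String> elementNames(String doc) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(doc));
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(xml.getLocalName());
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("document is not well-formed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String one = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<Page num=\"1\"><Flow id=\"1\"/></Page>\n";
        for (String doc : split(one + one)) {
            System.out.println(elementNames(doc));
        }
    }
}
```

This is the simplest variant and it is not streaming; for files too large to hold in memory, the boundaries would have to be detected while reading instead.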
Upvotes: 2