Maurício Linhares

Reputation: 40333

Can I have a less validating StAX parser in Java?

I have the following invalid XML file:

<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
    <Flow id="1">
        <Para id="1">
            <Line box="90, 754.639, 120.038, 12">
                <Word box="90, 754.639, 22.6704, 12">This</Word>
            </Line>
        </Para>
    </Flow>
</Page>
<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
    <Flow id="1">
        <Para id="1">
            <Line box="90, 754.639, 120.038, 12">
                <Word box="90, 754.639, 22.6704, 12">This</Word>
            </Line>
        </Para>
    </Flow>
</Page>

While it is structurally invalid (it has two root elements and the XML prologue appears twice), each part can still be parsed correctly on its own (i.e. the tags and the content are correct).

So the question is: is there a StAX (or any other streaming-based) XML parser in Java that would let me parse this? I have checked all the options in XMLInputFactory, but none of them seems to make the parser accept this kind of malformed XML.

Upvotes: 0

Views: 508

Answers (3)

Basimalla Sebastin

Reputation: 1199

I have written a parse method that returns a list of Message objects. Message is my own class; it holds the parts of the RSS content that I need to filter out. The method uses Android's android.sax listener classes.

My method is as follows:

    @Override
    public List<Message> parse() {
        // Scratch object that the listeners below fill in for each item.
        final Message currentMessage = new Message();
        final List<Message> messages = new ArrayList<Message>();

        RootElement root = new RootElement(RSS);
        Element channel = root.getChild(CHANNEL);
        Element item = channel.getChild(ITEM);

        // At the end of each item, store a copy of the populated message.
        item.setEndElementListener(new EndElementListener() {
            @Override
            public void end() {
                messages.add(currentMessage.copy());
            }
        });

        // Copy the text of each child element into the current message.
        item.getChild(TITLE).setEndTextElementListener(new EndTextElementListener() {
            @Override
            public void end(String body) {
                currentMessage.setTitle(body);
            }
        });
        item.getChild(LINK).setEndTextElementListener(new EndTextElementListener() {
            @Override
            public void end(String body) {
                currentMessage.setLink(body);
            }
        });
        item.getChild(DESCRIPTION).setEndTextElementListener(new EndTextElementListener() {
            @Override
            public void end(String body) {
                currentMessage.setDescription(body);
            }
        });
        item.getChild(PUB_DATE).setEndTextElementListener(new EndTextElementListener() {
            @Override
            public void end(String body) {
                currentMessage.setDate(body);
            }
        });
        /*item.getChild(IMAGE).setEndTextElementListener(new EndTextElementListener() {
            public void end(String body) {
                currentMessage.setImage(body);
            }
        });*/

        // Run the SAX parse; the listeners above fire as elements are closed.
        try {
            Xml.parse(this.getInputStream(), Xml.Encoding.UTF_8, root.getContentHandler());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }

        return messages;
    }

Hope this helps

Upvotes: 0

user207421

Reputation: 310957

Just write yourself a class derived from FilterReader or FilterInputStream that returns EOF once each time it sees a new XML header, so that the parser treats each document as a separate stream.
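A rough sketch of what such a reader might look like, in case it helps. The class name DocumentBoundaryReader and its nextDocument() method are made up for illustration and are not part of any library. It assumes the input starts directly with a document and that "<?xml" only ever appears as a document header (a "<?xml-stylesheet" instruction, or the same text inside a comment or CDATA section, would be mistaken for a boundary).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: reports EOF in front of every "<?xml" after the first one. */
public class DocumentBoundaryReader extends FilterReader {
    private static final String MARKER = "<?xml";
    private final PushbackReader pushback;
    private boolean atStart = true;      // nothing of the current document returned yet
    private boolean atBoundary = false;  // stopped in front of the next "<?xml"
    private boolean exhausted = false;   // the underlying reader reached its real end

    public DocumentBoundaryReader(Reader source) {
        super(new PushbackReader(source, MARKER.length()));
        this.pushback = (PushbackReader) in;
    }

    @Override
    public int read() throws IOException {
        if (atBoundary || exhausted) return -1;
        int c = pushback.read();
        if (c < 0) { exhausted = true; return -1; }
        if (c == '<' && !atStart && restIsMarker()) {
            pushback.unread(c);          // leave the header for the next document
            atBoundary = true;
            return -1;
        }
        atStart = false;
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c < 0) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n && read() >= 0) skipped++;
        return skipped;
    }

    /** No-op, so a parser that closes its input cannot close the shared source. */
    @Override
    public void close() { }

    /** Discards the rest of the current document; true if another one follows. */
    public boolean nextDocument() throws IOException {
        while (read() >= 0) { /* skip anything left unread */ }
        if (exhausted) return false;
        atBoundary = false;
        atStart = true;
        return true;
    }

    /** Peeks at the characters after '<' and pushes them back again. */
    private boolean restIsMarker() throws IOException {
        char[] buf = new char[MARKER.length() - 1];
        int n = 0;
        while (n < buf.length) {
            int c = pushback.read();
            if (c < 0) break;
            buf[n++] = (char) c;
        }
        pushback.unread(buf, 0, n);
        return n == buf.length && new String(buf).equals(MARKER.substring(1));
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        String doc = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
                + "<Page num=\"1\"><Word>This</Word></Page>\n";
        DocumentBoundaryReader reader = new DocumentBoundaryReader(new StringReader(doc + doc));
        XMLInputFactory factory = XMLInputFactory.newInstance();
        do {
            // One fresh StAX reader per document, all fed from the same source.
            XMLStreamReader xml = factory.createXMLStreamReader(reader);
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println(xml.getLocalName());
                }
            }
            xml.close();
        } while (reader.nextDocument());
    }
}
```

Reading one character at a time keeps the boundary check simple at the cost of speed; wrapping the source in a BufferedReader before passing it in should soften that.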

Upvotes: 1

jtahlborn

Reputation: 53694

I seriously doubt you will be able to get any standard Java tool to parse the documents as is. However, you could find the boundaries yourself and parse the individual documents separately: just look for occurrences of "<?xml".
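For input small enough to hold in memory, that could look roughly like the sketch below. The class and method names (ConcatenatedXmlSplitter, splitDocuments, elementNames) are invented for this example, and it assumes "<?xml" never occurs except as a document header, so a "<?xml-stylesheet" instruction or the same text inside CDATA would split a document in the wrong place.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ConcatenatedXmlSplitter {
    private static final String PROLOGUE = "<?xml";

    /** Cuts the text at every "<?xml" and returns the non-empty pieces. */
    public static List<String> splitDocuments(String text) {
        List<String> docs = new ArrayList<>();
        int start = text.indexOf(PROLOGUE);
        while (start >= 0) {
            int next = text.indexOf(PROLOGUE, start + PROLOGUE.length());
            String doc = (next < 0 ? text.substring(start) : text.substring(start, next)).trim();
            if (!doc.isEmpty()) {
                docs.add(doc);
            }
            start = next;
        }
        return docs;
    }

    /** Parses one document with StAX and collects the names of its start elements. */
    public static List<String> elementNames(String doc) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(doc));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
                + "<Page num=\"1\"><Word>This</Word></Page>\n";
        // Two concatenated documents, as in the question.
        for (String part : splitDocuments(doc + doc)) {
            System.out.println(elementNames(part));
        }
    }
}
```

For large files you would want to scan for the boundaries while streaming rather than loading everything into a String first.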

Upvotes: 2
