Reputation: 1420
I have a big text file that is a sequence of XML-valid documents that looks something like this:
<DOC>
<TEXT> ... </TEXT>
...
</DOC>
<DOC>
<TEXT> ... </TEXT>
...
</DOC>
etc. There is no <?xml version="1.0"?> declaration, and each <DOC></DOC> pair delimits a separate XML document. What's the best way to parse this in Java and get the values under <TEXT> in each <DOC>?
If I pass the whole thing to a DocumentBuilder, I get an error saying the document is not well formed. Is there a better solution than simply traversing the file and building a string for each <DOC>?
Upvotes: 1
Views: 6588
Reputation: 758
A well-formed XML document must have exactly one root element, which contains all other elements. Have a look at the XML Specification (see point 2).
To overcome your issue, you can read the whole content of your text file into a String (or StringBuffer/StringBuilder) and put that string between <root> and </root> tags, e.g.:
String origXML = readContentFromTextFile(fileName);
String validXML = "<root>" + origXML + "</root>";
//parse validXML
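To make the "parse validXML" step concrete, here is a minimal sketch, assuming the content fits in memory and that a DOM parse is acceptable. The class name WrappedXmlExample and the method extractTexts are hypothetical names chosen for illustration; only the JDK's built-in javax.xml.parsers and org.w3c.dom APIs are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class WrappedXmlExample {

    // Wraps the <DOC> sequence in a synthetic <root> element, parses it,
    // and returns the trimmed text content of every <TEXT> element.
    static List<String> extractTexts(String origXML) throws Exception {
        String validXML = "<root>" + origXML + "</root>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(validXML)));

        NodeList nodes = doc.getElementsByTagName("TEXT");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<DOC><TEXT> first </TEXT></DOC>\n<DOC><TEXT> second </TEXT></DOC>";
        System.out.println(extractTexts(sample)); // prints [first, second]
    }
}
```

Because this builds the whole DOM tree in memory, it is probably only suitable when the file is not too large.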
Upvotes: 5
Reputation: 32831
You could create a subclass of InputStream that adds a prefix and a suffix to the input stream, and pass an instance of that class to any XML parser:
import java.io.IOException;
import java.io.InputStream;

public class EnclosedInputStream extends InputStream {
    // Which part of the combined stream is currently being read.
    private enum State {
        PREFIX, STREAM, SUFFIX, EOF
    }

    private final byte[] prefix;
    private final InputStream stream;
    private final byte[] suffix;
    private State state = State.PREFIX;
    private int index; // position within prefix or suffix

    EnclosedInputStream(byte[] prefix, InputStream stream, byte[] suffix) {
        this.prefix = prefix;
        this.stream = stream;
        this.suffix = suffix;
    }

    @Override
    public int read() throws IOException {
        if (state == State.PREFIX) {
            if (index < prefix.length) {
                return prefix[index++] & 0xFF;
            }
            state = State.STREAM;
        }
        if (state == State.STREAM) {
            int r = stream.read();
            if (r >= 0) {
                return r;
            }
            // Wrapped stream is exhausted: switch to the suffix.
            state = State.SUFFIX;
            index = 0;
        }
        if (state == State.SUFFIX) {
            if (index < suffix.length) {
                return suffix[index++] & 0xFF;
            }
            state = State.EOF;
        }
        return -1;
    }

    @Override
    public void close() throws IOException {
        // Release the wrapped stream when the parser closes this one.
        stream.close();
    }
}
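For comparison, the same prefix/stream/suffix idea can be sketched with the JDK's own java.io.SequenceInputStream instead of a custom subclass, combined with a SAX handler so that a large file is not loaded into memory. This is a sketch under assumptions, not a definitive implementation: the class name StreamingTextExtractor and the method extractTexts are hypothetical, and the input is assumed to be UTF-8 without a byte order mark.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class StreamingTextExtractor {

    // Streams "<root>" + fragments + "</root>" through a SAX parser and
    // collects the trimmed character content of every <TEXT> element.
    static List<String> extractTexts(InputStream fragments) throws Exception {
        InputStream wrapped = new SequenceInputStream(Collections.enumeration(Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragments,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)))));

        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <TEXT>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("TEXT".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("TEXT".equals(qName) && current != null) {
                    texts.add(current.toString().trim());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(wrapped, handler);
        return texts;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "<DOC><TEXT> first </TEXT></DOC>\n<DOC><TEXT> second </TEXT></DOC>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractTexts(new ByteArrayInputStream(sample))); // prints [first, second]
    }
}
```

In real use, the fragments argument would typically be a FileInputStream over the text file; an EnclosedInputStream instance could be passed to the same SAX parser in place of the SequenceInputStream.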
Upvotes: 0
Reputation: 7809
You'll have a hard time parsing this with a "standard" XML parser such as Xerces. As you mentioned, this XML document is not well-formed. The missing XML declaration <?xml version="1.0"?> is not the real problem (the declaration is optional in XML 1.0); the problem is that the file has more than one document root (i.e. the <DOC> elements).
I suggest you give TagSoup a try. It is intended to parse (quote) "poor, nasty and brutish" markup, although it was written with real-world HTML in mind. No guarantee, but that's probably your best shot.
Upvotes: 1
Reputation: 5155
The document is not well formed because you don't have a 'root' node. With one added, it would look like this:
<ROOT>
<DOC>
<TEXT> ... </TEXT>
...
</DOC>
<DOC>
<TEXT> ... </TEXT>
...
</DOC>
</ROOT>
Upvotes: 2