Reputation: 2575
I have the following issue: my program is passed an InputStream of which I cannot control the contents. I unmarshal my input stream using the javax library, which rightfully throws exceptions if the InputStream includes the & character not followed by "amp;"
The workaround I came up with was to create the following class:
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
/**
* Provide an input stream where all & characters are properly encoded as &
*/
public class FormattedStream extends FilterInputStream {
public FormattedStream(InputStream src) {
super(new ByteArrayInputStream(StringUtil.toString(src)
.replace("&", "&").replace("amp;amp;", "amp;").getBytes()));
}
}
Note: StringUtil is a simple utility I have to turn an input stream into a String.
With that class in place, I now invoke the JAXB unmarshaller with:
unmarshal(new FormattedStream(inputStream));
instead of
unmarshal(inputStream);
This approach works but does seem odd for a few reasons:
1 - Because of the restriction that super must be the first element in the constructor (restriction which I fail to understand despite what I read about it), I am forced to do all my processing in one line, making the code far from readable.
2 - Converting the entire stream into a String and back to a stream seems overkill
3 - The code above is slightly incorrect in that a stream containing amp;amp; will be modified to containing amp;
I could address 1 by providing a FormatInputStream class with one method:
InputStream preProcess(InputStream inputStream)
where I would do the same operations I am currently doing in the constructor of my FormattedStream class but it seems odd to have to choose a different interface because of a coding limitation.
I could address 2 by keeping my FormattedStream constructor simple:
super(src)
and overriding the three read methods but that would involve much more coding: overriding the three read methods by replacing the & on the fly is not trivial compared to the one-line of code I currently have where I can leverage the replaceAll String method.
As for 3, it seems enough of a corner case that I don't worry about it but maybe I should...
Any suggestions on how to solve my issue in a more elegant way?
Upvotes: 2
Views: 4027
Reputation: 14222
I agree with McDowell's answer that the most important thing is to fix the invalid data source in the first place.
Anyway, here is an InputStream
which looks for lonely &
characters and marries them with an additional amp;
in case it's missing. Again, fixing broken data this way does not pay off most of the time.
This solution fixes the three flaws mentioned in the OP and shows only one way to implement transforming InputStreams.
&
is followed by amp;
or not.&
, and does not try to clean up amp;amp;
in any way, because they don't happen with this solution..
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;
public class ReplacerInputStream extends InputStream {
private static final byte[] REPLACEMENT = "amp;".getBytes();
private final byte[] readBuf = new byte[REPLACEMENT.length];
private final Deque<Byte> backBuf = new ArrayDeque<Byte>();
private final InputStream in;
public ReplacerInputStream(InputStream in) {
this.in = in;
}
@Override
public int read() throws IOException {
if (!backBuf.isEmpty()) {
return backBuf.pop();
}
int first = in.read();
if (first == '&') {
peekAndReplace();
}
return first;
}
private void peekAndReplace() throws IOException {
int read = super.read(readBuf, 0, REPLACEMENT.length);
for (int i1 = read - 1; i1 >= 0; i1--) {
backBuf.push(readBuf[i1]);
}
for (int i = 0; i < REPLACEMENT.length; i++) {
if (read != REPLACEMENT.length || readBuf[i] != REPLACEMENT[i]) {
for (int j = REPLACEMENT.length - 1; j >= 0; j--) {
// In reverse order
backBuf.push(REPLACEMENT[j]);
}
return;
}
}
}
}
The code has been tested with the following input data (first parameter is expected output, second parameter is raw input):
test("Foo & Bar", "Foo & Bar");
test("&&&", "&&&");
test("&&& ", "&&& ");
test(" &&&", " &&&");
test("&", "&");
test("&", "&");
test("&&", "&&");
test("&&&", "&&&");
test("test", "test");
test("", "");
test("testtesttest&", "testtesttest&");
Upvotes: 3
Reputation: 108949
To avoid reading all the data into RAM, you could implement a FilterInputStream
(you would have to override both read()
and read(byte[],int,int)
and look at buffering those extra bytes somehow. This will not result in shorter code.
The real solution is to fix the invalid data source (and if you're going to automate that, you need to look at writing your own XML parser).
Your approach has a few flaws.
String.getBytes()
is system dependent; it is also a transcoding operation that may not be symmetrical with whatever StringUtil.toString
does - default encodings on many systems are lossy. You should perform the transcoding using the XML document encoding.Upvotes: 0