double07
double07

Reputation: 2575

Right way of formatting an input stream

I have the following issue: my program is passed an InputStream of which I cannot control the contents. I unmarshal my input stream using the javax library, which rightfully throws exceptions if the InputStream includes the & character not followed by "amp;"

The workaround I came up with was to create the following class:

import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

/**
 * Provide an input stream where all & characters are properly encoded as &
 */
public class FormattedStream extends FilterInputStream {
  public FormattedStream(InputStream src) {
    super(new ByteArrayInputStream(StringUtil.toString(src)
      .replace("&", "&").replace("amp;amp;", "amp;").getBytes()));
  }
}

Note: StringUtil is a simple utility I have to turn an input stream into a String.

With that class in place, I now invoke the JAXB unmarshaller with:

unmarshal(new FormattedStream(inputStream));

instead of

unmarshal(inputStream);

This approach works but does seem odd for a few reasons:

1 - Because of the restriction that super must be the first element in the constructor (restriction which I fail to understand despite what I read about it), I am forced to do all my processing in one line, making the code far from readable.

2 - Converting the entire stream into a String and back to a stream seems overkill

3 - The code above is slightly incorrect in that a stream containing amp;amp; will be modified to containing amp;

I could address 1 by providing a FormatInputStream class with one method:

InputStream preProcess(InputStream inputStream)

where I would do the same operations I am currently doing in the constructor of my FormattedStream class but it seems odd to have to choose a different interface because of a coding limitation.

I could address 2 by keeping my FormattedStream constructor simple:

super(src)

and overriding the three read methods but that would involve much more coding: overriding the three read methods by replacing the & on the fly is not trivial compared to the one-line of code I currently have where I can leverage the replaceAll String method.

As for 3, it seems enough of a corner case that I don't worry about it but maybe I should...

Any suggestions on how to solve my issue in a more elegant way?

Upvotes: 2

Views: 4027

Answers (2)

mhaller
mhaller

Reputation: 14222

I agree with McDowell's answer that the most important thing is to fix the invalid data source in the first place.

Anyway, here is an InputStream which looks for lonely & characters and marries them with an additional amp; in case it's missing. Again, fixing broken data this way does not pay off most of the time.

This solution fixes the three flaws mentioned in the OP and shows only one way to implement transforming InputStreams.

  • Within the constructor, only the reference to the original InputStream is held. No processing takes place in the constructor, until the stream is really asked for data (by calls to read()).
  • The contents is not transformed to a large single String for transformation. Instead, the stream works as a stream and only performs minimal read-ahead (e.g. the four bytes necessary to find out whether & is followed by amp; or not.
  • The stream only replaces lonely &, and does not try to clean up amp;amp; in any way, because they don't happen with this solution.

.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;

public class ReplacerInputStream extends InputStream {

private static final byte[] REPLACEMENT = "amp;".getBytes();
    private final byte[] readBuf = new byte[REPLACEMENT.length];
    private final Deque<Byte> backBuf = new ArrayDeque<Byte>();
    private final InputStream in;

    public ReplacerInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (!backBuf.isEmpty()) {
            return backBuf.pop();
        }
        int first = in.read();
        if (first == '&') {
            peekAndReplace();
        }
        return first;
    }

    private void peekAndReplace() throws IOException {
        int read = super.read(readBuf, 0, REPLACEMENT.length);
        for (int i1 = read - 1; i1 >= 0; i1--) {
            backBuf.push(readBuf[i1]);
        }
        for (int i = 0; i < REPLACEMENT.length; i++) {
            if (read != REPLACEMENT.length || readBuf[i] != REPLACEMENT[i]) {
                for (int j = REPLACEMENT.length - 1; j >= 0; j--) {
                    // In reverse order
                    backBuf.push(REPLACEMENT[j]);
                }
                return;
            }
        }
    }

}

The code has been tested with the following input data (first parameter is expected output, second parameter is raw input):

    test("Foo &amp; Bar", "Foo & Bar");
    test("&amp;&amp;&amp;", "&&&");
    test("&amp;&amp;&amp; ", "&&& ");
    test(" &amp;&amp;&amp;", " &&&");
    test("&amp;", "&");
    test("&amp;", "&amp;");
    test("&amp;&amp;", "&amp;&amp;");
    test("&amp;&amp;&amp;", "&amp;&&amp;");
    test("test", "test");
    test("", "");
    test("testtesttest&amp;", "testtesttest&");

Upvotes: 3

McDowell
McDowell

Reputation: 108949

To avoid reading all the data into RAM, you could implement a FilterInputStream (you would have to override both read() and read(byte[],int,int) and look at buffering those extra bytes somehow. This will not result in shorter code.


The real solution is to fix the invalid data source (and if you're going to automate that, you need to look at writing your own XML parser).

Your approach has a few flaws.

  • The result of String.getBytes() is system dependent; it is also a transcoding operation that may not be symmetrical with whatever StringUtil.toString does - default encodings on many systems are lossy. You should perform the transcoding using the XML document encoding.
  • A global search-and-replace like this may corrupt your document - ampersands can exist in CDATA, entities and entity declarations.

Upvotes: 0

Related Questions