Streaming an upload with HttpClient/MultipartEntity

Question

I've got a Tomcat instance right now that takes uploads and does some processing work on the data.

I want to replace this with a new servlet that conforms to a similar API. At first, I want this new servlet to just proxy all of the requests to the old one. They're running on separate JVMs, but on the same host.

I've been trying to use the HttpClient to proxy the upload, but it seems that the client waits for the stream to finish before it proxies the request. For large files, this causes the servlet to crash (I think it's buffering everything in memory).

Here's the code I'm currently using:

HttpPost httpPost = new HttpPost("http://localhost:8081/servlet");
String filePartName = request.getHeader("file_part_name");

_logger.info("Attaching file " + filePartName);

try {
    Part filePart = request.getPart(filePartName);

    MultipartEntity mpe = new MultipartEntity();
    mpe.addPart(
        filePartName,
        new InputStreamBody(filePart.getInputStream(), filePartName)
    );

    httpPost.setEntity(mpe);
} catch (ServletException | IOException e) {
    _logger.error("Caught exception trying to cross the streams, thanks Ghostbusters.", e);
    throw new IllegalStateException("Could not proxy the request", e);
}

HttpResponse postResponse;
try {
    postResponse = HTTP_CLIENT.execute(httpPost);
} catch (IOException e) {
    _logger.error("Caught exception trying to cross the streams, thanks Ghostbusters.", e);
    throw new IllegalStateException("Could not proxy the request", e);
}

I can't seem to figure out how to get HttpClient/HttpPost to stream the data as it comes in, instead of blocking until the first upload completes. Has anyone done something similar before? Is there an easier solution?

Thanks!

GPI · Accepted Answer

The issue lies in the way your request is processed by the MIME/multipart framework (the one that parses your HttpServletRequest and gives you access to the file parts).

The nature of a MIME/multipart request is simple at a high level: instead of traditional key=value content, these requests use a more complex syntax that lets them carry arbitrary, unstructured data (such as files to upload). It basically looks like this (taken from Wikipedia):

Content-type: multipart/mixed; boundary="'''frontier'''"

This is a multi-part message in MIME format.
--'''frontier'''
Content-type: text/plain

This is the body of the message.

--'''frontier'''
Content-type: application/octet-stream
Content-Disposition: form-data; name="image1"
Content-transfer-encoding: base64

PGh0bWw+CiAgPGhlYWQ+CiAgPC9oZWFkPgogIDxib2R5PgogICAgPHA+VGhpcyBpcyB0aGUg
Ym9keSBvZiB0aGUgbWVzc2FnZS48L3A+CiAgPC9ib2R5Pgo8L2h0bWw+Cg==

--'''frontier'''--

The important thing to note is that the parts (separated here by the boundary '''frontier''') have "names" (through the Content-Disposition header), each followed by its content. One such request can have any number of parts.
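
To make those two pieces concrete, here is a rough sketch of pulling the boundary out of a Content-Type value and the part name out of a Content-Disposition value. The class and method names are made up for illustration, and the regex only covers simple quoted or unquoted parameters (no escaped quotes, no RFC 2231 encodings), so treat it as a reading aid rather than a real MIME parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any servlet or HttpClient API.
final class MimeHeaderParams {

    /**
     * Returns the value of one ";name=value" parameter from a header value,
     * or null if the parameter is absent. Surrounding quotes are stripped.
     */
    static String param(String headerValue, String name) {
        Pattern p = Pattern.compile(
            ";\\s*" + Pattern.quote(name) + "\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(headerValue);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the quoted form, group 2 the bare-token form.
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        String contentType = "multipart/mixed; boundary=\"'''frontier'''\"";
        String disposition = "form-data; name=\"image1\"";
        System.out.println(param(contentType, "boundary"));
        System.out.println(param(disposition, "name"));
    }
}
```

With the sample message above, the first call yields the boundary string and the second yields image1, which is the "name" a servlet would pass to getPart.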

Now of course, the simplest, most straightforward way to parse such a request is to read it to the end, detecting each boundary and creating a temporary file (or in-memory cache) to hold each part, identified by name.

Since the framework cannot know which part you will need first (your servlet may ask for the second part before the first), it parses the whole stream and only then gives control back to you.

Therefore your call is blocked at this line:

Part filePart = request.getPart(filePartName);

Here, the framework has to parse the whole MIME message before letting you use the result. Even a hypothetical, super-optimised parser could not both parse the stream lazily and give you random access to any part of the message; you have to choose one or the other.

So there's not much you can do...

Except not use the multipart parser at all. I wouldn't recommend this unless you are familiar with MIME (and/or MIME libraries such as Apache James) and confident that you control your request's structure.

But if you are, then you may bypass the framework's processing and access the raw stream of the request. You'd parse the MIME structure by hand, stop when you hit the start of the file part's body, and start building your HTTP POST at that point, being careful to handle MIME-level technicalities (base64 decoding? gunzipping? ...).
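
As a rough illustration of that "parse by hand and stop at the body" idea, the sketch below reads the raw stream one byte at a time, skips to the first boundary line, collects the first part's headers, and returns with the stream positioned on the first byte of that part's content. Everything here is an assumption for illustration: the class and method names are invented, the file is assumed to be the first part, header folding is ignored, and the caller would still have to detect the closing boundary at the end of the content and undo any Content-Transfer-Encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the servlet API or HttpClient.
final class FirstPartHeaders {

    /** Reads one line byte by byte so nothing beyond the line is consumed; null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                line.write(b);
            }
        }
        return sawAny ? new String(line.toByteArray(), StandardCharsets.ISO_8859_1) : null;
    }

    /**
     * Skips any preamble and the first boundary line, then reads the first
     * part's headers (keys lower-cased). On return, the next byte read from
     * the stream is the first byte of that part's body.
     */
    static Map<String, String> readFirstPartHeaders(InputStream in, String boundary) throws IOException {
        String delimiter = "--" + boundary;
        String line;
        while ((line = readLine(in)) != null && !line.equals(delimiter)) {
            // preamble text before the first boundary is ignored
        }
        if (line == null) {
            throw new IOException("Boundary not found: " + boundary);
        }
        Map<String, String> headers = new LinkedHashMap<>();
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String body = "--frontier\r\n"
            + "Content-Disposition: form-data; name=\"image1\"\r\n"
            + "Content-Type: application/octet-stream\r\n"
            + "\r\n"
            + "PAYLOAD\r\n"
            + "--frontier--\r\n";
        InputStream in = new ByteArrayInputStream(body.getBytes(StandardCharsets.ISO_8859_1));
        Map<String, String> headers = readFirstPartHeaders(in, "frontier");
        System.out.println(headers.get("content-disposition"));
        System.out.println((char) in.read()); // first byte of the part body
    }
}
```

In a servlet, the stream would come from request.getInputStream(), and whatever remains after the headers could be handed to the outgoing POST. Reading the raw stream only works if the container's multipart parser has not already consumed the request.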

Alternatively, if you think your server crashes because it runs out of memory, it may well be that your framework is configured to cache the contents of the multipart request in memory. If there is a way to configure it to cache to disk instead, that is a possible workaround.
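
If the servlet relies on the Servlet 3.0 multipart support (which the use of request.getPart suggests), the memory-versus-disk threshold is usually set through @MultipartConfig or the equivalent multipart-config element in web.xml. A minimal sketch, assuming the javax.servlet API; the class name, URL pattern, directory, and sizes are invented values for illustration:

```java
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;

// Hypothetical servlet; only the annotation attributes matter here.
@WebServlet("/upload")
@MultipartConfig(
    location = "/var/tmp/uploads",          // assumed directory for spooled parts
    fileSizeThreshold = 1024 * 1024,         // parts larger than 1 MiB are written to disk
    maxFileSize = 2L * 1024 * 1024 * 1024    // reject single files over 2 GiB
)
public class ProxyUploadServlet extends HttpServlet {
    // doPost(...) would call request.getPart(...) as in the question
}
```

This only moves the buffering from the heap to disk; getPart would still return only after the whole upload has been received.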
