Reputation: 8077

Efficient algo / way to parse (without any framework) a multipart/form-data request without reading everything to memory?

My question is simple: I want to write to disk a big file upload as it is arriving. I have two big files being uploaded by the same multipart/form-data form. How do I detect the end of file, in other words, how do I detect the boundary ------WebKitFormBoundaryuFPBAbBHzPMrZn8g in the middle of the arriving bytes?

Having the length of each uploaded file would solve this problem completely, but the HTTP request does not provide it (only the total Content-Length, not the length of the individual files being uploaded).

So what's the logic/strategy/algo to detect the boundary while I'm writing the bytes to disk? Of course I don't want to write the boundary to disk as if it were part of the file; I have to detect it and stop writing. Note that I cannot load the whole file into memory before I start writing to disk, although that would make the problem much easier.

Here is the format of a multipart/form-data with two files:

POST / HTTP/1.1
Host: localhost:8000
Connection: keep-alive
Content-Length: 362
Cache-Control: max-age=0
Origin: null
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryuFPBAbBHzPMrZn8g
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8,pt;q=0.6

------WebKitFormBoundaryuFPBAbBHzPMrZn8g
Content-Disposition: form-data; name="file1"; filename="binary.dat"
Content-Type: application/octet-stream

aωb
------WebKitFormBoundaryuFPBAbBHzPMrZn8g
Content-Disposition: form-data; name="file2"; filename="binary.dat"
Content-Type: application/octet-stream

aωb
------WebKitFormBoundaryuFPBAbBHzPMrZn8g--

Upvotes: 4

Answers (2)

Sapient Hetero

Reputation: 36

João provides two excellent suggestions, both of which account for the case where the boundary ends up partially in one buffer and partially in the next. Note that no matter how you implement this parser, your code will end up comparing every byte read to a boundary character AT LEAST once. The best design, IMHO, is one that makes sure you compare each byte exactly once.

The max boundary length (N) is 70, according to RFC1341 (https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html). This doesn't include the leading "--" that precedes every instance of the boundary in the body, nor the "--" that follows the final boundary that marks the end of the body. The standard also tells us to remove any trailing spaces at the end of the boundary in the Content-Type specification, though I've yet to find a browser that puts any there.
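
To make the rules above concrete, here is a minimal Python sketch of turning the Content-Type header into the byte string to look for in the body. The function name `delimiter_from_content_type` and the constant `MAX_BOUNDARY_LEN` are hypothetical names of mine, and the header parsing is deliberately simplistic (it assumes the parameters are separated by ";" and does not handle every quoting case).

```python
MAX_BOUNDARY_LEN = 70  # maximum boundary length per the RFC, excluding the leading "--"


def delimiter_from_content_type(content_type: str) -> bytes:
    """Return b"--" + boundary, the byte string that starts each part in the body."""
    # Skip the media type itself and look at each "name=value" parameter.
    for piece in content_type.split(";")[1:]:
        name, _, value = piece.strip().partition("=")
        if name.strip().lower() == "boundary":
            # Drop optional quotes, then any trailing spaces, as the standard asks.
            boundary = value.strip().strip('"').rstrip(" ")
            if not 1 <= len(boundary) <= MAX_BOUNDARY_LEN:
                raise ValueError("boundary must be 1 to 70 characters")
            return b"--" + boundary.encode("ascii")
    raise ValueError("no boundary parameter in Content-Type")
```

For the request in the question this yields a 40-byte delimiter (38 boundary characters plus the leading "--"). The final delimiter that closes the body is the same bytes followed by another "--".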

I implemented my multipart/form-data parser in a way very similar to João's second suggestion, except that if the Content-Disposition includes a filename, my parser only stores characters that match the boundary. If a char is received that doesn't match the next boundary char when a partial match is in progress, I write the stored chars to the file and reset the buffer index to 0. If files were the only thing that could be received, my parser's buffer would only need to be as long as the max boundary length.

In practice, though, other form fields may be received, such as a "text area", with values much longer than 70 chars. Since the standards place no limit on their size, form fields may determine your buffer size rather than file uploads, depending on how you approach them and what resources your target system provides. My implementation stores non-file data in the same buffer in which the boundary match is accumulated, and extracts the value once the entire boundary string is matched.
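
As a rough illustration of the file-versus-field distinction, the sketch below reads a part's header block, checks whether Content-Disposition carries a filename, and offers a capped in-memory buffer for ordinary fields. This is not my actual parser: `classify_part`, `FieldBuffer` and the 1 MiB `MAX_FIELD_BYTES` cap are hypothetical names and values chosen for the example. The header parsing uses Python's standard `email` package.

```python
from email.parser import HeaderParser

MAX_FIELD_BYTES = 1 << 20  # assumed cap for non-file fields; pick what your system can afford


def classify_part(header_block: bytes):
    """Return (name, filename); filename is None for an ordinary form field."""
    msg = HeaderParser().parsestr(header_block.decode("utf-8", "replace"))
    name = msg.get_param("name", header="content-disposition")
    filename = msg.get_param("filename", header="content-disposition")
    return name, filename


class FieldBuffer:
    """Accumulates a non-file field value in memory, refusing to grow past a limit."""

    def __init__(self, limit: int = MAX_FIELD_BYTES):
        self._limit = limit
        self._data = bytearray()

    def write(self, data: bytes) -> None:
        if len(self._data) + len(data) > self._limit:
            raise ValueError("form field exceeds the configured limit")
        self._data.extend(data)

    def value(self) -> bytes:
        return bytes(self._data)
```

A part whose filename is not None would be streamed to disk, while any other part would go through something like `FieldBuffer`, so that an oversized text field cannot exhaust memory.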

Thanks for the well-thought-out suggestions, João. I'd give you a thumbs up if stackoverflow would let me.

Upvotes: 0

João Amaral

Reputation: 331

A first, very simple approach that may fit your needs, and that you can implement using memory library functions to find and move data, is as follows:

Assuming that your boundary is N + 1 bytes long (in your case the boundary is 40 bytes, so N is 39), allocate a buffer of any size bigger than the boundary, do a first receive that fills the buffer, and then enter a loop that processes the data as described below until there is no more data to receive:

1 - Look for the boundary in the buffer. If you find it, you are done with your first file: write the bytes that precede the match to the file and close it. Then open the second file, move the bytes that follow the match to the start of the buffer, receive more bytes to refill the buffer, and continue the loop.

2 - If you don't find the boundary in the buffer, write to your file all data up to and including (buffer + sizeof(buffer) - N - 1), move the last N bytes to the start of the buffer (they may be the beginning of a boundary that is split across two receives), receive more bytes to fill up the buffer, and continue the loop.
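
The two steps above could be sketched in Python roughly like this. It is an illustration under assumptions, not a complete multipart parser: `copy_until`, `pending` and `bufsize` are hypothetical names, `stream` is assumed to be any object whose `read(n)` returns `b""` at end of input, and `delimiter` would in practice be `b"\r\n--"` plus the boundary, because the CRLF before the boundary is not part of the file.

```python
def copy_until(stream, delimiter, out, pending=b"", bufsize=65536):
    """Copy bytes from stream to out until delimiter is seen.

    Returns the bytes already read past the delimiter, or None if the
    stream ended before a delimiter was found.
    """
    keep = len(delimiter) - 1  # the "N" bytes that might start a split delimiter
    buf = pending
    while True:
        pos = buf.find(delimiter)
        if pos != -1:
            out.write(buf[:pos])                # step 1: data before the match
            return buf[pos + len(delimiter):]   # leftover belongs to the next part
        if len(buf) > keep:
            out.write(buf[:len(buf) - keep])    # step 2: flush all but the last N bytes
            buf = buf[len(buf) - keep:]
        chunk = stream.read(bufsize)
        if not chunk:
            out.write(buf)                      # stream ended without a delimiter
            return None
        buf += chunk
```

You would call it once per part, passing the returned leftover bytes as `pending` for the next call, so memory use stays near `bufsize` plus the delimiter length regardless of file size.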

A cleaner approach, which does not move the data but requires you to examine each byte, is as follows:

1 - Allocate a buffer of any size.

2 - Set a match counter to zero.

3 - Set a boundingData array containing the bytes of your boundary.

4 - Enter a loop that does the following:

5 - Receive bytes in the buffer up to the buffer size or to the receiving end

6 - Enter another loop that examines each byte of the received data as follows:

If the byte being examined is equal to boundingData[matchCounter], increment matchCounter and check whether it has reached the length of your boundingData. If it has, close your file, open the next one and set matchCounter back to zero.

Otherwise the byte is not part of the boundary. If matchCounter is not zero, first write(boundingData, matchCounter) to emit the partially matched bytes you were holding back and reset matchCounter to zero; the examined byte may itself begin a new match, so compare it against boundingData[0] before deciding. If it does not match, write the examined byte to your file.

When you're done with your buffer go back to step 5 until you don't have any more data to receive.
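
Here is a minimal Python sketch of this match-counter loop. The names `scan`, `on_data` and `on_boundary` are hypothetical, and the callbacks stand in for "write to the current file" and "close the file and open the next one". It assumes the first byte of `delimiter` does not occur anywhere else in it, which holds for `b"\r\n--"` plus the boundary because a boundary cannot contain CR. For an arbitrary pattern you would need a fallback table in the style of Knuth-Morris-Pratt instead of the simple restart used here.

```python
def scan(chunks, delimiter, on_data, on_boundary):
    """Feed chunks through a match counter, holding back only matched bytes."""
    matched = 0  # how many leading bytes of delimiter have matched so far
    for chunk in chunks:
        for byte in chunk:  # iterating over bytes yields ints
            if matched and byte != delimiter[matched]:
                # False alarm: emit the held-back bytes and start over.
                on_data(delimiter[:matched])
                matched = 0
            if byte == delimiter[matched]:
                matched += 1
                if matched == len(delimiter):
                    on_boundary()
                    matched = 0
            else:
                on_data(bytes([byte]))
    if matched:
        # Input ended in the middle of a partial match; those bytes were data.
        on_data(delimiter[:matched])
```

A real implementation would batch the output rather than emit one byte at a time, but the state that has to survive between receives is only the counter, which is why a boundary split across two buffers needs no special handling.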

Upvotes: 5
