hotzen

Reputation: 2873

Parsing HTTP - Bytes.length != String.length

I consume HTTP via nio.SocketChannel, so I receive the data in chunks as Array[Byte]. I want to feed these chunks into a parser and continue parsing each time a new chunk arrives.

HTTP itself seems to use ISO-8859-1 for the status line and headers, but the payload/body may be arbitrarily encoded. If the HTTP Content-Length specifies X bytes, the UTF-8-decoded body may contain far fewer characters, since a single character can take 2 or more bytes in UTF-8.
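To make the mismatch concrete, here is a minimal sketch. The object name LengthMismatch and the sample text are made up for illustration only:

```scala
import java.nio.charset.StandardCharsets

// Illustration only: the same 5-character string has different byte lengths
// depending on the charset used to encode it.
object LengthMismatch {
  val text: String = "h\u00e9llo" // "h", e-acute, "llo": 5 characters

  // e-acute (U+00E9) is encoded as the two bytes 0xC3 0xA9 in UTF-8
  val utf8Bytes: Array[Byte] = text.getBytes(StandardCharsets.UTF_8)

  // ISO-8859-1 encodes every character it supports as exactly one byte
  val latin1Bytes: Array[Byte] = text.getBytes(StandardCharsets.ISO_8859_1)
}
```

So a Content-Length computed for the UTF-8 form would be 6, while the decoded String has length 5.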

So what is a good parsing strategy that honors an explicitly specified Content-Length and/or a Transfer-Encoding: chunked, where each chunk announces its own length?

Upvotes: 2

Views: 340

Answers (2)

hotzen

Reputation: 2873

I accumulate all incoming Array[Byte] chunks in an ArrayBuffer[Byte], which lets me count bytes. The HTTP protocol part (status line + headers) is decoded by searching for the position of the next CRLF and then decoding the bytes from 0 until that position as ISO-8859-1.
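Roughly, the header part could look like the sketch below. The class name LineBuffer and its methods are hypothetical, chosen for this example, and it does no validation of the lines it returns:

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: accumulates raw bytes and hands out one
// CRLF-terminated line at a time, decoded as ISO-8859-1.
final class LineBuffer {
  private val buf = ArrayBuffer.empty[Byte]
  private val Crlf: Seq[Byte] = Seq[Byte](13, 10) // '\r', '\n'

  // Add a chunk as it comes off the SocketChannel
  def append(chunk: Array[Byte]): Unit = buf ++= chunk

  // Number of bytes currently buffered
  def size: Int = buf.length

  // Returns the next complete line without its CRLF, or None if no CRLF
  // has been received yet. The line and its CRLF are removed from the buffer.
  def takeLine(): Option[String] = {
    val end = buf.indexOfSlice(Crlf)
    if (end < 0) None
    else {
      val line = new String(buf.take(end).toArray, StandardCharsets.ISO_8859_1)
      buf.remove(0, end + 2)
      Some(line)
    }
  }
}
```

An empty line returned by takeLine() marks the end of the headers; whatever is still buffered after that belongs to the body.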

Chunked bodies are accumulated in the ArrayBuffer and decoded with the specified charset only once the chunk has been received completely. This avoids MALFORMED results from the CharsetDecoder when UTF-8 data happens to be split in the middle of a multi-byte character.
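A minimal sketch of that idea, assuming the expected byte count (from Content-Length or from the chunk-size line) is already known. BodyBuffer and its method names are hypothetical:

```scala
import java.nio.charset.{Charset, StandardCharsets}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: collects exactly `expectedBytes` bytes and decodes
// them in one go, so a multi-byte character is never decoded half-received.
final class BodyBuffer(expectedBytes: Int, charset: Charset) {
  private val buf = ArrayBuffer.empty[Byte]

  // Takes as many bytes as this body still needs and returns the leftover
  // bytes, which belong to whatever follows on the wire.
  def append(chunk: Array[Byte]): Array[Byte] = {
    val missing = expectedBytes - buf.length
    val (mine, rest) = chunk.splitAt(missing)
    buf ++= mine
    rest
  }

  def isComplete: Boolean = buf.length == expectedBytes

  // Decodes only once all announced bytes have arrived
  def decode(): Option[String] =
    if (isComplete) Some(new String(buf.toArray, charset)) else None
}
```

The length check is always done on bytes, never on the decoded String, which is what keeps Content-Length and chunk sizes honest.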

For streaming HTML I have no good solution yet. Normal HTML is buffered in the ArrayBuffer and decoded after the whole document has been received (like the chunks).
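One possible direction for the streaming case, which I have not battle-tested: java.nio's CharsetDecoder can be called with endOfInput = false, in which case it leaves an incomplete trailing byte sequence in the input buffer instead of reporting it as malformed. The class name StreamingDecoder, the capacity parameter and the REPLACE error policy below are assumptions of this sketch, and end-of-stream handling (a final decode with endOfInput = true plus flush) is left out:

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.{Charset, CodingErrorAction, StandardCharsets}

// Hypothetical incremental decoder: bytes of a character that is split
// across two chunks are kept until the rest of the character arrives.
final class StreamingDecoder(charset: Charset, capacity: Int = 8192) {
  private val decoder = charset.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)

  // Not-yet-decoded bytes; stays in "write mode" between calls
  private var pending: ByteBuffer = ByteBuffer.allocate(capacity)

  // Decodes as much as possible of everything received so far
  def feed(chunk: Array[Byte]): String = {
    if (pending.remaining() < chunk.length) {
      // Grow the buffer, carrying over any leftover bytes
      val bigger = ByteBuffer.allocate(pending.position() + chunk.length)
      pending.flip()
      bigger.put(pending)
      pending = bigger
    }
    pending.put(chunk)
    pending.flip() // switch to "read mode" for the decoder

    val maxChars = math.ceil(pending.remaining() * decoder.maxCharsPerByte()).toInt + 1
    val out = CharBuffer.allocate(maxChars)
    // endOfInput = false: an incomplete trailing sequence stays in `pending`
    decoder.decode(pending, out, false)

    pending.compact() // keep the undecoded tail, back to "write mode"
    out.flip()
    out.toString
  }
}
```

Feeding the UTF-8 bytes of a string split in the middle of a 2-byte character then yields the text before the split from the first call and the rest from the second call. Note that this only handles character decoding; the byte counting for Content-Length or chunk sizes still has to happen on the raw bytes before they are fed in.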

Upvotes: 0

Daniel C. Sobral

Reputation: 297265

You could use UTF-16, which is Java's internal String representation anyway. It uses 2 bytes for each character, except for characters that need a surrogate pair, which take 4 bytes. So you could scan the string for surrogate characters up to the allowed length, account for them as appropriate, and just copy the substrings.

Upvotes: 1
