Reputation: 41433
I'm writing a custom cross-platform minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)
The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.
But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: the client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it is actual UTF-8 characters, not something else.)
I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from a FILE *.)
U<CRLF> ; data type marker (actually read by dispatching code)
<SIZE><CRLF> ; UTF-8 string size in characters
<DATA><CRLF> ; data blob
Example ("Юникод!" is 7 characters, but 13 bytes in UTF-8):
U
7
Юникод!
Update:
One batch of data can contain more than one datagram, so approximate reads would not work; I need to read an exact number of characters.
And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator; I don't want to mess with escaping it in the data.
Upvotes: 6
Views: 1659
Reputation: 41433
This looks like exactly what I need. I wish I had found it earlier:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Upvotes: 0
Reputation: 324
Well, the length property of JavaScript strings seems to count codepoints, not characters, as you can see (but wait! it's not quite codepoints):
> s1='\u0061\u0301'
'á'
> s2='\u00E1'
'á'
> s1.length
2
> s2.length
1
>
Although that's with V8. Looking around, it seems that this is actually what the ECMAScript standard requires: checking ECMA-262, on pages 40-41 of the PDF it says "The length of a String is the number of elements (i.e., 16-bit values) within it", and then goes on to make clear that the elements are UTF-16 code units. Sadly, that's not quite "codepoints": a codepoint above U+FFFF is stored as a surrogate pair and counts as two elements. Basically, this makes the string length property rather useless as a character count. Looking around I find this:
How can I tell if a string contains multibyte characters in Javascript?
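Since length counts UTF-16 code units, a server scanning UTF-8 can still reproduce the same count: every non-continuation byte starts one codepoint, and a 4-byte sequence (lead byte 11110xxx) encodes a codepoint above U+FFFF, which JavaScript stores as a surrogate pair, i.e. two units. A minimal C89 sketch of the idea (my own illustration, assuming well-formed UTF-8 input):

#include <stddef.h>

/* Count UTF-16 code units (what JavaScript's string length reports)
   in a well-formed UTF-8 buffer.  Continuation bytes (10xxxxxx) are
   skipped; a 4-byte lead (11110xxx) contributes two units because it
   becomes a surrogate pair in UTF-16. */
static size_t utf16_units(const unsigned char *buf, size_t len)
{
    size_t units = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)   /* lead byte of a codepoint */
            units += ((buf[i] & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}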
Upvotes: 2
Reputation: 324
Characters? Or codepoints? The two are not the same. Unicode is... complex. You could count all of these different things about a UTF-8 string: length in bytes, length in codepoints, length in characters, length in glyphs, and length in grapheme clusters. All of those might come out different for any given string!
My first inclination is to tell that broken client to go away. But assuming you can't do that, you need to ask what exactly the client is counting. The simplest thing to count, after bytes, is codepoints -- that's what UTF-8 encodes, after all. After that, characters; but then you need tables of combining codepoints so that you can identify sequences of codepoints that make up a single character. If the client counts glyphs or grapheme clusters, you're in for a world of hurt. But most likely the client counts either codepoints or characters. If it counts codepoints, then just count the bytes that are not continuation bytes, i.e., every byte except those matching the binary pattern 10xxxxxx (though you probably want to implement enough UTF-8 validation to protect against overlong sequences); a sketch follows below. If it counts characters, then you need to identify combining marks and count them as part of the associated non-combining codepoint.
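For the codepoint case, the counting loop is tiny. A C89 sketch (mine, not from any library), assuming the buffer has already been validated as UTF-8:

#include <stddef.h>

/* Count codepoints in a validated UTF-8 buffer: each codepoint has
   exactly one lead byte, and continuation bytes all match 10xxxxxx,
   i.e. their top two bits are "10". */
static size_t count_codepoints(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    return count;
}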
Upvotes: 1
Reputation: 6573
If the DATA can't contain a CRLF, you could use the CRLF itself as the framing delimiter: just ignore the SIZE and read until CRLF. A sketch of such a reader is below.
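A C89 sketch of that reader (function name and error convention are my own), for the case where CRLF really cannot occur in DATA:

#include <stdio.h>

/* Read bytes from fp into buf until a CRLF pair; NUL-terminate and
   return the number of data bytes (excluding the CRLF), or -1 on
   EOF or buffer overflow. */
static long read_until_crlf(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;
    int prev = 0;
    while ((c = getc(fp)) != EOF) {
        if (prev == '\r' && c == '\n') {
            buf[--n] = '\0';           /* drop the stored '\r' */
            return (long)n;
        }
        if (n + 1 >= cap)
            return -1;
        buf[n++] = (char)c;
        prev = c;
    }
    return -1;
}

Note, though, that the question's update says the data may contain any characters, so this only helps if the protocol can be changed to escape CRLF inside DATA.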
Upvotes: 0
Reputation: 225052
If the length you get doesn't match the number of bytes you get, you have a few choices.
Read one byte at a time and assemble the bytes into characters until you have the matching number of characters.
Add a known terminator and skip the string size entirely. Just read one byte at a time until you read the terminator sequence.
Read the number of bytes listed in the header (since that's the minimum possible). Figure out whether you have enough characters. If not, read some more! (See the sketch below.)
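The third option is attractive because the SIZE header gives you the minimum byte count for free (every character is at least one byte). A C89 sketch of it (helper names are my own; it assumes well-formed UTF-8 on the wire):

#include <stdio.h>

/* Count complete UTF-8 characters in buf[0..len): hop from lead byte
   to lead byte; a sequence whose continuation bytes run past len is
   not counted.  Assumes well-formed input. */
static size_t complete_chars(const unsigned char *buf, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        size_t seq = 1;
        if (buf[i] >= 0xF0)      seq = 4;
        else if (buf[i] >= 0xE0) seq = 3;
        else if (buf[i] >= 0xC0) seq = 2;
        if (i + seq > len)
            break;                     /* trailing partial sequence */
        i += seq;
        n++;
    }
    return n;
}

/* Read exactly `want` UTF-8 characters from fp: grab `want` bytes
   first (the minimum possible), then top up one byte at a time.
   Returns the byte count, or -1 on EOF or overflow. */
static long read_n_chars(FILE *fp, unsigned char *buf, size_t cap,
                         size_t want)
{
    size_t len;
    int c;
    if (want > cap || fread(buf, 1, want, fp) != want)
        return -1;
    len = want;
    while (complete_chars(buf, len) < want) {
        if (len >= cap || (c = getc(fp)) == EOF)
            return -1;
        buf[len++] = (unsigned char)c;
    }
    return (long)len;
}

Rescanning the buffer after each extra byte is quadratic in the worst case; a production version would resume counting from the last complete character instead.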
Upvotes: 0
Reputation: 437574
It's pretty easy to write a UTF-8 "reader" given the information here; UTF-8 was designed so that tasks like this one would be easy.
In essence, you keep reading characters until you have read as many as the client told you to expect. You know that you've read a whole character from the UTF-8 encoding definition, specifically:
If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern.
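In C89 that rule turns into a short loop: read a lead byte, derive the sequence length from its high bits, then read that many continuation bytes, repeating until the requested character count is reached. A sketch (the function name and error convention are mine; it checks lead/continuation patterns but not overlong sequences):

#include <stdio.h>

/* Read exactly `count` UTF-8 characters from fp into buf.  The lead
   byte's high bits give the sequence length; continuation bytes must
   match 10xxxxxx.  Returns bytes stored, or -1 on EOF, overflow, or
   a malformed byte. */
static long read_utf8_chars(FILE *fp, unsigned char *buf, size_t cap,
                            size_t count)
{
    size_t len = 0;
    size_t i, seq;
    int c;
    for (i = 0; i < count; i++) {
        if ((c = getc(fp)) == EOF)
            return -1;
        if (c < 0x80)      seq = 1;    /* 0xxxxxxx: ASCII */
        else if (c < 0xC0) return -1;  /* stray continuation byte */
        else if (c < 0xE0) seq = 2;    /* 110xxxxx */
        else if (c < 0xF0) seq = 3;    /* 1110xxxx */
        else if (c < 0xF8) seq = 4;    /* 11110xxx */
        else               return -1;  /* invalid in UTF-8 */
        if (len + seq > cap)
            return -1;
        buf[len++] = (unsigned char)c;
        while (--seq > 0) {            /* continuation bytes */
            if ((c = getc(fp)) == EOF || (c & 0xC0) != 0x80)
                return -1;
            buf[len++] = (unsigned char)c;
        }
    }
    return (long)len;
}

The dispatching code would then consume the trailing CRLF before looking for the next datagram.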
Upvotes: 9