Kzwix

Reputation: 173

Easy way to read UTF-8 characters from a binary file?

Here is my problem: I have to read "binary" files, that is, files which have varying "record" sizes, and which may contain binary data, as well as UTF-8-encoded text fields.

Reading a given number of bytes from an input file is trivial, but I was wondering if there were functions to easily read a given number of characters (not bytes) from a file? For instance, if I know I need to read a 10-character field encoded in UTF-8, it would be at least 10 bytes long, but could be up to 40 bytes if we're talking "high" codepoints.

I emphasize that I'm reading a "mixed" file, that is, I cannot process it whole as UTF-8, because the binary fields have to be read without being interpreted as UTF-8 characters.

So, while doing it by hand is pretty straightforward (the naïve byte-by-byte approach isn't hard to implement, even though I'm dubious about its efficiency), I'm wondering if there are better alternatives out there. If possible, in the standard library, but I'm open to third-party code too, if my organization validates its use.

Upvotes: 0

Views: 1625

Answers (3)

Kzwix

Reputation: 173

Well, for now, I've settled on creating a function which allocates a buffer of size 4 * numberOfCharactersToRead + 1 (as a UTF-8 character is encoded on at most 4 bytes).

Then I fread() that much (or as much as I can, if I hit EOF). And then I merely test the upper bits to know whether I hit a 1-byte, 2-byte, 3-byte, or 4-byte character. I check the following bytes as needed, and note where it puts me.

After I read the required number of characters, I take note of the number of bytes it really took, then adjust the file pointer back if I read more than needed. I also realloc() the buffer to downsize it to the needed length.

I'm pretty sure it's more efficient than calling getwc() repeatedly before converting the wchar_t back to UTF-8 (because, in the end, I need to keep it as a UTF-8 sequence, as I'm storing that data in a Perl scalar, and that's the way Perl does it internally).

I end the UTF-8 "string" I read with a 0 (hence the extra byte), in order to be able to print it with standard C functions, and that's that.

Also, to store the "raw binary" along with UTF-8 encoded text, when I concatenate them, I merely encode the binary bytes as UTF-8 codepoints. This way, under Perl, I get to treat a character or a "raw byte" the same way, as UTF-8 characters. I'll just have to get the "codepoint" value back when I need to work on a raw byte disguised as a character.

I know I didn't mention Perl in the tags, but it didn't matter for the question, so I'm only mentioning it to provide some context as to why I went that way.

Thanks to all the people having posted helpful suggestions :)

Upvotes: 0

ikegami

Reputation: 386706

You could also use something like this:

#include <stdint.h>
#include <stdio.h>

static unsigned char num_most_significant_ones[] = {
    /* 80 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* 90 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* A0 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* B0 */   1, 1, 1, 1, 1, 1, 1, 1,   1, 1, 1, 1, 1, 1, 1, 1,
    /* C0 */   2, 2, 2, 2, 2, 2, 2, 2,   2, 2, 2, 2, 2, 2, 2, 2,
    /* D0 */   2, 2, 2, 2, 2, 2, 2, 2,   2, 2, 2, 2, 2, 2, 2, 2,
    /* E0 */   3, 3, 3, 3, 3, 3, 3, 3,   3, 3, 3, 3, 3, 3, 3, 3,
    /* F0 */   4, 4, 4, 4, 4, 4, 4, 4,   5, 5, 5, 5, 6, 6, 7, 8
};

static unsigned char lead_byte_data_mask[] = {
   0x7F, 0, 0x1F, 0x0F, 0x07, 0x03, 0x01
};

static int32_t min_by_len[] = {
   -1, 0x00, 0x80, 0x800, 0x10000
};

// buf must be capable of accommodating at least 4 bytes.
// Returns 0 on EOF or read error before any byte is consumed.
size_t read_one_utf8_char(FILE* stream, char* buf) {
   int lead = getc(stream);
   if (lead == EOF)
      return 0;

   buf[0] = lead;
   if (lead < 0x80)
      return 1;

   unsigned len = num_most_significant_ones[ lead - 0x80 ];
   if (len == 1 || len > 6)
      goto ERROR;

   unsigned char mask = lead_byte_data_mask[len];
   uint32_t cp = lead & mask;
   for (int i=1; i<len; ++i) {
      int ch = getc(stream);
      if (ch == EOF)          // Premature EOF or read error.
         goto ERROR;
      if ((ch & 0xC0) != 0x80) {  // Premature end of character.
         ungetc(ch, stream);
         goto ERROR;
      }
      cp = (cp << 6) | (ch & 0x3F);
      if (i < 4)
         buf[i] = ch;
   }

   if (len > 4 || cp < min_by_len[len] || ( cp >= 0xD800 && cp < 0xE000 ) || cp >= 0x110000)
      goto ERROR;

   return len;

ERROR:
   // Return U+FFFD (REPLACEMENT CHARACTER).
   buf[0] = (char)0xEF;
   buf[1] = (char)0xBF;
   buf[2] = (char)0xBD;
   return 3;
}

Unlike getwc, this returns UTF-8.

Also, it validates, replacing illegal sequences with U+FFFD. (It doesn't replace noncharacters.[1][2]) I don't know if getwc does that.

Untested.

Upvotes: 1

Steve Summit

Reputation: 48123

Here are two possibilities:

(1) If (but typically only if) your locale is set to handle UTF-8, the getwc function should read exactly one UTF-8-encoded Unicode character, even if it's multiple bytes long. So you could do something like

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* The exact locale name is system-dependent; "" selects the environment's. */
setlocale(LC_CTYPE, "en_US.UTF-8");
wint_t c;
int i;

for(i = 0; i < 10; i++) {
    c = getwc(ifp);    /* ifp is the input FILE * */
    /* do something with c */
}

Now, c here will be a single integer containing a Unicode codepoint, not a UTF-8 multibyte sequence. If (as is likely) you want to store UTF-8 strings in your in-memory data structure(s), you'd have to convert back to UTF-8, likely using wctomb.

(2) You could read N bytes from the input, then convert them to a wide character stream using mbstowcs. This isn't perfect, either, because it's hard to know what N should be, and the wide character string that mbstowcs gives you is, again, probably not what you want.

But before exploring either of these approaches, the real question is: what is the format of your input? Those UTF-8-encoded fragments of text, are they fixed-size, or does the file format contain an explicit count saying how big they are? And in either case, is their size specified in bytes or in characters? Hopefully it's specified in bytes, in which case you don't need any conversion to or from UTF-8; you can just read N bytes using fread. If the count is specified in characters (which would be kind of weird, in my experience), you would probably have to use something like my approach (1) above.

Other than a loop like in (1) above, I don't know of a simple, encapsulated way to do the equivalent of "read N UTF-8 characters, no matter how many bytes it takes".

Upvotes: 1
