Adrian

Reputation: 2365

Extract (first) UTF-8 character from a std::string

I need to use a C++ implementation of PHP's mb_strtoupper function to imitate Wikipedia's behavior.

My problem is that I want to feed only a single UTF-8 character to the function, namely the first one in a std::string.

std::string s("äbcdefg");
mb_strtoupper(s[0]); // this obviously can't work with multi-byte characters
mb_strtoupper('ä'); // works

Is there an efficient way to detect/return only the first UTF-8 character of a string?

Upvotes: 7

Views: 2270

Answers (2)

Adrian McCarthy

Reputation: 47952

In UTF-8, the high bits of the first byte tell you how many subsequent bytes are part of the same code point.

0b0xxxxxxx: this byte is the entire code point
0b10xxxxxx: this byte is a continuation byte - this shouldn't occur at the start of a string
0b110xxxxx: this byte plus the next (which must be a continuation byte) form the code point
0b1110xxxx: this byte plus the next two form the code point
0b11110xxx: this byte plus the next three form the code point

The original UTF-8 design continued this pattern to longer sequences, but valid UTF-8 as currently specified (RFC 3629) never uses more than four bytes to represent a single code point.

If you write a function that counts the number of leading bits set to 1, then you can use it to figure out where to split the byte sequence in order to isolate the first logical code point, assuming the input is valid UTF-8. If you want to harden against invalid UTF-8, you'd have to write a bit more code.
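A minimal sketch of the leading-bit approach described above, assuming the input is valid UTF-8. The names CountLeadingOnes and FirstCodePointLength are made up for illustration; they are not from any library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: number of leading 1 bits in a byte (0 to 8).
inline int CountLeadingOnes(unsigned char byte) {
  int count = 0;
  while (count < 8 && (byte & (0x80u >> count)) != 0) {
    ++count;
  }
  return count;
}

// Hypothetical helper: length in bytes of the first code point.
// Assumes valid UTF-8; returns 0 for an empty string or for a first byte
// that cannot start a sequence (a continuation byte or an invalid lead).
inline std::size_t FirstCodePointLength(const std::string &text) {
  if (text.empty()) return 0;
  const int ones = CountLeadingOnes(static_cast<unsigned char>(text[0]));
  if (ones == 0) return 1;              // 0xxxxxxx: single-byte code point
  if (ones == 1 || ones > 4) return 0;  // 10xxxxxx or 11111xxx: not a valid lead
  // The lead byte announces `ones` bytes in total; clamp to the string
  // length so a truncated sequence cannot run past the end.
  return std::min(static_cast<std::size_t>(ones), text.size());
}
```

This only looks at the first byte. It does not check that the following bytes really are continuation bytes, so it is not a validator.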

Another way to do it is to take advantage of the fact that continuation bytes always match the pattern 0b10xxxxxx, so you take the first byte, and then keep taking bytes as long as the next byte matches that pattern.

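// Returns the length in bytes of the first code point, assuming valid UTF-8.
std::size_t GetFirst(const std::string &text) {
  if (text.empty()) return 0;
  std::size_t length = 1;
  // Continuation bytes match 0b10xxxxxx; stop at the end of the string.
  while (length < text.size() &&
         (text[length] & 0b11000000) == 0b10000000) {
    ++length;
  }
  return length;
}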
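Once you have the length, slicing is straightforward. Here is a rough sketch (C++17, for std::string_view) that returns the first code point as a view and rebuilds the string with that code point replaced. FirstCodePointView and WithFirstReplaced are hypothetical names, and the replacement is whatever your own uppercase function returns.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: view of the bytes that make up the first code point.
// Assumes valid UTF-8; scans continuation bytes (10xxxxxx) after the first byte.
inline std::string_view FirstCodePointView(std::string_view text) {
  if (text.empty()) return {};
  std::size_t length = 1;
  while (length < text.size() &&
         (static_cast<unsigned char>(text[length]) & 0xC0) == 0x80) {
    ++length;
  }
  return text.substr(0, length);
}

// Hypothetical helper: copy of `text` with its first code point swapped
// for `replacement` (for example, the uppercased form of that code point).
inline std::string WithFirstReplaced(std::string_view text,
                                     std::string_view replacement) {
  const std::string_view first = FirstCodePointView(text);
  std::string result(replacement);
  result.append(text.substr(first.size()));
  return result;
}
```

The view avoids a copy, but it is only valid as long as the original string is alive and unmodified.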

For many languages, a single code point usually maps to a single character. But what people think of as single characters may be closer to what Unicode calls a grapheme cluster, which is one or more code points that combine to produce a glyph.

In your example, the ä can be represented in different ways: it could be the single code point U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be a combination of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Fortunately, just picking the first code point should work for your goal of capitalizing the first letter: in the decomposed form, uppercasing the a leaves the combining diaeresis attached to it.
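To see which of the two representations you have, you can decode the first code point into its numeric value. This is only a sketch: DecodeFirstCodePoint is a made-up name, and returning U+FFFD for empty or malformed input is my own assumption. It does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: numeric value of the first code point in `text`.
// Returns U+FFFD (assumed error convention) for empty or malformed input.
inline char32_t DecodeFirstCodePoint(const std::string &text) {
  constexpr char32_t kReplacement = 0xFFFD;
  if (text.empty()) return kReplacement;

  const unsigned char lead = static_cast<unsigned char>(text[0]);
  std::size_t extra = 0;  // number of continuation bytes expected
  char32_t value = 0;     // payload bits collected so far
  if (lead < 0x80) {
    return lead;                         // 0xxxxxxx
  } else if ((lead & 0xE0) == 0xC0) {
    extra = 1; value = lead & 0x1F;      // 110xxxxx
  } else if ((lead & 0xF0) == 0xE0) {
    extra = 2; value = lead & 0x0F;      // 1110xxxx
  } else if ((lead & 0xF8) == 0xF0) {
    extra = 3; value = lead & 0x07;      // 11110xxx
  } else {
    return kReplacement;                 // continuation byte or invalid lead
  }

  if (text.size() < 1 + extra) return kReplacement;  // truncated sequence
  for (std::size_t i = 1; i <= extra; ++i) {
    const unsigned char byte = static_cast<unsigned char>(text[i]);
    if ((byte & 0xC0) != 0x80) return kReplacement;  // not 10xxxxxx
    value = (value << 6) | (byte & 0x3F);            // append 6 payload bits
  }
  return value;
}
```

For the precomposed form (bytes C3 A4) this yields U+00E4; for the decomposed form (bytes 61 CC 88) it yields U+0061, with U+0308 left as the next code point.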

If you really need the first grapheme cluster, you have to look beyond the first code point to see if the next one(s) combine with it. For many languages, it's enough to know which code points are "non-spacing" or "combining" or variation selectors. For some complex scripts (e.g., Hangul?), you might need to turn to the Unicode Consortium's technical report on text segmentation (UAX #29).
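As a very rough illustration of "looking beyond the first code point", the sketch below extends the first code point with any following marks from the Combining Diacritical Marks block (U+0300 to U+036F), whose UTF-8 encodings run from CC 80 to CD AF. This is a deliberate simplification, not the real segmentation algorithm: it ignores every other combining range, variation selectors, and complex scripts. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the two bytes at `pos` encode a code point
// in U+0300..U+036F (UTF-8: CC 80 through CD AF).
inline bool IsCombiningDiacriticAt(const std::string &text, std::size_t pos) {
  if (pos + 1 >= text.size()) return false;
  const unsigned char b0 = static_cast<unsigned char>(text[pos]);
  const unsigned char b1 = static_cast<unsigned char>(text[pos + 1]);
  if (b0 == 0xCC) return b1 >= 0x80 && b1 <= 0xBF;  // U+0300..U+033F
  if (b0 == 0xCD) return b1 >= 0x80 && b1 <= 0xAF;  // U+0340..U+036F
  return false;
}

// Hypothetical helper: length in bytes of the first code point plus any
// directly following combining diacritics. Assumes valid UTF-8.
inline std::size_t FirstClusterLength(const std::string &text) {
  if (text.empty()) return 0;
  std::size_t length = 1;
  // Consume the continuation bytes of the first code point.
  while (length < text.size() &&
         (static_cast<unsigned char>(text[length]) & 0xC0) == 0x80) {
    ++length;
  }
  // Each mark in this block is exactly two bytes long.
  while (IsCombiningDiacriticAt(text, length)) {
    length += 2;
  }
  return length;
}
```

With this, the decomposed ä (bytes 61 CC 88) counts as three bytes and the precomposed ä (bytes C3 A4) as two. For anything beyond simple Latin text with accents, a library that implements the full Unicode rules is the safer choice.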

Upvotes: 8

Mish7913

Reputation: 35

Library str.h

#include <iostream>
#include "str.h"

int main (){
    std::string text = "äbcdefg";
    std::string str = str::substr(text, 0, 1); // returns "ä"
    std::cout << str << std::endl;
}

Upvotes: 1
