Reputation: 23

Finding and comparing a Unicode character in C++

I am writing a Lexical analyzer that parses a given string in C++. I have a string

line = R"(if n = 4 # comment
             return 34;  
             if n≤3 retur N1
          FI)";

All I need to do is output all words, numbers and tokens in a vector.

My program works with regular tokens, words and numbers; but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.

So far, my code basically takes the string line by line, reads the first word, number or token, chops it off and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≠ (of course), and I am also not clear on how much of the string I need to chop off in order to get rid of the Unicode char. In the case of "!=" I simply remove line[0] and line[1].

Upvotes: 0

Answers (2)

phuclv

Reputation: 41753

All Unicode encodings except UTF-32 are variable-length. Therefore the next character isn't necessarily a single char, and you must read it as a string. Since you're using a char* or std::string, the encoding is likely UTF-8, and the next character can be returned as a std::string.

The encoding of UTF-8 is very simple and you can read about it everywhere. In short, the first byte of a sequence will indicate how long that sequence is and you can get the next character like this:

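#include <cstddef>
#include <stdexcept>
#include <string>

std::string getNextChar(const std::string& str, std::size_t index)
{
    // Work on an unsigned byte so the masks below do not depend on char's signedness.
    const unsigned char lead = static_cast<unsigned char>(str[index]);

    // Note the parentheses: == binds tighter than &, so `lead & 0x80 == 0` would be wrong.
    if ((lead & 0x80) == 0x00)             // 1-byte sequence (ASCII)
        return str.substr(index, 1);
    else if ((lead & 0xE0) == 0xC0)        // 2-byte sequence
        return str.substr(index, 2);
    else if ((lead & 0xF0) == 0xE0)        // 3-byte sequence
        return str.substr(index, 3);
    else if ((lead & 0xF8) == 0xF0)        // 4-byte sequence
        return str.substr(index, 4);
    throw std::runtime_error("Invalid UTF-8 lead byte");
}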

It's a very simple decoder: it only looks at the first byte and doesn't yet handle invalid code points or a broken data stream. If you need better handling, you'll have to use a proper UTF-8 library.
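To illustrate what slightly stricter handling could look like, here is a minimal sketch that also checks for truncated input and malformed continuation bytes. The function name utf8SequenceLength is made up for this example, and it still does not reject overlong encodings or surrogate code points, so it is not a substitute for a real library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): returns the byte length
// (1 to 4) of the UTF-8 sequence starting at `index`, or 0 if the lead byte
// is invalid, the sequence is cut off, or a continuation byte is malformed.
std::size_t utf8SequenceLength(const std::string& str, std::size_t index)
{
    if (index >= str.size())
        return 0;

    const unsigned char lead = static_cast<unsigned char>(str[index]);
    std::size_t length = 0;
    if      ((lead & 0x80) == 0x00) length = 1;   // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) length = 2;   // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) length = 3;   // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) length = 4;   // 11110xxx
    else return 0;                                // stray continuation or invalid lead byte

    if (length > str.size() - index)
        return 0;                                 // sequence is truncated

    // Every following byte must look like 10xxxxxx.
    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char c = static_cast<unsigned char>(str[index + i]);
        if ((c & 0xC0) != 0x80)
            return 0;
    }
    return length;
}
```

A lexer could call this at the current position and chop off that many bytes, treating a return value of 0 as an input error.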

Upvotes: 2

HAL9000

Reputation: 2188

If your input file is UTF-8, just treat your Unicode characters (≤, ≠, etc.) as strings. You then use the same logic to recognize "≤" as you would for "<=". The length of a Unicode character in bytes, i.e. how much you need to chop off, is then given by strlen("≤").
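As a rough sketch of that idea, assuming the input is UTF-8: the helper below (matchOperator is a hypothetical name, and the operator list is only an example) checks whether the text at a given position starts with one of the known operators and reports how many bytes to remove. The two Unicode operators are written as explicit UTF-8 byte escapes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if `input` has a known operator at byte offset `pos`
// (with pos <= input.size()), return the operator's length in bytes;
// otherwise return 0.
std::size_t matchOperator(const std::string& input, std::size_t pos)
{
    // Longer spellings come first so that "<=" is not matched as "<".
    static const std::string operators[] = {
        "\xE2\x89\xA4",   // ≤ (U+2264), 3 bytes in UTF-8
        "\xE2\x89\xA0",   // ≠ (U+2260), 3 bytes in UTF-8
        "<=", "!=", "<", "="
    };
    for (const std::string& op : operators) {
        // compare() clamps the range to the end of `input`, so a short
        // remainder simply fails to match.
        if (input.compare(pos, op.size(), op) == 0)
            return op.size();
    }
    return 0;
}
```

With this, "≤" is handled exactly like "!=": when the function returns n > 0, the token is input.substr(pos, n) and the lexer advances by n bytes (3 for ≤ and ≠, 2 for !=).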

Upvotes: 3
