raharaha

Reputation: 117

Compare iterated character from std::string with unicode C++

I've been struggling with this problem for quite some time, and this is my first time dealing with Unicode or UTF-8.

What I'm trying to do is iterate over a std::string that contains a mix of ordinary letters and a Unicode symbol, which in my case is the en dash "–". More info: http://www.fileformat.info/info/unicode/char/2013/index.htm

This is the code I've tried, and it won't compile:

#include <iostream>
#include <string>

int main()
{
    std::string str = "test string with symbol – and !";
    for (auto &letter : str) {
        if (letter == "–") {
            std::cout << "found!" << std::endl;
        }
    }
    return 0;
}

This is the output from my compiler:

main.cpp: In function 'int main()':
main.cpp:18:23: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
     if (letter == "–") {
                   ^

Also, while searching the internet I found some interesting information on this kind of task: How to search a non-ASCII character in a c++ string?

But when I modified my code to use those UTF-8 hex codes, it also wouldn't compile:

    if (letter == "\xE2\x80\x93") {
        std::cout << "found!" << std::endl;
    }

with the exact same message from my compiler: ISO C++ forbids comparison between pointer and integer.

Did I miss something? Or do I need to use a library like ICU or Boost? Your help is much appreciated. Thank you!

Update

Based on the answer from UnholySheep, I've improved my code, but it still doesn't work. It compiles, but when I run it, it doesn't print "found!". So how do I solve this? Thank you.

Upvotes: 1

Views: 2481

Answers (2)

Serge Ballesta

Reputation: 148910

As UnholySheep said in a comment, "–" is a string literal, that is, a char array, not a single char. Assuming a UTF-8 representation, const char em_dash[] = "–"; is the same as const char em_dash[] = {'\xe2', '\x80', '\x93', '\0'};.

You can only find single-byte characters with your current code. For example this would work correctly:

...
if (letter == '!')
...

because '!' is a char constant.

If you only want to process Unicode characters in the Basic Multilingual Plane (code points up to 0xFFFF), wide characters should be enough, as proposed in @ArashMohammadi's answer. An alternative that also covers characters outside the BMP, such as emoji, is to use a std::u32string, in which every Unicode code point is represented by a single char32_t.

If you want to directly process a UTF-8 encoded string of single-byte chars, you will have to use the compare method:

std::string em_dash = "–"; // or "\xe2\x80\x93"
...
    // pos + em_dash.size() <= str.size() avoids unsigned underflow when str is shorter than em_dash
    for (size_t pos = 0; pos + em_dash.size() <= str.size(); pos++) {
        if (str.compare(pos, em_dash.size(), em_dash) == 0) {
            std::cout << "found!" << std::endl;
        }
    }
...

or directly use the find method:

...
    if (str.find(em_dash) != str.npos) {
        std::cout << "found!" << std::endl;
    }
...

Upvotes: 1

Arash

Reputation: 2164

How about this code?

#include <iostream>
#include <string>

int main()
{
    std::wstring str = L"test string with symbol – and !";
    for (auto &letter : str) {
        if (letter == L'–') {
            std::cout << "found!" << std::endl;
        }
    }
    return 0;
}

Upvotes: 2
