yash.agarwal

Reputation: 21

substr on std::string doesn't work correctly because of invisible characters that look like spaces

I have a std::string that contains characters that I am unable to see, like \xc2, etc.

I want a substring of my string, but substr gives wrong results because of characters like '    '. When I replace them with regular spaces ' ', the substring comes out correctly. Although this particular problem is solved, I don't want any other character to break it again. How do I root out this problem? [I just want to replace all such unnecessary characters with spaces.]

Upvotes: 1

Views: 542

Answers (2)

Enoc Martinez

Reputation: 165

You can convert this string to std::u16string using std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t>.

Example:

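    #include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
    #include <locale>   // std::wstring_convert (deprecated since C++17)
    #include <string>

    //Something...

    std::string hello = "H€llo World";  // UTF-8 bytes, assuming a UTF-8 source file
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::u16string hello_world_u16 = convert.from_bytes(hello);      // UTF-8 -> UTF-16
    std::string hello_world_u8 = convert.to_bytes(hello_world_u16);  // UTF-16 -> UTF-8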

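With UTF-16 (char16_t), every character in the Basic Multilingual Plane, including the no-break space, is a single code unit, so substr will not split it. Characters outside that range still take two code units (a surrogate pair).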

Upvotes: 1

Sorin

Reputation: 11968

Your text is most likely UTF-8-encoded Unicode (the most common encoding these days). \xc2 is the first byte of a multi-byte sequence, most likely "No-Break Space" (c2 a0) or something similar. std::string and substr operate on bytes and are completely unaware that the text is Unicode and that certain byte sequences must not be split. You will also get incorrect character counts, incorrect capitalization, and other strange effects.
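To make the byte-versus-character mismatch concrete, here is a minimal sketch. It assumes the string is valid UTF-8; utf8_code_points and demo_byte_split are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a string that is assumed to be valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte starts
// a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical demo string: "a", U+00A0 (bytes c2 a0), "b".
void demo_byte_split() {
    const std::string text = "a\xC2\xA0" "b";
    assert(text.size() == 4);             // size() counts four bytes...
    assert(utf8_code_points(text) == 3);  // ...but there are three characters

    // substr works on byte offsets, so this cuts between c2 and a0.
    const std::string cut = text.substr(0, 2);
    assert(static_cast<unsigned char>(cut.back()) == 0xC2);  // dangling lead byte
}
```

The result of that substr call is no longer valid UTF-8, which is why the output looks broken even though std::string itself did nothing wrong.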

The proper way to handle this is to use a library that implements Unicode correctly. And this means replacing all strings in your program with Unicode-aware variants.

I know it's a bit of work, but the alternative is that you fix this place today and tomorrow you find another operation somewhere else that does things wrong.
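For reference, the narrow stop-gap the question asks for could look roughly like this. It is only a sketch, assuming the input is valid UTF-8 and that the only offender is the no-break space (c2 a0); replace_nbsp is a hypothetical helper name. Any other invisible character passes through untouched, which is exactly the limitation described above.

```cpp
#include <cstddef>
#include <string>

// Replaces every UTF-8 encoded no-break space (bytes c2 a0) with an ASCII
// space. Assumes valid UTF-8 input; nothing else is normalized.
std::string replace_nbsp(std::string text) {
    const std::string nbsp = "\xC2\xA0";
    std::size_t pos = 0;
    while ((pos = text.find(nbsp, pos)) != std::string::npos) {
        text.replace(pos, nbsp.size(), " ");
        ++pos;  // continue searching after the space just written
    }
    return text;
}
```

After this, byte offsets and character positions agree only if the rest of the text is plain ASCII.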

Upvotes: 4
