Graphyt

Reputation: 55

Convert UTF-8 characters to nearest equivalent ASCII characters using C++ (without WinAPI)

Does anybody have a code snippet that could convert at least the most common characters for the European languages? For example:

testáén

as a UTF-8 encoded string (i.e. bytes in hex: 74 65 73 74 c3 a1 c3 a9 6e 0)

to

testaen

(I'd like to use c/c++ and std, or small crossplatform libs)

Upvotes: 4

Views: 3914

Answers (3)

bames53

Reputation: 88225

Here's code that converts characters in the ISO-8859-1 (Latin-1) range, U+00A0 through U+00FF, to ASCII. Anything else outside ASCII is replaced with a replacement character.

#include <codecvt>
#include <locale>   // std::wstring_convert is declared here
#include <array>
#include <string>

#include <iostream>

constexpr char const *rc = "?"; // replacement_char

// table mapping ISO-8859-1 characters to similar ASCII characters
std::array<char const *,96> conversions = {{
   " ",  "!","c","L", rc,"Y", "|","S", rc,"C","a","<<",   rc,  "-",  "R", "-",
    rc,"+/-","2","3","'","u", "P",".",",","1","o",">>","1/4","1/2","3/4", "?", 
   "A",  "A","A","A","A","A","AE","C","E","E","E", "E",  "I",  "I",  "I", "I",
   "D",  "N","O","O","O","O", "O","*","0","U","U", "U",  "U",  "Y",  "P","ss",
   "a",  "a","a","a","a","a","ae","c","e","e","e", "e",  "i",  "i",  "i", "i",
   "d",  "n","o","o","o","o", "o","/","0","u","u", "u",  "u",  "y",  "p", "y"    
}};

template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet;
    ~usable_facet() {}
};

std::string to_ascii(std::string const &utf8) {
    std::wstring_convert<usable_facet<std::codecvt<char32_t,char,std::mbstate_t>>,
                         char32_t> convert;
    std::u32string utf32 = convert.from_bytes(utf8);

    std::string ascii;
    for (char32_t c : utf32) {
        if (c<=U'\u007F')
            ascii.push_back(static_cast<char>(c));
        else if (U'\u00A0'<=c && c<=U'\u00FF')
            ascii.append(conversions[c - U'\u00A0']);
        else
            ascii.append(rc);
    }
    return ascii;
}

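int main() {
    // UTF-8 bytes of "testáén\n" written out explicitly; a u8"" literal
    // is char8_t-based in C++20 and no longer converts to std::string.
    std::cout << to_ascii("test\xC3\xA1\xC3\xA9n\n");
}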
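
Note that std::wstring_convert and the <codecvt> header are deprecated as of C++17, so the decoding step may produce warnings on newer compilers. As a rough sketch, the from_bytes call above could be swapped for a small hand-written decoder along these lines. The name decode_utf8 is made up for illustration, and this version does not reject overlong encodings or surrogate code points:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the answer's original code):
// decodes UTF-8 into code points. Malformed or truncated sequences
// become U+FFFD. Overlong encodings and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

With this in place, the loop in to_ascii could iterate over decode_utf8(utf8) instead of convert.from_bytes(utf8). One behavioral difference: invalid input ends up as U+FFFD (and therefore as the replacement character) rather than raising an exception.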

Upvotes: 5

Nicol Bolas

Reputation: 474376

"I'd like to use c/c++ and std, or small crossplatform libs"

Unfortunately, I'm not sure that a library exists that meets all of your criteria.

The smallest thing you're likely to find is iconv, and its UTF-8-to-ASCII converter may not do exactly what you want.
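
For reference, here is a minimal sketch of what calling iconv for this might look like. It assumes a POSIX <iconv.h> and an implementation that understands the //TRANSLIT suffix (glibc and GNU libiconv do; others may not). The helper name iconv_to_ascii is hypothetical, and the exact output depends on the implementation and the current locale, which is likely where the "may not do exactly what you want" part comes in:

```cpp
#include <iconv.h>   // POSIX, not part of the C++ standard library
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ASCII via iconv with
// transliteration. Returns false if the converter is unavailable
// or the input cannot be converted.
bool iconv_to_ascii(const std::string& utf8, std::string& out) {
    // "//TRANSLIT" is an extension; assumed to be supported here.
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1) return false;

    std::string in = utf8;          // iconv takes a non-const char*
    char* src = &in[0];
    std::size_t src_left = in.size();
    out.clear();

    char buf[256];
    bool ok = true;
    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;            // save before anything can change it
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means the output buffer filled up; keep going.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            ok = false;
            break;
        }
    }
    iconv_close(cd);
    return ok;
}
```

Depending on the platform and locale, characters such as á may come out as a, as ?, or cause the call to fail, so it is worth testing against your actual data before relying on it.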

I'm pretty sure that ICU can do what you want, and while ICU is cross-platform, nobody has ever accused it of being small.

Upvotes: 3

bmargulies

Reputation: 100186

There is a gigantic collection of Unicode characters that you'd need to handle, so 'small' is an impossible criterion. The ICU library contains what you need, but for this reason you won't find it small. You'll need, for example, to deal with both composed and non-composed modifiers (a precomposed é versus a plain e followed by a combining accent).
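
To illustrate the non-composed case only: when the input is already decomposed, the accents are separate code points in the combining diacritical marks block (U+0300 to U+036F), which encode in UTF-8 as CC 80 through CD AF. A rough sketch that simply drops those byte pairs might look like this. The function name is made up, it assumes valid UTF-8, and it does nothing for precomposed characters, which still need a table or real normalization:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes combining diacritical marks
// (U+0300..U+036F) from a UTF-8 string, leaving base letters.
// Assumes the input is valid UTF-8.
std::string strip_combining_marks(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (i + 1 < utf8.size()) {
            unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            // CC 80..CC BF -> U+0300..U+033F
            // CD 80..CD AF -> U+0340..U+036F
            bool mark = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                        (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
            if (mark) {
                ++i;        // skip both bytes of the mark
                continue;
            }
        }
        out.push_back(utf8[i]);
    }
    return out;
}
```

For example, the decomposed bytes 61 cc 81 (a plus combining acute) come out as a plain a, while the precomposed bytes c3 a1 pass through unchanged.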

If you really only care about a small subset of the possible Unicode characters, then you can create your own simple mapping table.
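
A minimal sketch of such a table, keyed directly on UTF-8 byte sequences so no decoding step is needed. The table contents, the names kAsciiMap and map_to_ascii, and the choice of ? as a fallback are all illustrative assumptions; you would fill in whatever characters your data actually contains:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical table: UTF-8 byte sequence -> ASCII replacement.
static const std::unordered_map<std::string, std::string> kAsciiMap = {
    {"\xC3\xA1", "a"},   // U+00E1 a with acute
    {"\xC3\xA9", "e"},   // U+00E9 e with acute
    {"\xC3\xB1", "n"},   // U+00F1 n with tilde
    {"\xC3\xBC", "u"},   // U+00FC u with diaeresis
    {"\xC3\x9F", "ss"},  // U+00DF sharp s
    {"\xC5\x82", "l"},   // U+0142 l with stroke
    {"\xC5\xA1", "s"},   // U+0161 s with caron
};

// Hypothetical helper: copies ASCII through, replaces mapped
// sequences, and writes "?" for anything not in the table.
std::string map_to_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) {
            out.push_back(utf8[i]);
            ++i;
            continue;
        }
        // Sequence length implied by the lead byte (1 if invalid).
        std::size_t len = (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3
                        : (b & 0xF8) == 0xF0 ? 4 : 1;
        auto it = kAsciiMap.find(utf8.substr(i, len));
        out += (it != kAsciiMap.end()) ? it->second : std::string("?");
        i += len;
    }
    return out;
}
```

Applied to the bytes from the question (74 65 73 74 c3 a1 c3 a9 6e), this yields testaen. A table like this only sees precomposed characters, so decomposed input would need its combining marks handled separately.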

Upvotes: 4
