How to convert wide string to ASCII

Question

I'm looking for a way of converting a wstring into a plain string containing only ASCII characters. Any character that isn't present in ASCII (0-127) should be converted to the closest ASCII character. If there is no similar ASCII character, the character should be omitted.

To illustrate, let's assume the following wide string:

wstring text(L"A naïve man called 晨 was having piña colada and crème brûlée.");

The converted version I'm looking for is this (notice the absence of diacritics):

string("A naive man called  was having pina colada and creme brulee.")

Edit:

Regarding the purpose: I'm writing an application that analyzes English texts. The input files are UTF-8 and may contain special characters. A part of my application uses a library written in C that only understands ASCII. So I need a way of "dumbing down" the text to ASCII without losing too much information.

Regarding the precise requirements: Any character that is a diacritic version of an ASCII character should be converted to that ASCII character; all other characters should be omitted. So ı, ĩ, and î should become i because they are all versions of the small Latin letter i. The character ɩ (iota), on the other hand, while visually similar, is not a version of the small Latin letter i and should thus be omitted.

mindriot · Accepted Answer

On GitHub, there is unidecode-cxx, a (somewhat unfinished) C++ port of node-unidecode, which is in turn a JavaScript port of Perl's Text::Unidecode. The C++ version is a bit rough around the edges, but the example in src/unidecode.cxx can be modified to convert your example string,

A naïve man called 晨 was having piña colada and crème brûlée.

as follows:

A naive man called Chen was having pina colada and creme brulee.

In order to get the code to compile without Gyp (something I've never used and haven't had the time to figure out just now), I had to modify the code somewhat (quick and dirty):

Add #include to src/unidecode.cxx, and add the following main routine:

int main() {
  string output_buf;
  string input_buf = "A naïve man called 晨 was having piña colada and crème brûlée.";
  unidecode(&input_buf, &output_buf);
  cout << output_buf.c_str() << endl;
}

Replace all mentions of NULL in src/data.cxx with nullptr

Then I compiled with

g++ -std=c++11 -o unidecode unidecode.cxx

to get the desired result.

The code looks like a fairly primitive port and could do with some improvements, especially to make it more idiomatic C++. Internally it uses a statically compiled conversion table, which can probably be adapted if the default mappings do not suit your needs.
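To illustrate the table-driven idea, here is a minimal sketch that does not use unidecode-cxx at all. The function name to_ascii and the tiny hand-written table are assumptions made up for this example; a real table would need to cover far more code points. It keeps ASCII characters, maps a handful of accented Latin letters to their base letter, and omits everything else, which is the behaviour the question asks for.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of unidecode-cxx): converts a wide string
// to ASCII using a small lookup table. Characters without a table entry
// are omitted.
std::string to_ascii(const std::wstring& input) {
    // Illustrative table only; it covers just a few accented Latin letters.
    static const std::unordered_map<wchar_t, char> table = {
        {L'\u00E0', 'a'}, {L'\u00E1', 'a'}, {L'\u00E2', 'a'}, {L'\u00E4', 'a'},
        {L'\u00E7', 'c'},
        {L'\u00E8', 'e'}, {L'\u00E9', 'e'}, {L'\u00EA', 'e'}, {L'\u00EB', 'e'},
        {L'\u00EC', 'i'}, {L'\u00ED', 'i'}, {L'\u00EE', 'i'}, {L'\u00EF', 'i'},
        {L'\u0129', 'i'}, {L'\u0131', 'i'},
        {L'\u00F1', 'n'},
        {L'\u00F2', 'o'}, {L'\u00F3', 'o'}, {L'\u00F4', 'o'}, {L'\u00F6', 'o'},
        {L'\u00F9', 'u'}, {L'\u00FA', 'u'}, {L'\u00FB', 'u'}, {L'\u00FC', 'u'},
    };

    std::string output;
    output.reserve(input.size());
    for (wchar_t wc : input) {
        if (static_cast<unsigned long>(wc) < 128u) {
            // Plain ASCII: copy unchanged.
            output += static_cast<char>(wc);
        } else {
            // Non-ASCII: use the table entry if one exists, otherwise omit.
            auto it = table.find(wc);
            if (it != table.end()) {
                output += it->second;
            }
        }
    }
    return output;
}
```

Called on the question's string, written with escapes as L"A na\u00EFve man called \u6668 was having pi\u00F1a colada and cr\u00E8me br\u00FBl\u00E9e.", this returns "A naive man called  was having pina colada and creme brulee." (with 晨 dropped, so two spaces remain after "called"). The iota ɩ (U+0269) has no table entry and is omitted as well.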
