Reputation: 13693
I'm looking for a way of converting a wstring into a plain string containing only ASCII characters. Any character that isn't present in ASCII (0-127) should be converted to the closest ASCII character. If there is no similar ASCII character, the character should be omitted.
To illustrate, let's assume the following wide string:
wstring text(L"A naïve man called 晨 was having piña colada and crème brûlée.");
The converted version I'm looking for is this (notice the absence of diacritics):
string("A naive man called was having pina colada and creme brulee.")
Edit:
Regarding the purpose: I'm writing an application that analyzes English texts. The input files are UTF-8 and may contain special characters. A part of my application uses a library written in C that only understands ASCII. So I need a way of "dumbing down" the text to ASCII without losing too much information.
Regarding the precise requirements: Any character that is a diacritic version of an ASCII character should be converted to that ASCII character; all other characters should be omitted. So ı, ĩ, and î should become i because they are all versions of the small Latin letter i. The character ɩ (iota), on the other hand, while visually similar, is not a version of the small Latin letter i and should thus be omitted.
Upvotes: 3
Views: 2578
Reputation: 5675
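wstring is a string of wchar_t, a character type that may be 2 or 4 bytes wide depending on the platform.
UTF-8, meanwhile, is a variable-length encoding that uses 1 to 4 bytes per code point. So your request is not fully consistent: the input files are UTF-8, but the string you want to convert is a wstring.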
Assuming you've figured out exactly how the data is stored in your strings, I'd suggest checking out the ICU library for the further conversions.
You can normalize your strings and then remove all diacritics, but you'll still be left with Greek, Cyrillic, and other non-Latin characters. Or you can use ICU's transliteration feature, which is closer to what you're looking for.
mindriot's solution is more concise, but you still need to convert your wstring to a proper UTF-8 sequence first.
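For the wstring to UTF-8 step, here is a minimal standard-library-only sketch. The names wide_to_utf8 and append_utf8 are made up for this illustration and are not part of ICU or unidecode-cxx. It assumes the wstring holds UTF-16 when wchar_t is 2 bytes and UTF-32 when it is 4 bytes, and it does no validation of malformed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as a 1- to 4-byte UTF-8 sequence.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert a wide string to UTF-8.
// Assumption: UTF-16 content if wchar_t is 2 bytes, UTF-32 if it is 4 bytes.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // With a 16-bit wchar_t, a high surrogate followed by a low
        // surrogate encodes a single code point above U+FFFF.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string can then be handed to a UTF-8 based routine such as the unidecode function from the other answer. If the input files are already UTF-8, it may be simpler to read them into a std::string directly and skip the wstring altogether.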
Upvotes: 0
Reputation: 5678
On GitHub, there is unidecode-cxx, a (somewhat unfinished) C++ port of node-unidecode, which is in turn a JavaScript port of Perl's Text::Unidecode. The C++ version is a bit rough around the edges, but the example in src/unidecode.cxx can be modified to convert your example string,
A naïve man called 晨 was having piña colada and crème brûlée.
as follows:
A naive man called Chen was having pina colada and creme brulee.
In order to get the code to compile without Gyp (something I've never used and haven't had the time to figure out just now), I had to modify the code somewhat (quick and dirty):
Add #include <iostream> to src/unidecode.cxx, and add the following main routine:
int main() {
    string output_buf;
    // The literal is passed through as raw bytes, so this source file must be saved as UTF-8.
    string input_buf = "A naïve man called 晨 was having piña colada and crème brûlée.";
    unidecode(&input_buf, &output_buf);
    cout << output_buf.c_str() << endl;
}
Replace all mentions of NULL in src/data.cxx with nullptr.
Then I compiled with
g++ -std=c++11 -o unidecode unidecode.cxx
to get the desired result.
The code looks like a fairly primitive port and could do with some improvements, especially towards more idiomatic C++. Internally it uses a statically compiled conversion table, which can probably be adapted if its default output does not suit your needs (for example, if you want 晨 dropped rather than transliterated to "Chen").
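To give an idea of what a table adapted to the question's stricter rule might look like, here is a rough sketch that does not use unidecode-cxx at all. The function names fold_table and to_ascii and the handful of table entries are hypothetical and chosen only to cover the example sentence; a real table would need an entry for every diacritic variant you want to keep.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, deliberately tiny table: diacritic variant -> ASCII base letter.
// Only the characters needed for the example are listed.
static const std::unordered_map<wchar_t, char>& fold_table() {
    static const std::unordered_map<wchar_t, char> table = {
        {L'\u00E8', 'e'},  // e with grave
        {L'\u00E9', 'e'},  // e with acute
        {L'\u00EE', 'i'},  // i with circumflex
        {L'\u00EF', 'i'},  // i with diaeresis
        {L'\u0129', 'i'},  // i with tilde
        {L'\u0131', 'i'},  // dotless i
        {L'\u00F1', 'n'},  // n with tilde
        {L'\u00FB', 'u'},  // u with circumflex
    };
    return table;
}

// Keep ASCII as-is, map listed characters to their base letter,
// and silently drop everything else.
std::string to_ascii(const std::wstring& in) {
    std::string out;
    for (wchar_t wc : in) {
        if (static_cast<unsigned long>(wc) < 0x80) {
            out += static_cast<char>(wc);
            continue;
        }
        auto it = fold_table().find(wc);
        if (it != fold_table().end()) {
            out += it->second;
        }
        // Anything not in the table (CJK, Greek, iota, ...) is omitted.
    }
    return out;
}
```

Note that dropping a character does not drop the spaces around it, so the example sentence comes out with two consecutive spaces where 晨 used to be; collapsing those would be a separate step.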
Upvotes: 4