Reputation: 970
I have a simple question whose answer I can't find anywhere on the internet: how can I convert UTF-8 to ASCII (mostly mapping accented characters to the same character without the accent) in C, using only the standard library? I found solutions for most other languages, but not for C in particular.
Thanks!
EDIT: Some of the kind people who commented made me double-check what I needed, and it turns out I exaggerated. I only need an idea of how to write a function that does: char with accent -> char without accent. :)
Upvotes: 6
Views: 9762
Reputation: 215567
Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and probably is stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...
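void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* One entry per byte value 128..255, indexed by (byte - 128). */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        /* ASCII bytes pass through; high bytes go through the table. */
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++); /* the terminating 0 is copied before the loop ends */
}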
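As a rough illustration of what the "mapping here" part might contain, here is a minimal sketch that assumes the input is Windows-1252 (or ISO-8859-1) and fills in only a handful of entries. The names `fold_cp1252` and `strip_accents_cp1252` are made up for this example, and bytes with no entry are turned into '?', which is just one possible choice.

```c
/* Hypothetical helper, not from the original answer: fold one
   Windows-1252 / ISO-8859-1 byte to a similar-looking ASCII byte. */
static unsigned char fold_cp1252(unsigned char c)
{
    /* Index is (byte - 128). Entries left out default to 0. */
    static const unsigned char lut[128] = {
        [0xC0 - 128] = 'A', [0xC1 - 128] = 'A', [0xC7 - 128] = 'C',
        [0xC8 - 128] = 'E', [0xC9 - 128] = 'E', [0xD1 - 128] = 'N',
        [0xE0 - 128] = 'a', [0xE1 - 128] = 'a', [0xE7 - 128] = 'c',
        [0xE8 - 128] = 'e', [0xE9 - 128] = 'e', [0xF1 - 128] = 'n',
        [0xFC - 128] = 'u'
    };
    if (c < 128)
        return c;                       /* plain ASCII, including the 0 terminator */
    return lut[c - 128] ? lut[c - 128] : '?';   /* unmapped bytes become '?' */
}

/* dest must have room for the whole source string plus its terminator. */
void strip_accents_cp1252(unsigned char *dest, const unsigned char *src)
{
    do {
        *dest++ = fold_cp1252(*src);
    } while (*src++);
}
```

The designated initializers need C99 or later. Note that this only works on single-byte encodings; feeding it UTF-8 would treat each byte of a multi-byte sequence as a separate character.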
Upvotes: 2
Reputation: 942328
Every decent Unicode support library (not the standard library, of course) has a way to decompose a string into normalization form D or KD (NFD or NFKD), which separates the diacritics from the base letters and gives you a shot at filtering them out. I'm not so sure this is worth pursuing: the result is just gibberish to a native reader of the language, and not every letter is decomposable. In other words, junk with question marks.
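To make the "filtering them out" step concrete, here is a minimal sketch that assumes the string has already been decomposed by some Unicode library and is valid UTF-8. It drops only the Combining Diacritical Marks block (U+0300 to U+036F); other combining ranges exist and are ignored here. The function name `strip_combining_marks` is hypothetical.

```c
/* Copies src to dest, dropping code points U+0300..U+036F.
   Assumes src is valid, already-decomposed UTF-8 and that dest has
   room for at least as many bytes as src, terminator included. */
void strip_combining_marks(char *dest, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;

    while (*s) {
        /* In UTF-8, U+0300..U+033F encode as CC 80..CC BF and
           U+0340..U+036F encode as CD 80..CD AF. */
        if ((s[0] == 0xCC && s[1] >= 0x80 && s[1] <= 0xBF) ||
            (s[0] == 0xCD && s[1] >= 0x80 && s[1] <= 0xAF)) {
            s += 2;                 /* skip the combining mark */
            continue;
        }
        *dest++ = (char)*s++;       /* copy everything else unchanged */
    }
    *dest = '\0';
}
```

This does nothing for letters that have no decomposition (for example "ø" or "ß"); they pass through as non-ASCII bytes and would still need a mapping table or a replacement character.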
Upvotes: 2
Reputation: 24561
In general, you can't. UTF-8 covers much more than accented characters.
Upvotes: 4
Reputation: 104125
Take a look at libiconv. Even if you insist on doing it without libraries, you might find some inspiration there.
Upvotes: 5
Reputation: 106609
There's no built-in way of doing that. There's really little difference between UTF-8 and ASCII unless you're talking about characters above code point 127, which cannot be represented in ASCII anyway.
If you have a specific mapping you want (such as a with accent -> a), then you should probably just handle that as a string replace operation.
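One way such a replace operation could look, as a sketch only: decode each two-byte UTF-8 sequence to its code point, look it up in a small table, and emit the ASCII substitute. The table contents, the names `fold_table`, `fold_lookup`, and `utf8_fold`, and the choice of '?' for anything unmapped are all assumptions made for this example. It also assumes precomposed characters (such as U+00E9 for "é") rather than decomposed ones.

```c
#include <stddef.h>

struct fold { unsigned cp; char ascii; };

/* Hypothetical, deliberately incomplete mapping table. */
static const struct fold fold_table[] = {
    {0x00E0, 'a'}, {0x00E1, 'a'}, {0x00E2, 'a'}, {0x00E7, 'c'},
    {0x00E8, 'e'}, {0x00E9, 'e'}, {0x00EA, 'e'}, {0x00EF, 'i'},
    {0x00F1, 'n'}, {0x00F4, 'o'}, {0x00FC, 'u'}
};

static char fold_lookup(unsigned cp)
{
    size_t i;
    for (i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].cp == cp)
            return fold_table[i].ascii;
    return '?';                          /* no mapping known */
}

/* dest needs room for strlen(src) + 1 bytes; output never grows. */
void utf8_fold(char *dest, const char *src)
{
    const unsigned char *s = (const unsigned char *)src;

    while (*s) {
        if (*s < 0x80) {                 /* ASCII: copy as-is */
            *dest++ = (char)*s++;
        } else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            /* Two-byte sequence: 110xxxxx 10xxxxxx */
            unsigned cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
            *dest++ = fold_lookup(cp);
            s += 2;
        } else {
            /* Longer or malformed sequence: emit one '?' and skip
               the lead byte plus any continuation bytes. */
            *dest++ = '?';
            s++;
            while ((*s & 0xC0) == 0x80)
                s++;
        }
    }
    *dest = '\0';
}
```

A linear search is fine for a table this small; a sorted table with `bsearch` from `<stdlib.h>` would be a reasonable next step if the mapping grows.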
Upvotes: 2