diogocarmo

Reputation: 970

UTF-8 -> ASCII in C language

I have a simple question that I can't find answered anywhere on the internet: how can I convert UTF-8 to ASCII (mostly accented characters to the same character without the accent) in C using only the standard library? I found solutions for most other languages, but not for C in particular.

Thanks!

EDIT: Some of the kind people who commented made me double-check what I needed, and I had exaggerated. I only need an idea of how to write a function that does: char with accent -> char without accent. :)

Upvotes: 6

Views: 9762

Answers (5)

R.. GitHub STOP HELPING ICE

Reputation: 215567

Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and probably is stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...

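/* Copies src to dest, replacing bytes 128-255 with ASCII look-alikes.
   dest must have room for strlen(src) + 1 bytes. */
void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* lut[i] holds the replacement for byte value 128 + i */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++); /* the terminating NUL is copied too */
}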

Upvotes: 2

Hans Passant

Reputation: 942328

Every decent Unicode support library (not the standard library, of course) can decompose a string into normalization form NFD or NFKD, which separates the diacritics from their base letters and gives you a shot at filtering them out. I'm not sure this is worth pursuing: the result is just gibberish to a native-language reader, and not every letter is decomposable. In other words, junk with question marks.
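To illustrate the filtering step only, here is a minimal sketch in standard C. It assumes the input is UTF-8 that some library has already put into decomposed form, and it removes only the Combining Diacritical Marks block (U+0300 to U+036F). The function name strip_combining_marks is hypothetical, not part of any library.

```c
/* Removes combining diacritical marks (U+0300..U+036F) in place.
   ASSUMPTION: s is valid UTF-8 that has already been decomposed
   by a Unicode library; this function does no decomposition itself. */
void strip_combining_marks(char *s)
{
    const unsigned char *src = (const unsigned char *)s;
    unsigned char *dst = (unsigned char *)s;

    while (*src) {
        /* U+0300..U+033F encode as CC 80..CC BF,
           U+0340..U+036F encode as CD 80..CD AF. */
        int is_mark =
            (src[0] == 0xCC && src[1] >= 0x80 && src[1] <= 0xBF) ||
            (src[0] == 0xCD && src[1] >= 0x80 && src[1] <= 0xAF);

        if (is_mark)
            src += 2;            /* skip the two-byte mark */
        else
            *dst++ = *src++;     /* keep every other byte as-is */
    }
    *dst = '\0';
}
```

Letters with no decomposition and any other non-ASCII characters pass through untouched, so the output is not guaranteed to be pure ASCII.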

Upvotes: 2

Nemanja Trifunovic

Reputation: 24561

In general, you can't. UTF-8 covers much more than accented characters.

Upvotes: 4

zoul

Reputation: 104125

Take a look at libiconv. Even if you insist on doing it without libraries, you might find an inspiration there.

Upvotes: 5

Billy ONeal

Reputation: 106609

There's no built-in way of doing that. UTF-8 and ASCII are identical for code points 0-127; they differ only for characters above that range, which cannot be represented in ASCII anyway.

If you have a specific mapping you want (such as a with accent -> a), then you should probably just handle that as a string replace operation.
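A rough sketch of what such a replace operation might look like in standard C, assuming the input is precomposed UTF-8. The table accent_table and the function replace_accents are hypothetical names, and the table lists only a handful of Latin-1 letters as an example; a real mapping would need more entries.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping: UTF-8 byte sequence -> ASCII replacement. */
struct accent_map {
    const char *utf8;
    char ascii;
};

static const struct accent_map accent_table[] = {
    { "\xC3\xA0", 'a' }, /* U+00E0 a with grave */
    { "\xC3\xA1", 'a' }, /* U+00E1 a with acute */
    { "\xC3\xA2", 'a' }, /* U+00E2 a with circumflex */
    { "\xC3\xA3", 'a' }, /* U+00E3 a with tilde */
    { "\xC3\xA7", 'c' }, /* U+00E7 c with cedilla */
    { "\xC3\xA9", 'e' }, /* U+00E9 e with acute */
    { "\xC3\xAA", 'e' }, /* U+00EA e with circumflex */
    { "\xC3\xAD", 'i' }, /* U+00ED i with acute */
    { "\xC3\xB3", 'o' }, /* U+00F3 o with acute */
    { "\xC3\xB5", 'o' }, /* U+00F5 o with tilde */
    { "\xC3\xBA", 'u' }, /* U+00FA u with acute */
};

/* Copies src to dest, replacing each sequence found in accent_table.
   dest needs strlen(src) + 1 bytes, because the output never grows. */
void replace_accents(char *dest, const char *src)
{
    const size_t n = sizeof accent_table / sizeof accent_table[0];

    while (*src) {
        size_t i;
        for (i = 0; i < n; i++) {
            size_t len = strlen(accent_table[i].utf8);
            if (strncmp(src, accent_table[i].utf8, len) == 0) {
                *dest++ = accent_table[i].ascii;
                src += len;
                break;
            }
        }
        if (i == n)              /* no table entry matched */
            *dest++ = *src++;    /* copy the byte unchanged */
    }
    *dest = '\0';
}
```

Sequences that are not in the table are copied through byte for byte, so the caller still has to decide what to do with any remaining non-ASCII bytes.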

Upvotes: 2
