Cláudio Pinto

Reputation: 63

C - How to avoid diacritic/accent sensitivity issues

I'm writing a small program for guessing the capitals of countries. Some of the capitals contain accents, cedillas, etc.

Since I have to compare the capital with the text the user typed, and I don't want an accent to break the comparison, I searched the internet for a way to do that.

I came across countless solutions for other programming languages, but only a couple of results about C.

None of them actually worked for me. However, I came to the conclusion that I'd have to use the wchar.h header to deal with those annoying characters.

I wrote this small piece of code (which replaces É with E) just to test the approach and, contrary to everything I have read and understood, it doesn't work: even printing the wide-character string doesn't show the diacritic characters. If it worked, I'm sure I could use the same approach in the capitals program, so I'd appreciate it if someone could tell me what's wrong.

#include<stdio.h>
#include<locale.h>
#include<wchar.h>

const wchar_t CAPITAL_ACCUTE_E = L'\u00C9';

int main()
{
    wchar_t wbuff[128];
    setlocale(LC_ALL,"");
    fputws(L"Say something: ", stdout);
    fgetws(wbuff, 128, stdin);
    int n;
    int len = wcslen(wbuff);
    for(n=0;n<len;n++)
        if(wbuff[n] == CAPITAL_ACCUTE_E)
            wbuff[n] = L'E';
    wprintf(L"%ls\n", wbuff);
    return 0;
}

Upvotes: 4

Views: 477

Answers (1)

a3f

Reputation: 8657

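An issue you overlooked is that É can be represented either as the single precomposed code point U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE) or as the two code points U+0045 (LATIN CAPITAL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).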

You need to account for this. It can be done by mapping both strings to NFD (Normalization Form D, canonical decomposition). After that, you can strip the combining characters so that only the base letter E is left, and then compare the results with strcmp as usual.
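As a rough illustration of the stripping step only, here is a minimal sketch using nothing but the standard library. It assumes the input is valid UTF-8 that has already been decomposed, and it only knows about the Combining Diacritical Marks block (U+0300 to U+036F). The names E_ACUTE_PRECOMPOSED, E_ACUTE_DECOMPOSED, is_combining_mark and strip_combining_marks are made up for this example and do not come from any library.

```c
#include <stddef.h>
#include <string.h>

/* Two UTF-8 encodings of the same visible letter É (hypothetical names). */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\x89";  /* U+00C9 */
static const char E_ACUTE_DECOMPOSED[]  = "E\xCC\x81"; /* U+0045 U+0301 */

/* Returns 1 if the two bytes at p encode a code point in U+0300..U+036F.
 * Assumption: p points at the start of a character in valid UTF-8. */
static int is_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC)
        return p[1] >= 0x80 && p[1] <= 0xBF;   /* U+0300..U+033F */
    if (p[0] == 0xCD)
        return p[1] >= 0x80 && p[1] <= 0xAF;   /* U+0340..U+036F */
    return 0;
}

/* Removes those combining marks in place and returns the new length.
 * Precomposed characters such as U+00C9 are left untouched, which is
 * why the string has to be decomposed (NFD) before this step. */
static size_t strip_combining_marks(char *s)
{
    unsigned char *src = (unsigned char *)s;
    unsigned char *dst = (unsigned char *)s;

    while (*src) {
        if (is_combining_mark(src))
            src += 2;          /* skip the 2-byte combining mark */
        else
            *dst++ = *src++;   /* keep every other byte as-is */
    }
    *dst = '\0';
    return (size_t)(dst - (unsigned char *)s);
}
```

Comparing E_ACUTE_PRECOMPOSED and E_ACUTE_DECOMPOSED directly with strcmp reports a difference, even though both display as É. Calling strip_combining_marks on a writable copy of the decomposed form leaves a plain "E", while the precomposed form passes through unchanged, so a real normalization library is still needed for the decomposition itself.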

Assuming your input is UTF-8 encoded, here is how you could do it with utf8proc:

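#include <utf8proc.h>

/* input is assumed to be a NUL-terminated, UTF-8 encoded char string */
utf8proc_uint8_t *output;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
                           UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                           UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
                           UTF8PROC_CASEFOLD
                          );
/* len < 0 signals an error; otherwise output is malloc'd and must be free()d */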

This would turn all of É (precomposed, U+00C9), É (decomposed, E followed by U+0301) and E into a plain e.

Upvotes: 2
