Reputation: 63
I'm writing a tiny program for guessing the capitals of countries. Some of the capitals have accents, cedillas, etc.
Since I have to compare the capital with the text the user typed, and I don't want an accent to mess up the comparison, I went digging around the internet for a way to do that.
I came across countless solutions for other programming languages, but only a couple of results for C.
None of them worked for me. I did, however, come to the conclusion that I'd have to use the wchar.h library to deal with those annoying characters.
I wrote this small test (which replaces É with E) to check the approach. Contrary to everything I've read and understood, it doesn't work: even printing the wide-character string doesn't show the diacritic characters. If it worked, I'm sure I could apply the same method to the capitals program, so I'd appreciate it if someone could tell me what's wrong.
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

const wchar_t CAPITAL_ACCUTE_E = L'\u00C9';

int main()
{
    wchar_t wbuff[128];

    setlocale(LC_ALL, "");
    fputws(L"Say something: ", stdout);
    fgetws(wbuff, 128, stdin);

    int n;
    int len = wcslen(wbuff);
    for (n = 0; n < len; n++)
        if (wbuff[n] == CAPITAL_ACCUTE_E)
            wbuff[n] = L'E';

    wprintf(L"%ls\n", wbuff);
    return 0;
}
Upvotes: 4
Views: 477
Reputation: 8657
An issue you overlooked is that É can be represented as either
- É: LATIN CAPITAL LETTER E WITH ACUTE, codepoint U+00C9 (c3 89 in UTF-8), or
- É: LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT, codepoints U+0045 U+0301 (45 cc 81 in UTF-8)
You need to account for this. This can be done by mapping both strings to NFD (Normalization Form D, canonical decomposition). After that, you can strip away the decomposed combining characters and be left with the E, which you can then strcmp as usual.
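To see why a byte-wise comparison alone is not enough, here is a minimal sketch that spells out both encodings listed above as escaped bytes and compares them. The names `E_ACUTE_NFC`, `E_ACUTE_NFD` and `same_bytes` are made up for this illustration.

```c
#include <string.h>

/* U+00C9, precomposed form: bytes c3 89 */
static const char E_ACUTE_NFC[] = "\xc3\x89";
/* U+0045 U+0301, decomposed form: bytes 45 cc 81 */
static const char E_ACUTE_NFD[] = "E\xcc\x81";

/* Hypothetical helper: returns 1 if the two strings are byte-identical. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Calling `same_bytes(E_ACUTE_NFC, E_ACUTE_NFD)` returns 0 even though both strings render as the same letter, which is why both sides should be normalized before comparing.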
Assuming you have UTF-8 encoded input, here is how you could do it with utf8proc:
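#include <stdlib.h>   /* free */
#include <utf8proc.h>

/* input: NUL-terminated UTF-8 string; the length argument is ignored with UTF8PROC_NULLTERM */
utf8proc_uint8_t *output = NULL;
utf8proc_ssize_t len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
    UTF8PROC_NULLTERM | UTF8PROC_STABLE |
    UTF8PROC_STRIPMARK | UTF8PROC_DECOMPOSE |
    UTF8PROC_CASEFOLD
);
/* len < 0 is an error code (see utf8proc_errmsg); otherwise free(output) when done */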
This would turn all of É, É and E into a plain e.
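If pulling in a library is not an option, the same decompose, strip and case-fold idea can be approximated by hand for a small character set. The following is only a rough sketch, not a substitute for utf8proc: it assumes UTF-8 input, handles just the Latin-1 letters U+00C0 to U+00FF and the combining marks U+0300 to U+036F, lowercases ASCII only, and copies everything else through unchanged. The names `LATIN1_BASE`, `fold_simple` and `same_capital` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Base letters for U+00C0..U+00FF; '.' marks code points left unchanged. */
static const char LATIN1_BASE[] =
    "aaaaaa.ceeeeiiii" ".nooooo..uuuuy.."
    "aaaaaa.ceeeeiiii" ".nooooo..uuuuy.y";

/* Hypothetical helper: writes a folded copy of UTF-8 `in` to `out`.
 * Returns 0 on success, -1 if `out` is too small. */
int fold_simple(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t o = 0;

    if (outsz == 0)
        return -1;
    while (*p) {
        if (o + 2 >= outsz)              /* need room for 2 bytes + NUL */
            return -1;
        if (*p < 0x80) {                 /* ASCII: lowercase it */
            out[o++] = (char)tolower(*p++);
        } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            /* two-byte sequence: decode the code point */
            unsigned cp = ((unsigned)(*p & 0x1F) << 6) | (p[1] & 0x3F);
            if (cp >= 0x0300 && cp <= 0x036F) {
                /* combining diacritical mark: drop it */
            } else if (cp >= 0xC0 && cp <= 0xFF &&
                       LATIN1_BASE[cp - 0xC0] != '.') {
                out[o++] = LATIN1_BASE[cp - 0xC0];   /* precomposed letter */
            } else {
                out[o++] = (char)p[0];   /* unknown: copy both bytes */
                out[o++] = (char)p[1];
            }
            p += 2;
        } else {
            out[o++] = (char)*p++;       /* anything else: copy through */
        }
    }
    out[o] = '\0';
    return 0;
}

/* Returns 1 if `a` and `b` are equal after folding, 0 otherwise
 * (including when either string does not fit the local buffers). */
int same_capital(const char *a, const char *b)
{
    char fa[256], fb[256];
    if (fold_simple(a, fa, sizeof fa) != 0 ||
        fold_simple(b, fb, sizeof fb) != 0)
        return 0;
    return strcmp(fa, fb) == 0;
}
```

With this, `same_capital("Bras\xc3\xad" "lia", "BRASILIA")` reports a match. Letters outside the table (for example ș in Chișinău) are not folded, which is where a full normalization library earns its keep.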
Upvotes: 2