João Pedro Voga

Reputation: 91

How do I read from a file in C if the file has accented characters such as 'á'?

Another day, another problem with strings in C. Let's say I have a text file named fileR.txt and I want to print its contents. The file goes like this:

Letter á
Letter b
Letter c
Letter ê

I would like to read it and show it on the screen, so I tried the following code:

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>

int main()
{
    FILE *pF;
    char line[512]; // Current line

    setlocale(LC_ALL, "");
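    pF = fopen("Aulas\\source\\fileR.txt", "r");
    if (pF == NULL)
    {
        perror("fopen"); // Could not open the file
        return 1;
    }

    // Loop on fgets itself: testing feof() first processes the last line twice
    while (fgets(line, sizeof line, pF) != NULL)
    {
        fputs(line, stdout);
    }

    fclose(pF);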

    return 0;
}

And the output was:

Letter á
Letter b
Letter c
Letter ê

I then attempted to use wchar_t to do it:

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE *pF;
    wchar_t line[512]; // Current line

    setlocale(LC_ALL, "");
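    pF = fopen("Aulas\\source\\fileR.txt", "r");
    if (pF == NULL)
    {
        perror("fopen"); // Could not open the file
        return 1;
    }

    // Loop on fgetws itself: testing feof() first processes the last line twice
    while (fgetws(line, 512, pF) != NULL)
    {
        fputws(line, stdout);
    }

    fclose(pF);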

    return 0;
}

The output was even worse:

Letter ÃLetter b
Letter c
Letter Ã

I have seen people suggest using an unsigned char array, but that simply results in an error, because the stdio input and output functions take plain char arrays. Even if I were to write my own function to print an array of unsigned chars, I would not know how to read something from a file as unsigned.

So, how can I read and print a file with accented characters in C?

Upvotes: 0

Views: 355

Answers (1)

Dweeberly

Reputation: 4777

The problem you are having is not in your code, it's in your expectations. A text character is really just a value that has been associated with some form of glyph (symbol). There are different schemes for making this association, generally referred to as encodings. One early and still common encoding is ASCII (American Standard Code for Information Interchange). As the name implies, it is centered on American English. ASCII itself is a 7-bit encoding (128 values); later 8-bit encodings, often loosely called "extended ASCII", used the extra 128 values for other symbols, and different encodings were developed for different languages. Having many incompatible encodings was far from ideal.

The Unicode standard was developed to address this. It is a relatively complicated standard designed to include any symbol one might want to encode. Unicode text can be stored in several encodings, for example UTF-7, UTF-8, UTF-16 and UTF-32, which differ in how many bytes they use per character. Because of this, there is not necessarily a one-to-one relationship between a byte and a character.

So different character representations have different values, and those values can take more than a single byte. The next problem is that, to display the associated glyphs, you need a system that correctly maps each value to its glyph and is able to display that glyph. A lot of "terminal" applications don't support Unicode by default. They use ASCII or an 8-bit extended ASCII encoding, and it looks like that is what you may be using. The terminal is assuming that each byte it needs to display corresponds to a single character (which, as discussed, isn't necessarily true in Unicode).

One thing to try is to redirect your output to a file and open that file in a Unicode-aware editor (such as Notepad++), viewing it as UTF-8, for example. You can also hex dump the input file to see how it has been encoded. Unicode files are sometimes written with a BOM (Byte Order Mark) at the start, which helps identify the Unicode encoding and byte order in use.

Upvotes: 1
