Reading unicode characters from file in C

Question

I need to read Unicode characters from a file. The only thing I need to do with them is extract their Unicode code point numbers. I am running Windows XP and using Code::Blocks with MinGW.

I am doing something like this

    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif

    #include <stdio.h>
    #include <wchar.h>
    int main()
    {
        wchar_t *filename=L"testunicode.txt";
        FILE *infile;
        infile=_wfopen(filename,L"r");
        wchar_t result=fgetwc(infile);
        wprintf(L"%d",result); // Print the value to verify the Unicode number of the character stored in the file
        return 0;
    }

I am getting result as 255 all the time.

testunicode.txt is stored in Encoding=Unicode (Created via notepad)

The final task is to read from a file that can contain characters from any language. But wchar_t is only 2 bytes here, so will it be able to hold the Unicode number of every possible character?

Need your help...

Thanks everyone for your reply.

Now I have changed the code.

    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif

    #include <stdio.h>
    #include <wchar.h>
    int main()
    {
        wchar_t *filename=L"testunicode.txt";
        FILE *infile;
        infile=_wfopen(filename,L"r");
        wchar_t  b[2];
        fread(b,2,2,infile); // Read two 2-byte units: b[0] is the byte-order mark, b[1] is the first character
        wprintf(L"%d",b[1]);
        return 0;
    }

It prints the correct UTF-16 code. The project where this will be used needs to read characters from many of the world's languages. So will UTF-16 suffice, or should we change the encoding of the stored files to UTF-32? Also, wchar_t is 2 bytes here, and UTF-32 needs a 4-byte data type. How can I accomplish that?

Thanks again for your reply........

Frédéric Hamidi · Accepted Answer

Well, the code in your question only reads the first character of your file, so you will have to implement some kind of looping construct in order to process the whole contents of that file.

Now, fgetwc() is returning 255 (0xFF) for three reasons:

You're not taking the byte-order mark of the file into account, so you end up reading it instead of the actual file contents,
You're not specifying a translation mode flag in the mode argument to _wfopen(), so it defaults to text and fgetwc() accordingly tries to read a multibyte character instead of a wide character,
0xFF (the first byte of a little-endian UTF-16 BOM) is probably not a lead byte in your program's current code page, so fgetwc() returns it without further processing.
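
The three points above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the file is little-endian UTF-16 (which is what Notepad writes when you pick "Unicode"), and the helper names read_utf16le_unit, read_code_point and print_code_points are made up for illustration. It opens the file in binary mode with the standard fopen() (on Windows, _wfopen(filename, L"rb") would be the wide-character equivalent), skips a leading byte-order mark, loops until end of file, and combines surrogate pairs so that characters outside the 16-bit range also come out as a single number. uint32_t from <stdint.h> serves as the 4-byte type.

```c
#include <stdint.h>
#include <stdio.h>

/* Read one little-endian UTF-16 code unit (2 bytes).
   Returns 1 on success, 0 at end of file or on a truncated unit. */
int read_utf16le_unit(FILE *in, uint16_t *unit)
{
    unsigned char bytes[2];
    if (fread(bytes, 1, 2, in) != 2)
        return 0;
    *unit = (uint16_t)(bytes[0] | (bytes[1] << 8));
    return 1;
}

/* Read the next code point, combining a surrogate pair into one value.
   Unpaired surrogates are reported as U+FFFD (the replacement character).
   Returns 1 on success, 0 at end of input. */
int read_code_point(FILE *in, uint32_t *cp)
{
    uint16_t hi, lo;
    if (!read_utf16le_unit(in, &hi))
        return 0;
    if (hi >= 0xD800 && hi <= 0xDBFF) {            /* high surrogate */
        if (!read_utf16le_unit(in, &lo)) {
            *cp = 0xFFFD;                          /* file ended mid-pair */
        } else if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
        } else {
            fseek(in, -2L, SEEK_CUR);              /* not a low surrogate: re-read it next time */
            *cp = 0xFFFD;
        }
    } else if (hi >= 0xDC00 && hi <= 0xDFFF) {
        *cp = 0xFFFD;                              /* low surrogate with no high one before it */
    } else {
        *cp = hi;                                  /* ordinary 16-bit character */
    }
    return 1;
}

/* Print every code point of a UTF-16LE file, one per line, in hex and decimal.
   A byte-order mark at the very start is skipped.
   Returns the number of code points printed, or -1 if the file cannot be opened. */
long print_code_points(const char *path)
{
    FILE *in = fopen(path, "rb");                  /* binary mode: no text translation */
    uint32_t cp;
    long count = 0;
    int first = 1;

    if (in == NULL)
        return -1;
    while (read_code_point(in, &cp)) {
        int is_bom = first && cp == 0xFEFF;
        first = 0;
        if (is_bom)
            continue;
        printf("U+%04lX (%lu)\n", (unsigned long)cp, (unsigned long)cp);
        count++;
    }
    fclose(in);
    return count;
}
```

Calling print_code_points("testunicode.txt") from main() should then print one line per character in the file. Since UTF-16 can encode every Unicode code point through surrogate pairs, the stored files should not need converting to UTF-32 for this approach to work.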
