hazrmard

Reputation: 3661

wchar_t variables only store half of an Urdu character in C

I am trying to read and manipulate Urdu text from files. However, it seems that a character is not read whole into a wchar_t variable. Here is my code, which reads the text and prints each character on a new line:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
}

And here is my sample text:

میرا نام ابراھیم ھے۔

میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔

However there seem to be twice as many characters printed as there are letters in the text. I understand that wide or multi-byte characters use multiple bytes, but I thought that the wchar_t type would store all the bytes corresponding to a letter in the alphabet together.

How can I read the text so that at any one time, I have a whole character stored in a variable?

Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64 bit
Text file encoding: UTF-8

This is how my text looks in hex format:

d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 0a d9 85 db 8c da ba 20 d9 88 db 8c d9 86 da 88 d8 b1 d8 a8 d9 84 d9 b9 20 db 8c d9 88 d9 86 db 8c d9 88 d8 b1 d8 b3 d9 b9 db 8c 20 d9 85 db 8c da ba 20 d9 be da 91 da be d8 aa d8 a7 20 da be d9 88 da ba db 94 0a

Upvotes: 0

Views: 145

Answers (2)

hazrmard

Reputation: 3661

UTF-8 is an encoding of Unicode that uses 1 to 4 bytes per character. I was able to store each Unicode character in a uint32_t (or u_int32_t on some UNIX platforms) variable. The library I used is (utf8.h | utf8.c). It provides conversion and manipulation functions for UTF-8 strings.

So if a file is n bytes of UTF-8, it contains at most n Unicode characters. That means 4*n bytes of memory (4 bytes per u_int32_t variable) are enough to store the contents of the file.

#include <stdlib.h>   // calloc, realloc
#include "utf8.h"

// here read contents of file into a char* => buff
// keep count of # of bytes read => N

// calloc initializes to 0, so unused trailing elements stay 0
u_int32_t *ubuff = calloc(N, sizeof(u_int32_t));
u8_toucs(ubuff, N, buff, N);

// each element of ubuff is now a 4-byte integer holding
// one Unicode character
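
To make the "1 to 4 bytes per character" point concrete, here is a minimal sketch of decoding a single UTF-8 sequence into a 32-bit code point by hand. The function name decode_utf8 is hypothetical and is not part of utf8.h; the sketch is simplified and does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8.h): decodes one UTF-8 sequence
 * starting at s, reading at most len bytes, and stores the code point
 * in *cp. Returns the number of bytes consumed, or 0 if the input is
 * truncated or malformed. Simplified: overlong forms are not rejected. */
size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c;

    if (len == 0) return 0;

    /* The lead byte tells us how long the sequence is. */
    if (s[0] < 0x80)                { c = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (n > len) return 0;  /* sequence is cut off */

    /* Each continuation byte (10xxxxxx) contributes 6 payload bits. */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    *cp = c;
    return n;
}
```

With the sample text from the question, the first two bytes d9 85 decode to U+0645 (the letter م), which is why every Urdu letter in the file occupies two bytes.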

Of course, the file will contain fewer than n Unicode characters whenever multiple bytes represent a single character, so the 4*n allocation is larger than needed. In that case the tail of ubuff will be 0 (the Unicode null character). So I simply scan the array up to the first 0 and shrink the allocation:

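u_int32_t *original = ubuff;
size_t sz = 0;
while (*ubuff != 0) {   // count characters up to the first 0
    ubuff++;
    sz++;
}
// shrink to sz characters plus the terminating 0
ubuff = realloc(original, sizeof(*original) * (sz + 1));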

Note: If you get type errors about u_int32_t, include <stdint.h> and put typedef uint32_t u_int32_t; at the beginning of your code.
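
As an alternative to over-allocating and shrinking afterwards, the number of characters can be counted before allocating. The sketch below counts every byte that is not a UTF-8 continuation byte. The name utf8_count is hypothetical (it is not from utf8.h), and the function assumes the buffer holds well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of Unicode code points in a
 * UTF-8 buffer of nbytes bytes. Continuation bytes have the bit pattern
 * 10xxxxxx, so every other byte starts a new code point.
 * Assumes the input is well-formed UTF-8. */
size_t utf8_count(const char *buf, size_t nbytes)
{
    size_t count = 0, i;

    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The result plus one (for the terminating 0) could then be used as the element count for the allocation, assuming the conversion function takes the destination capacity in elements as in the call above.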

Upvotes: 0

n. m. could be an AI

Reputation: 119847

Windows support for Unicode is mostly proprietary and it is impossible to write portable software that uses UTF-8 and works on Windows using Windows native libraries. If you are willing to consider non-portable solutions, here is one:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>
#include <io.h>      // _setmode

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");

    // Next line is needed to output wchar_t data to the console. Note that 
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening with a text editor
    // should show Urdu characters.

    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea. Using wprintf throughout. (Not Windows-specific)

    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. It is necessary
    // to use wint_t to keep a result of fgetwc, or to print with
    // %lc. (Not Windows-specific)

    wint_t c;

    // Next line has a non-standard parameter passed to fopen, ccs=...
    // This is a Windows way to support different file encodings.
    // There are no UTF-8 locales in Windows. 

    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");

    // Only read if the file was opened successfully.
    if (f != NULL) {
        while ((c = fgetwc(f)) != WEOF) {
            wprintf(L"%lc", c);
        }
        fclose(f);
    }
}

On the other hand, with a C library that provides UTF-8 locales (e.g. glibc on Linux, or Cygwin on Windows), these Windows extensions are not needed because the library handles the conversion internally.
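
For comparison, here is a rough sketch of how the same reading loop might look on a system that does provide UTF-8 locales. Both function names are hypothetical, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions that may not be installed on every system.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: tries a few common UTF-8 locale names.
 * Returns 1 if one of them could be selected, 0 otherwise.
 * The names are assumptions; check `locale -a` on the target system. */
int use_utf8_locale(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: reads f one wide character at a time and
 * returns how many were read. The locale must be set before the
 * first read, and no byte-oriented I/O may have been done on f. */
long count_wide_chars(FILE *f)
{
    long n = 0;
    wint_t c;   /* wint_t rather than wchar_t, so WEOF is representable */

    while ((c = fgetwc(f)) != WEOF)
        n++;
    return n;
}
```

In such an environment a plain fopen("urdu.txt", "r") is enough; the ccs= parameter and _setmode call are not needed.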

Upvotes: 1
