Reputation: 3661
I am trying to read and manipulate Urdu text from files. However it seems that a character is not read whole into the wchar_t
variable. Here is my code that reads text and prints each character in a new line:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
}
And here is my sample text:
میرا نام ابراھیم ھے۔
میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔
However, there seem to be twice as many characters printed as there are letters in the text. I understand that wide or multi-byte characters use multiple bytes, but I thought that a wchar_t would hold all the bytes of a single letter together.
How can I read the text so that at any one time, I have a whole character stored in a variable?
Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64 bit
Text file encoding: UTF-8
This is how my text looks in hex format:
d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 0a d9 85 db 8c da ba 20 d9 88 db 8c d9 86 da 88 d8 b1 d8 a8 d9 84 d9 b9 20 db 8c d9 88 d9 86 db 8c d9 88 d8 b1 d8 b3 d9 b9 db 8c 20 d9 85 db 8c da ba 20 d9 be da 91 da be d8 aa d8 a7 20 da be d9 88 da ba db 94 0a
Upvotes: 0
Views: 145
Reputation: 3661
UTF-8 is an encoding for Unicode that uses 1 to 4 bytes per character. I was able to store each Unicode character in a uint32_t (or u_int32_t on some UNIX platforms) variable. The library I used is (utf8.h | utf8.c). It provides some conversion and manipulation functions for UTF-8 strings.
So if a file is n bytes of UTF-8, it contains at most n Unicode characters (for example, م in my sample text is two bytes, d9 85, but only one code point, U+0645). This means that 4*n bytes of memory (4 bytes per u_int32_t variable) are always enough to store the decoded contents of the file:
#include "utf8.h"
// here read contents of file into a char* => buff
// keep count of # of bytes read => N
ubuff = (u_int32_t*) calloc(N, sizeof(u_int32_t)); // calloc initializes to 0
u8_toucs(ubuff, N, buff, N);
// ubuff now is an array of 4-byte integers representing
// a Unicode character each
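For completeness, here is a minimal sketch of what the elided reading step above might look like (error checking omitted; buff and N are the same names used above, the file name is taken from the question, and it needs <stdio.h> and <stdlib.h>):
FILE *f = fopen("urdu.txt", "rb"); // "rb" so the raw UTF-8 bytes are read unchanged
fseek(f, 0, SEEK_END);
long N = ftell(f);                 // number of bytes in the file
fseek(f, 0, SEEK_SET);
char *buff = malloc(N + 1);
fread(buff, 1, N, f);
buff[N] = '\0';
fclose(f);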
Of course, it is entirely possible that the file contains fewer than n Unicode characters, since several bytes can encode a single character. In that case the 4*n allocation is too large and the tail of ubuff stays 0 (the Unicode null character). So I simply scan the array and shrink the allocation accordingly:
u_int32_t *original = ubuff;
int sz = 0;
while (*ubuff != 0) { // count code points up to the first 0
    ubuff++;
    sz++;
}
ubuff = realloc(original, sizeof(*original) * sz); // shrink to sz code points
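To check the result, the decoded array can be printed one character per line, much as in the original program. A minimal sketch (assuming output is already set up to display wide characters; the cast is fine here because all Urdu code points fit in the Basic Multilingual Plane):
for (int i = 0; i < sz; i++) {
    wprintf(L"%lc\n", (wchar_t) ubuff[i]); // one code point per line
}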
Note: If you get type errors about u_int32_t, put typedef uint32_t u_int32_t; at the beginning of your code.
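That is, something along these lines (uint32_t itself comes from <stdint.h>):
#include <stdint.h>
typedef uint32_t u_int32_t; // only needed if your platform does not define u_int32_t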
Upvotes: 0
Reputation: 119847
Windows support for Unicode is mostly proprietary, and it is impossible to write portable software that uses UTF-8 and works on Windows with the native Windows libraries. If you are willing to consider non-portable solutions, here is one:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>
#include <io.h>      // for _setmode and _fileno

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");

    // Next line is needed to output wchar_t data to the console. Note that
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening with a text editor
    // should show Urdu characters.
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea. Using wprintf throughout. (Not Windows-specific)
    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. It is necessary
    // to use wint_t to keep the result of fgetwc, or to print with
    // %lc. (Not Windows-specific)
    wint_t c;

    // Next line has a non-standard parameter passed to fopen, ccs=...
    // This is a Windows way to support different file encodings.
    // There are no UTF-8 locales in Windows.
    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
}
OTOH, with a C library that supports UTF-8 locales (e.g. glibc on Linux, or the Cygwin runtime on Windows), these Windows extensions are not needed, because the library handles the decoding internally.
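For comparison, here is a sketch of what that portable version might look like under Cygwin or Linux, assuming a UTF-8 locale such as en_US.UTF-8 is set in the environment; the Windows-specific parts simply disappear:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, ""); // pick up the UTF-8 locale from the environment
    FILE *f = fopen("urdu.txt", "r");
    if (f == NULL)
        return 1;
    wint_t c;
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
    return 0;
}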
Upvotes: 1