Reputation: 319
I have to write config info to a file on Linux, and the config info contains Chinese characters.
Instead of using wchar_t, I am just using a char array. Is this correct?
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#define MSG_LEN 4096
int save_config_info(const char *path, char* message)
{
FILE *fp = NULL;
fp = fopen(path, "wb");
if (!fp)
{
//print error message
return -1;
}
if (fwrite(message, 1, strlen(message), fp) != strlen(message))
{
//print error message
fclose(fp);
return -1;
}
fclose(fp);
return 0;
}
int main()
{
//config contain chinese character
char str[MSG_LEN] = "配置文件中包含中文";
char path[PATH_MAX] = "example.txt";
save_config_info(path,str);
return 0;
}
If the source file is encoded as ISO-8859-1, the generated example.txt shows only question marks (?????) when displayed with cat.
But if I change the source file encoding to UTF-8, everything works well.
My question is:
Is there an elegant way to deal with Chinese characters, given that I cannot guarantee the source file encoding?
I want example.txt to always look right.
[root workspace]#file fork.c
fork.c: C source, ASCII text
[root workspace]#gcc -g -o fork fork.c
[root workspace]#
[root workspace]#./fork
[root workspace]#
[root workspace]#
[root workspace]#file example.txt
example.txt: ASCII text, with no line terminators
[root workspace]#
[root workspace]#cat example.txt
?????????[root workspace]#
[root workspace]#
[root workspace]#
[root workspace]#file fork.c
fork.c: C source, UTF-8 Unicode text
[root workspace]#
[root workspace]#gcc -g -o fork fork.c
[root workspace]#./fork
[root workspace]#
[root workspace]#file example.txt
example.txt: UTF-8 Unicode text, with no line terminators
[root workspace]#cat example.txt
配置文件中包含中文[root workspace]#
Upvotes: 0
Views: 1015
Reputation: 41774
To get a UTF-8 string reliably and elegantly regardless of the source file encoding, just use the u8 prefix:
char str[] = u8"\u914D\u7F6E\u6587\u4EF6\u4E2D\u5305\u542B\u4E2D\u6587";
// or
char str[] = u8"配置文件中包含中文";
char str[] can be changed to char8_t str[] if you use C23 or C++20. In C23 you can also use
auto str = u8"配置文件中包含中文";
This way you don't need to look up the encoded UTF-8 bytes, and when you need another encoding like UTF-16 or UTF-32, just change the type and the prefix (u8 to u or U, and char[] to auto). The compiler will automatically convert the encoding to guarantee the correct byte sequence in memory.
Upvotes: 0
Reputation: 37232
Instead of using wchar_t, I am just using a char array. Is this correct?
I'd say no. The default character set and encoding for char is implementation-defined (could be EBCDIC or ASCII or UTF-8 or whatever the source file happened to use or anything else) and the default character set and encoding for wchar_t is also implementation-defined (could be UTF-16LE or ...).
If you need the output to be UTF-8, then (especially for portable code) you need to ignore the random default nonsense the C compiler felt like using. You should also avoid using char because whether that's signed or unsigned is implementation-defined, avoid using unsigned char because there's no actual guarantee that it's 8 bits, and avoid using wchar_t (because its size is implementation-defined).
Specifically (for UTF-8), I'd use uint8_t (from <stdint.h>), like:
uint8_t str[] = { 0xE9, 0x85, 0x8D, 0xE7, 0xBD, 0xAE, 0xE6, 0x96, 0x87, 0xE4, 0xBB, 0xB6,
                  0xE4, 0xB8, 0xAD, 0xE5, 0x8C, 0x85, 0xE5, 0x90, 0xAB, 0xE4, 0xB8, 0xAD,
                  0xE6, 0x96, 0x87, 0x00 };
Of course, if you want the file to contain CNS-11643 (or anything else) you could do that too. You just need to find a suitable type, and find the "array of numbers of that type" (e.g. possibly by using a utility like hexdump on a text file that uses the desired character set and encoding).
Upvotes: -1
Reputation: 385657
Is there an elegant way of representing characters not found in ASCII using just ASCII characters? No.
But it is possible to do so in an inelegant way.
char str[MSG_LEN] = "\xE9\x85\x8D\xE7\xBD\xAE\xE6\x96\x87\xE4\xBB\xB6\xE4\xB8\xAD\xE5\x8C\x85\xE5\x90\xAB\xE4\xB8\xAD\xE6\x96\x87";
Of course, just like your original program, this assumes the person viewing the file contents (e.g. using cat) has a locale based on UTF-8.
Upvotes: 2