Reputation: 9853
I'm using xmlTextWriter from libxml2 to write some XML files, and I need to write Cyrillic characters into them.
I do it this way:
    xmlTextWriterStartDocument(writer, NULL, "utf-8", NULL);
    ...
    snprintf(buf, sizeof(buf), "%s", "тест");
    xmlTextWriterWriteAttribute(writer,
                                (const xmlChar*)"test_attribute",
                                (const xmlChar*)buf);
But when I open the resulting XML file, I see an HTML-entity representation of my text instead of the literal test_attribute="тест"
How can I fix this?
Upvotes: 1
Views: 3376
Reputation: 14467
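You need to convert the text to UTF-8 yourself, with a separate encoder, before handing it to libxml2.
The text you pass to snprintf() is in CP-1251 (a single-byte Windows code page for Cyrillic), not in UTF-8 (a variable-width encoding), and libxml2 treats xmlChar strings as UTF-8.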
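If you are not sure which encoding your buffer actually holds, dumping its bytes is a quick check. The helper below is only a sketch; bytes_to_hex is a hypothetical name, not part of libxml2 or the C library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical diagnostic helper: writes the bytes of 's' into 'out'
   as space-separated uppercase hex, never exceeding 'cap' bytes.
   Output is silently truncated if 'out' is too small. */
static char *bytes_to_hex(char *out, size_t cap, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0;

    assert(out != NULL && s != NULL && cap > 0);
    out[0] = '\0';
    /* Each step needs at most 3 characters plus the terminating NUL. */
    for (; *p && used + 4 <= cap; p++) {
        used += (size_t)snprintf(out + used, cap - used,
                                 used ? " %02X" : "%02X", (unsigned)*p);
    }
    return out;
}
```

Called on buf right after the snprintf() from the question, F2 E5 F1 F2 means the text is CP-1251, while D1 82 D0 B5 D1 81 D1 82 means it is already UTF-8.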
See this link for the reference implementation: http://7maze.ru/node/29
The comments are in Russian, but all you need is the conversion table and the string convertToUtf8(const char* chars, int len) function at the end.
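Once encoded as UTF-8, the "тест" string you used will look something like "РўРчС_С'" (absolutely meaningless) when viewed in a single-byte encoding, because each Cyrillic letter now takes two bytes.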
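To make the two-byte structure concrete, here is a minimal sketch of a direct CP-1251 to UTF-8 conversion for the basic Russian letters. It is not the function from the linked page: cp1251_to_utf8 is a hypothetical name, and instead of a full conversion table it handles only ASCII, А..я, Ё and ё, replacing any other high byte with '?'.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: converts CP-1251 text to UTF-8.
   'out' must hold at least 2 * strlen(in) + 1 bytes. */
static char *cp1251_to_utf8(char *out, const char *in)
{
    const unsigned char *src;
    unsigned char *dst;

    assert(out != NULL && in != NULL);
    src = (const unsigned char *)in;
    dst = (unsigned char *)out;

    for (; *src; src++) {
        unsigned cp;                       /* Unicode code point */

        if (*src < 0x80)       cp = *src;                    /* ASCII */
        else if (*src >= 0xC0) cp = 0x0410u + (*src - 0xC0); /* А..я  */
        else if (*src == 0xA8) cp = 0x0401u;                 /* Ё     */
        else if (*src == 0xB8) cp = 0x0451u;                 /* ё     */
        else                   cp = '?';   /* not covered by this sketch */

        if (cp < 0x80) {
            *dst++ = (unsigned char)cp;
        } else {
            /* Two-byte UTF-8 form: 110xxxxx 10xxxxxx */
            *dst++ = (unsigned char)(0xC0 | (cp >> 6));
            *dst++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *dst = '\0';
    return out;
}
```

With this, the CP-1251 bytes F2 E5 F1 F2 ("тест") become D1 82 D0 B5 D1 81 D1 82, which can then be passed to xmlTextWriterWriteAttribute as (const xmlChar*).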
Here is some old C code from an old project. It converts from CP-866 (another "popular" encoding, from MS-DOS) to UTF-8, but converting from CP-1251 to CP-866 first is straightforward.
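/// CP866 to UTF-8
/// 'buffer' must hold at least 2 * strlen(dosstr) + 1 bytes.
char *dosstrtou(char *buffer, const char *dosstr)
{
    /* Work on unsigned bytes: where char is signed, values above 127
       are negative and the range checks below would never match. */
    const unsigned char *src = (const unsigned char *)dosstr;
    unsigned char *dst = (unsigned char *)buffer;

    while (*src)
    {
        unsigned char c = *src++;

        if (c > 127 && c < 176)        /* А..п -> D0 90 .. D0 BF */
        {
            *dst++ = 208;
            *dst++ = (unsigned char)(c + 16);
        }
        else if (c > 223 && c < 240)   /* р..я -> D1 80 .. D1 8F */
        {
            *dst++ = 209;
            *dst++ = (unsigned char)(c - 96);
        }
        else if (c == 240)             /* Ё -> D0 81 */
        {
            *dst++ = 208;
            *dst++ = 129;
        }
        else if (c == 241)             /* ё -> D1 91 */
        {
            *dst++ = 209;
            *dst++ = 145;
        }
        else
        {
            /* ASCII is copied as-is. Other high bytes (box drawing etc.)
               are also copied unchanged and are not valid UTF-8. */
            *dst++ = c;
        }
    }
    *dst = '\0';
    return (buffer);
}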
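/// CP1251 to CP866 (converts the string in place)
char *winstrtodos(char *buffer)
{
    /* Unsigned bytes, so comparisons against 0x80..0xFF work as intended. */
    unsigned char *ptr = (unsigned char *)buffer;

    while (*ptr != '\0')
    {
        /* The else-if chain keeps an already converted byte from matching
           a later test (0xE8 becomes 0xA8, which is not Ё in CP866). */
        if (*ptr >= 0x80 + 0x40 && *ptr <= 0xAF + 0x40)   /* А..п */
            *ptr = (unsigned char)(*ptr - 0x40);
        else if (*ptr >= 0xE0 + 0x10)                     /* р..я */
            *ptr = (unsigned char)(*ptr - 0x10);
        else if (*ptr == 0xA8)                            /* Ё */
            *ptr = 0xF0;
        else if (*ptr == 0xB8)                            /* ё */
            *ptr = 0xF1;
        ptr++;
    }
    return (buffer);
}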
Just be careful with the memory.
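One way to stay safe here, sketched under the assumption that the converter never writes more than two bytes per input byte (which holds for dosstrtou above), is to allocate the worst case up front. Both convert_fn and convert_alloc are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Any converter with the same shape as dosstrtou(buffer, src). */
typedef char *(*convert_fn)(char *buffer, const char *src);

/* Hypothetical wrapper: allocates 2 * strlen(src) + 1 bytes (worst case
   when every input byte expands to two output bytes), runs the converter
   and returns the new string. The caller must free() the result.
   Returns NULL if the allocation fails. */
static char *convert_alloc(convert_fn convert, const char *src)
{
    size_t cap;
    char *buf;

    assert(convert != NULL && src != NULL);
    cap = 2 * strlen(src) + 1;
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    return convert(buf, src);
}
```

With the functions from this answer it would be called as convert_alloc(dosstrtou, text). Note that winstrtodos works in place, so it must be given a writable copy, never a string literal.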
Upvotes: 2