Reputation: 10103
I have a UTF-8 encoded text file, which contains characters such as ², ³, Ç and ó. When I read the file using the code below, it appears to be read correctly (at least according to what I can see in Visual Studio's editor when viewing the contents variable):
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents are converted to a std::string, extra characters are added. For example, ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: an extra Â is added, which, of course, means that when a new file is saved, the characters in the output file are not correct.
I have, of course, tried simply calling toStdString(); I've also tried toUtf8(), and have even tried using QTextCodec, but each fails to give the proper values.
I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?
Upvotes: 0
Views: 684
Reputation: 10103
As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not detect the encoding and therefore does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to properly parse the characters. I've added a comment to the code below to mark the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
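Note that if you're on Qt 6, QTextCodec and QTextStream::setCodec were removed. Assuming Qt 6's QStringConverter-based API, the equivalent line would be:

```cpp
// Qt 6: QTextStream already defaults to UTF-8, but the encoding
// can still be set explicitly.
stream.setEncoding( QStringConverter::Utf8 );
```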
Upvotes: 3
Reputation: 19336
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file, namely C3 82 C2 B2.
Now, about what you're seeing in the debugger:

- You've said that the debugger shows the QString as containing Â², i.e. the bytes 0xC2 0xB2 in the string (which is correct). This is only partially true: QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values: 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16, and thus renders the characters correctly.

- You've said that the debugger renders the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place: the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.

Shameless self plug: I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing: ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.
Upvotes: 0