ChrisMM
ChrisMM

Reputation: 10103

Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file, which has characters such as ²,³,Ç and ó. When I read the file using the below, the file appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable)

QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();

However, as soon as the contents get converted to an std::string the additional characters are added. For example, the ² gets converted to ², when it should just be ². This appears to happen for every non-ANSI character, the extra  is added, which, of course, means that when the a new file is saved, the characters are not correct in the output file.

I have, of course, tried simply doing toStdString(), I've also tried toUtf8 and have even tried using the QTextCodec but each fails to give the proper values.

I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?

Upvotes: 0

Views: 684

Answers (2)

ChrisMM
ChrisMM

Reputation: 10103

As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not read in the encoding, and, therefore, does not actually read the file correctly. The QTextStream must set its Codec prior to reading the file in order to properly parse the characters. Added a comment to the code below to show the extra file needed.

QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();

Upvotes: 3

Daniel Kamil Kozar
Daniel Kamil Kozar

Reputation: 19336

What you're seeing is actually the expected behaviour.

The string ² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string ².

QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.

Now, about what you're seeing in the debugger :

  1. You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct).". This is only partially true : QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values : 0x00C2 0x00B2. These, in fact, map to the characters  and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
  2. You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as ². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using a English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place : the std::string actually contains the bytes C3 82 C2 B2, which map to the characters ² in Windows-1252.

Shameless self plug : I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.

One last thing : ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.

Upvotes: 0

Related Questions