sashoalm

Reputation: 79487

Detect text file encoding

In my program I load plain text files supplied by the user:

QFile file(fileName);
if (!file.open(QIODevice::ReadOnly))
    return; // could not open the file
QTextStream stream(&file);
const QString text = stream.readAll();

This works fine when the files are UTF-8 encoded, but some users try to import Windows-1252 encoded files, and any words containing non-ASCII characters (for example "è" in "boutonnière") are then displayed incorrectly.

Is there a way to detect the encoding, or at least to distinguish between UTF-8 (possibly without a BOM) and Windows-1252, without asking the user to tell me the encoding?

Upvotes: 5

Views: 9682

Answers (2)

Violet Giraffe

Reputation: 33589

This trick has worked for me so far, and it does not require a BOM: decode the data as UTF-8, and fall back to the system locale codec if the decoder reports invalid characters.

    const QByteArray data(readSource());
    QTextCodec *utf8Codec = QTextCodec::codecForName("UTF-8");
    QTextCodec::ConverterState state;
    // Decode as UTF-8; the converter state counts invalid byte sequences.
    const QString text = utf8Codec->toUnicode(data.constData(), data.size(), &state);
    if (state.invalidChars > 0)
    {
        // Not valid UTF-8: fall back to the system default locale codec.
        QTextCodec *localeCodec = QTextCodec::codecForLocale();
        if (!localeCodec)
            return;

        // Reuse the bytes already read instead of calling readSource() again.
        ui->textBrowser->setPlainText(localeCodec->toUnicode(data));
    }
    else
    {
        ui->textBrowser->setPlainText(text);
    }
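
For anyone curious what "invalid characters" means here, below is a minimal Qt-free sketch of a UTF-8 well-formedness check in standard C++. It is not Qt's implementation, and `looksLikeUtf8` is a hypothetical name made up for illustration. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Qt): returns true if `bytes` is
// well-formed UTF-8. No BOM is required.
bool looksLikeUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected after the lead
        unsigned long minCp = 0;    // smallest code point valid for this length
        unsigned long cp = 0;       // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minCp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minCp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minCp = 0x10000; cp = lead & 0x07; }
        else return false;          // stray continuation byte or invalid lead byte

        if (n - i <= extra)
            return false;           // sequence is truncated at the end of the data

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                return false;       // expected a 10xxxxxx continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates, and out-of-range values.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

With this, "boutonnière" saved as Windows-1252 fails the check, because the byte 0xE8 looks like the start of a three-byte sequence but is followed by a plain "r". Keep in mind that pure ASCII text passes too, and that a Windows-1252 file can in principle happen to be valid UTF-8, so this is a heuristic rather than a proof.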

Upvotes: 4

sashoalm

Reputation: 79487

It turns out that auto-detecting the encoding is impossible in the general case.

However, there is a workaround to at least fall back to the system locale if the text is not recognized as UTF-8/UTF-16/UTF-32. It uses QTextCodec::codecForUtfText(), which checks the byte array for a UTF-8, UTF-16 or UTF-32 BOM and returns the matching codec, or the supplied default codec if it finds none.

Code to do it:

QTextCodec *codec = QTextCodec::codecForUtfText(byteArray, QTextCodec::codecForName("System"));
const QString text = codec->toUnicode(byteArray);

Update

The above code will not detect UTF-8 without a BOM, however, as codecForUtfText() relies on the BOM. To detect UTF-8 without a BOM, see https://stackoverflow.com/a/18228382/492336.
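
To illustrate why BOM-based detection cannot see BOM-less UTF-8, here is a rough sketch of BOM sniffing in standard C++. It is not Qt's code; `Bom` and `detectBom` are hypothetical names made up for illustration. The signatures themselves are the standard Unicode byte order marks.

```cpp
#include <string>

// Hypothetical names for illustration; these are not Qt types.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

Bom detectBom(const std::string &bytes)
{
    // True if `bytes` begins with the given signature.
    const auto startsWith = [&bytes](const std::string &sig) {
        return bytes.compare(0, sig.size(), sig) == 0;
    };

    // Lengths are passed explicitly because some signatures contain NUL bytes.
    // UTF-32 is tested before UTF-16, since FF FE 00 00 also begins with FF FE.
    if (startsWith(std::string("\x00\x00\xFE\xFF", 4))) return Bom::Utf32BE;
    if (startsWith(std::string("\xFF\xFE\x00\x00", 4))) return Bom::Utf32LE;
    if (startsWith(std::string("\xEF\xBB\xBF", 3)))     return Bom::Utf8;
    if (startsWith(std::string("\xFE\xFF", 2)))         return Bom::Utf16BE;
    if (startsWith(std::string("\xFF\xFE", 2)))         return Bom::Utf16LE;
    return Bom::None;   // no signature: the content itself has to be inspected
}
```

A UTF-8 file saved without a BOM and a Windows-1252 file both end up in the `Bom::None` branch, which is why a content-based check is still needed to tell them apart.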

Upvotes: 4
