THE 1987

Reputation: 45

Is this BOM in UTF-8 incorrect?

I want to verify the BOM of a UTF-8 file, so I wrote the C++ code below.

However, the output was 0XFFFFFFEF, 0XFFFFFFBB, 0XFFFFFFBF.

This differs from what I expected: 0XEF, 0XBB, 0XBF.

Why did I get this result?

By the way, the UTF-8 file was created with Notepad++.

#include <iostream>
#include <fstream>

using namespace std;

int main()
{
        char file[]="/*UTF-8 file*/"; 
        
        char a[3]{};

        ifstream ifs(file, ios_base::binary);
        
        ifs.read(a, static_cast<streamsize>(sizeof(a)));
        
        cout << showbase << uppercase;
        
        for(int i:a){
                cout << hex << i << endl;
        }
}

Environment

GCC 9.2.0

Compile option: -std=c++2a

Upvotes: 0

Views: 144

Answers (1)

Remy Lebeau

Reputation: 598279

The BOM itself is fine. You are simply printing out the bytes incorrectly.

The result you are seeing comes from sign-extending signed 8-bit char values to signed 32-bit integers. Whether plain char is signed or unsigned is implementation-defined, unless you state it explicitly in code. In your case, plain char is signed. A byte greater than 127 (0x7F) has its high bit set, so as a signed char it is a negative value, and widening it to a signed 32-bit int fills the new upper bits with 1s. That is where the leading FFFFFF comes from.

To output the bytes correctly, you need the values to be zero-extended, not sign-extended. Use unsigned types for that, e.g.:

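#include <iostream>
#include <iomanip>
#include <fstream>

using namespace std;

int main()
{
    char file[] = "/*UTF-8 file*/";
    unsigned char a[3]{};

    ifstream ifs(file, ios_base::binary);
    // read() expects a char*, so reinterpret the unsigned buffer for the call
    ifs.read(reinterpret_cast<char*>(a), sizeof(a));

    cout << showbase << uppercase;

    // unsigned char -> unsigned int zero-extends, so 0xEF stays 0xEF
    for(unsigned int i : a){
        cout << hex << setw(2) << setfill('0') << i << endl;
    }
}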

Upvotes: 3
