Is this BOM in UTF-8 incorrect?

Question

I want to verify the BOM of a UTF-8 file, so I wrote the C++ code below.

However, the result was 0XFFFFFFEF, 0XFFFFFFBB, 0XFFFFFFBF.

This is different from what I expected: 0XEF, 0XBB, 0XBF.

Why did I get the result above?

By the way, the UTF-8 file was created with Notepad++.

#include <iostream>
#include <fstream>

using namespace std;

int main()
{
        char file[]="/*UTF-8 file*/"; 
        
        char a[3]{};

        ifstream ifs(file, ios_base::binary);
        
        ifs.read(a, static_cast<streamsize>(sizeof(a)));
        
        cout << showbase << uppercase;
        
        for(int i:a){
                cout << hex << i << endl;
        }
}

Environment

GCC 9.2.0

Compile option: -std=c++2a

Remy Lebeau · Accepted Answer

The BOM itself is fine. You are simply printing out the bytes incorrectly.

The result you are seeing comes from sign-extending signed 8-bit char values to signed 32-bit integers. Whether plain char is signed or unsigned is implementation-defined, unless you state it explicitly in code (signed char or unsigned char). In your case, char is signed. The BOM bytes 0xEF, 0xBB, and 0xBF are all above 0x7F, so their high bit is 1, and a signed char holding them is negative. When such a value is converted to a signed 32-bit int, the sign bit is copied into the 24 new upper bits, which is where the leading FFFFFF comes from.
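A minimal sketch of the two kinds of widening, for illustration only. The helper names widen_signed and widen_unsigned are hypothetical and do not appear in the question's code; the fixed-width types from <cstdint> stand in for a signed char and an unsigned char.

```cpp
#include <cstdint>

// Hypothetical helpers, named here only for illustration.

// Sign extension: widening a signed 8-bit value copies its sign bit
// into all of the new upper bits.
std::uint32_t widen_signed(std::int8_t byte) {
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(byte));
}

// Zero extension: widening an unsigned 8-bit value fills the new
// upper bits with zeros.
std::uint32_t widen_unsigned(std::uint8_t byte) {
    return static_cast<std::uint32_t>(byte);
}
```

For the bit pattern 0xEF (which is -17 as a signed 8-bit value), widen_signed yields 0xFFFFFFEF, while widen_unsigned yields 0xEF.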

To output the bytes correctly, you need the values to be zero-extended, not sign-extended. Use unsigned types for that, e.g.:

#include <iostream>
#include <fstream>
#include <iomanip>

using namespace std;

int main()
{
    char file[] = "/*UTF-8 file*/";
    unsigned char a[3];

    ifstream ifs(file, ios_base::binary);
    ifs.read(reinterpret_cast<char*>(a), sizeof(a));

    cout << showbase << uppercase;

    for(unsigned int i : a){
        cout << hex << setw(2) << setfill('0') << i << endl;
    }
}
