Reputation: 45
I want to verify the BOM of a UTF-8 file, and wrote the C++ code below.
However, the output was 0XFFFFFFEF, 0XFFFFFFBB, 0XFFFFFFBF.
This is different from what I expected: 0XEF, 0XBB, 0XBF.
Why did I get this result?
By the way, the UTF-8 file was created with Notepad++.
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
    char file[]="/*UTF-8 file*/";
    char a[3]{};
    ifstream ifs(file, ios_base::binary);
    ifs.read(a, static_cast<streamsize>(sizeof(a)));
    cout << showbase << uppercase;
    for(int i:a){
        cout << hex << i << endl;
    }
}
GCC 9.2.0
Compile option: -std=c++2a
Upvotes: 0
Views: 144
Reputation: 598279
The BOM itself is fine. You are simply printing out the bytes incorrectly.
The result you are seeing is due to sign-extending signed 8-bit char values to signed 32-bit integers. Whether char is signed or unsigned is implementation-defined, unless you state it explicitly in code. In your case, char is (implicitly) signed.
Bytes above 0x7F have their high bit set to 1, so they are negative when stored in a signed char, and that bit is copied into all of the new bits when a signed 8-bit value is extended to a signed 32-bit value.
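The difference between the two kinds of widening can be reproduced without reading a file. The following is only a minimal sketch: the helper names widen_signed and widen_unsigned are made up for illustration, and the results are reported as 32-bit values to match the output in the question.

```cpp
#include <cstdint>

// Hypothetical helper: widens a signed byte the way `int i = a[n]` does
// when char is signed. The sign bit is copied into the new high bits.
constexpr std::uint32_t widen_signed(signed char c) {
    return static_cast<std::uint32_t>(static_cast<int>(c));
}

// Hypothetical helper: widens an unsigned byte. The new high bits are zero.
constexpr std::uint32_t widen_unsigned(unsigned char c) {
    return static_cast<std::uint32_t>(static_cast<unsigned int>(c));
}

// The first BOM byte, 0xEF, is -17 when stored in a signed 8-bit char.
static_assert(widen_signed(static_cast<signed char>(-17)) == 0xFFFFFFEFu,
              "sign extension fills the upper 24 bits with 1s");
static_assert(widen_unsigned(0xEF) == 0x000000EFu,
              "zero extension leaves the upper 24 bits clear");
```

Both helpers receive the same bit pattern; only the signedness of the source type decides what ends up in the upper 24 bits.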
To output the bytes correctly, you need the values to be zero-extended, not sign-extended. Use unsigned types for that, e.g.:
#include <iostream>
#include <fstream>
#include <iomanip>  // setw, setfill
using namespace std;
int main()
{
    char file[] = "/*UTF-8 file*/";
    // unsigned char, so each byte is zero-extended when widened
    unsigned char a[3]{};
    ifstream ifs(file, ios_base::binary);
    // read() expects a char*, so reinterpret the unsigned buffer
    ifs.read(reinterpret_cast<char*>(a), sizeof(a));
    cout << showbase << uppercase;
    for(unsigned int i : a){
        cout << hex << setw(2) << setfill('0') << i << endl;
    }
}
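If the goal is to verify the BOM rather than print it, the same idea applies to comparisons: convert each byte to unsigned char before comparing it with 0xEF, 0xBB, 0xBF. The following is a rough sketch, not a definitive implementation; is_utf8_bom and file_starts_with_utf8_bom are hypothetical names introduced here, not part of any library.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: returns true if the first three bytes of `bytes`
// are the UTF-8 BOM (EF BB BF). Each byte is converted to unsigned char
// first, so the comparison works whether char is signed or unsigned.
inline bool is_utf8_bom(const char* bytes, std::size_t count) {
    constexpr unsigned char bom[] = {0xEF, 0xBB, 0xBF};
    if (count < 3) return false;
    for (std::size_t i = 0; i < 3; ++i) {
        if (static_cast<unsigned char>(bytes[i]) != bom[i]) return false;
    }
    return true;
}

// Hypothetical helper: reads up to three bytes from `path` in binary mode
// and checks them. A file that cannot be opened yields zero bytes read,
// so the function returns false in that case.
inline bool file_starts_with_utf8_bom(const std::string& path) {
    std::ifstream ifs(path, std::ios_base::binary);
    char buf[3]{};
    ifs.read(buf, sizeof(buf));
    return is_utf8_bom(buf, static_cast<std::size_t>(ifs.gcount()));
}
```

This keeps the buffer as plain char, which is what read() expects, and moves the conversion to the single place where the byte values are interpreted.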
Upvotes: 3