Reputation: 9136
I'm receiving data as utf8 from a source, and this data was originally in binary form (it was a Buffer). I have to convert this data back to a Buffer. I'm having a hard time figuring out how to do this.
Here's a small sample that shows my problem:
var hexString = 'e61b08020304e61c09020304e61d0a020304e61e65';
// Buffer.from replaces the deprecated new Buffer(...) constructor
var buffer1 = Buffer.from(hexString, 'hex');
var str = buffer1.toString('utf8');
var buffer2 = Buffer.from(str, 'utf8');
console.log('original content:', hexString);
console.log('buffer1 contains:', buffer1.toString('hex'));
console.log('buffer2 contains:', buffer2.toString('hex'));
prints
original content: e61b08020304e61c09020304e61d0a020304e61e65
buffer1 contains: e61b08020304e61c09020304e61d0a020304e61e65
buffer2 contains: efbfbd1b08020304efbfbd1c09020304efbfbd1d0a020304efbfbd1e65
Here, I would like buffer2 to be the exact same thing as buffer1.
How can I convert a utf8 string to its original binary Buffer?
Upvotes: 7
Views: 8608
Reputation: 3067
The accepted answer is misleading. Your main problem is that you're dealing with invalid UTF-8. If the data were valid, the conversion would not cause issues.
Specifically, take the first two bytes: e61b.
In binary, that's: 11100110, 00011011. This is invalid. Take a look at the byte-layout diagram on the UTF-8 Wikipedia page.
It says that if a byte starts with 1110, it must be followed by two bytes that each start with 10. This is not the case here: the second byte starts with 00.
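The bit patterns can be checked directly in Node.js. This is only a small sketch: the helper names isThreeByteLead and isContinuation are made up for illustration, and it inspects just the first two bytes of the sample rather than validating UTF-8 in general.

```javascript
// Inspect the first two bytes of the sample as bit patterns.
const firstTwo = Buffer.from('e61b', 'hex');
const bits = Array.from(firstTwo, (b) => b.toString(2).padStart(8, '0'));
console.log(bits); // [ '11100110', '00011011' ]

// A lead byte of the form 1110xxxx announces a three-byte sequence.
const isThreeByteLead = (b) => (b & 0xf0) === 0xe0;
// A continuation byte has the form 10xxxxxx.
const isContinuation = (b) => (b & 0xc0) === 0x80;

const leadOk = isThreeByteLead(firstTwo[0]);
const nextOk = isContinuation(firstTwo[1]);
console.log(leadOk); // true
console.log(nextOk); // false, so the sequence is invalid
```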
Whenever JS hits an invalid byte sequence while decoding, it replaces it with �, the Unicode replacement character. The code point for that is U+FFFD, and the UTF-8 encoding of that code point is efbfbd. Notice that this shows up in your output four times, once for each e6 byte.
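A short sketch of that substitution, using the first two bytes of the sample. The euro sign bytes (e282ac) are an extra example of valid UTF-8 chosen here for contrast; they are not part of the question.

```javascript
// Decode an invalid sequence: e6 promises two continuation bytes, 1b is not one.
const invalid = Buffer.from('e61b', 'hex');
const decoded = invalid.toString('utf8');

// The bad lead byte became U+FFFD; the 1b byte survived as-is.
const firstCodePoint = decoded.charCodeAt(0).toString(16);
console.log(firstCodePoint); // fffd

// Re-encoding writes the replacement character as efbfbd.
const reencoded = Buffer.from(decoded, 'utf8').toString('hex');
console.log(reencoded); // efbfbd1b

// Valid UTF-8 (here the euro sign, e2 82 ac) round-trips unchanged.
const valid = Buffer.from('e282ac', 'hex');
const validRoundTrip = Buffer.from(valid.toString('utf8'), 'utf8').equals(valid);
console.log(validRoundTrip); // true
```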
Upvotes: 0
Reputation: 106698
You cannot expect binary data converted to utf8 and back again to be the same as the original binary data because of the way utf8 works (especially when invalid utf8 characters are replaced with \ufffd).
You have to use another format that correctly preserves the data. This could be 'hex', 'base64', 'binary' (an alias for 'latin1' in Node.js), or some other binary-safe format provided by a third-party module. Ideally, keep it as a Buffer if you can.
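As a rough illustration, the sample bytes from the question survive a round trip through each of the built-in binary-safe encodings, but not through utf8. The results object is just a local name used for this sketch.

```javascript
const original = Buffer.from('e61b08020304e61c09020304e61d0a020304e61e65', 'hex');

// Round-trip the bytes through each encoding and record whether they survive.
const results = {};
for (const encoding of ['hex', 'base64', 'latin1', 'utf8']) {
  const text = original.toString(encoding);
  const restored = Buffer.from(text, encoding);
  results[encoding] = restored.equals(original);
}

console.log(results);
// { hex: true, base64: true, latin1: true, utf8: false }
```

Of these, base64 is usually the most compact text form that is also safe to embed in JSON or send over text-only channels; hex is easier to read when debugging.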
Upvotes: 11