Node Buffers, from utf8 to binary

I'm receiving data as a UTF-8 string from a source, but this data was originally binary (it was a Buffer). I have to convert it back to a Buffer, and I'm having a hard time figuring out how to do this.

Here's a small sample that shows my problem:

var hexString = 'e61b08020304e61c09020304e61d0a020304e61e65';
// Buffer.from replaces the deprecated `new Buffer(...)` constructor
var buffer1 = Buffer.from(hexString, 'hex');

var str = buffer1.toString('utf8');
var buffer2 = Buffer.from(str, 'utf8');

console.log('original content:', hexString);
console.log('buffer1 contains:', buffer1.toString('hex'));
console.log('buffer2 contains:', buffer2.toString('hex'));

prints

original content: e61b08020304e61c09020304e61d0a020304e61e65
buffer1 contains: e61b08020304e61c09020304e61d0a020304e61e65
buffer2 contains: efbfbd1b08020304efbfbd1c09020304efbfbd1d0a020304efbfbd1e65

Here, I would like buffer2 to be the exact same thing as buffer1.

How can I convert a UTF-8 string back to its original binary Buffer?

Upvotes: 7

Views: 8608

Answers (2)

theicfire

Reputation: 3067

The accepted answer is misleading. Your main problem is that you're dealing with invalid UTF-8. If the data were valid, the conversion would not cause issues.

Specifically, take the first two bytes: e61b.

In binary, that's 11100110 00011011. This is invalid. Take a look at this diagram from the UTF-8 Wikipedia page.

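UTF-8 byte layout:

Code point range     Byte 1     Byte 2     Byte 3     Byte 4
U+0000..U+007F       0xxxxxxx
U+0080..U+07FF       110xxxxx   10xxxxxx
U+0800..U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000..U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx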

This says that if a byte starts with 1110, it must be followed by two bytes that each start with 10. This is not the case here: the second byte, 00011011, starts with 00.

Whenever Node decodes an invalid byte sequence as UTF-8, it replaces it with �, the Unicode replacement character. The code point for that is U+FFFD, and the UTF-8 encoding of that code point is efbfbd. Notice that this shows up in your output wherever the original had an e6 byte.
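To see the replacement happen on just those first two bytes, here is a minimal sketch. It assumes a Node version that provides Buffer.from, and the helper name survivesUtf8RoundTrip is made up for illustration; it is not part of Node's API.

```javascript
// First two bytes of the sample: 0xe6 announces a 3-byte sequence,
// but 0x1b is not a continuation byte (it does not start with bits 10).
const firstTwo = Buffer.from('e61b', 'hex');

// Decoding swaps the invalid lead byte for U+FFFD and keeps 0x1b as-is,
// so re-encoding yields ef bf bd 1b instead of e6 1b.
const reencoded = Buffer.from(firstTwo.toString('utf8'), 'utf8').toString('hex');
console.log(reencoded);

// Hypothetical helper: true only if decoding as UTF-8 and encoding again
// gives back exactly the same bytes, i.e. the input was valid UTF-8.
function survivesUtf8RoundTrip(buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

console.log(survivesUtf8RoundTrip(firstTwo));
console.log(survivesUtf8RoundTrip(Buffer.from('héllo', 'utf8')));
```

A check like this can tell you whether a given Buffer is safe to pass through a UTF-8 string, but it cannot recover bytes that have already been replaced.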

Upvotes: 0

mscdex

Reputation: 106698

You cannot expect binary data converted to utf8 and back again to be the same as the original binary data because of the way utf8 works (especially when invalid utf8 characters are replaced with \ufffd).

You have to use another encoding that preserves the data exactly. This could be 'hex', 'base64', 'binary' (an alias for 'latin1' in current Node versions), or some other binary-safe format provided by a third-party module. Ideally, keep the data as a Buffer if you can.
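As a rough sketch of what that looks like with the bytes from the question, assuming the sender can be changed to use one of these encodings instead of utf8 (the variable names are only illustrative):

```javascript
const original = Buffer.from('e61b08020304e61c09020304e61d0a020304e61e65', 'hex');

// Sender side: turn the bytes into a string with a binary-safe encoding.
const asBase64 = original.toString('base64');
const asHex = original.toString('hex');
const asLatin1 = original.toString('latin1'); // 'binary' is an alias for this

// Receiver side: decode with the same encoding to get the bytes back.
const fromBase64 = Buffer.from(asBase64, 'base64');
const fromHex = Buffer.from(asHex, 'hex');
const fromLatin1 = Buffer.from(asLatin1, 'latin1');

console.log(fromBase64.equals(original));
console.log(fromHex.equals(original));
console.log(fromLatin1.equals(original));
```

Note that a 'latin1' string only survives if every layer between sender and receiver leaves the characters untouched, so 'base64' or 'hex' is usually the safer choice for transport.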

Upvotes: 11
