jtrfe

Reputation: 21

NodeJS UTF8 Encoding A Buffer Then Decoding That UTF8 String Produces A Buffer With Different Content

I typed this into the Node.js console:

new Buffer(new Buffer([0xde]).toString('utf8'), 'utf8')

and it prints out

<Buffer ef bf bd>

After looking at the docs, it seems that this should produce an identical buffer. I'm creating a UTF-8 encoded string from a buffer whose contents consist of one byte (0xde), then using that string to create a new buffer. Am I missing something here?

Upvotes: 2

Views: 2750

Answers (1)

mscdex

Reputation: 106746

For encodings that can be multi-byte, you cannot expect to get the same bytes back that you started with in all cases: the round trip only preserves bytes that form valid sequences in that encoding. In the case of UTF-8, some characters require more than one byte to be represented properly.

In your example, 0xde exceeds 0x7f, which is the largest value for a character that can be represented by a single byte. In UTF-8, 0xde is the lead byte of a two-byte sequence and must be followed by a continuation byte. So when you call .toString('utf8'), node sees that it only has that one byte and instead returns the replacement character \uFFFD (0xef, 0xbf, 0xbd as UTF-8 bytes), which stands in for input that cannot be decoded. Reading this replacement character back into a new Buffer is no problem, as it is a valid UTF-8 character, so you end up with those three bytes rather than the original one.
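
As a rough illustration of the above, here is a minimal sketch using `Buffer.from` (the non-deprecated replacement for `new Buffer` in current Node.js versions). The variable names are only for illustration, and the `'hex'` round trip at the end is a suggestion for the case where arbitrary bytes must survive a trip through a string, not something the question asked for.

```javascript
// A lone 0xde byte: a UTF-8 lead byte with no continuation byte after it.
const original = Buffer.from([0xde]);

// Decoding substitutes U+FFFD for the invalid sequence.
const decoded = original.toString('utf8');
console.log(decoded === '\ufffd'); // true

// Re-encoding U+FFFD gives its three-byte UTF-8 form, not the original byte.
const reencoded = Buffer.from(decoded, 'utf8');
console.log(reencoded); // <Buffer ef bf bd>

// A complete two-byte sequence (0xde 0x80 encodes U+0780) does round-trip.
const valid = Buffer.from([0xde, 0x80]);
const validRoundTrip = Buffer.from(valid.toString('utf8'), 'utf8');
console.log(valid.equals(validRoundTrip)); // true

// For arbitrary binary data, a byte-preserving encoding such as 'hex' round-trips.
const viaHex = Buffer.from(original.toString('hex'), 'hex');
console.log(viaHex.equals(original)); // true
```

So the UTF-8 round trip is only lossless when the buffer already holds valid UTF-8; for raw bytes, `'hex'` or `'base64'` is the safer choice.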

Upvotes: 4
