Lukas Kalbertodt

Reputation: 88946

How to check if a Node.js `Buffer` contains valid UTF-8?

I have a Buffer object and I would like to check if all of it is valid UTF-8. Ideally, I would like to get a string with said decoded UTF-8 text, too.

I tried `Buffer.toString`, which takes an encoding argument that defaults to `utf8`. Unfortunately, the docs say this:

If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.

That's not what I want: I would rather get an exception or a boolean flag. Checking whether the resulting string contains U+FFFD is not sufficient, because the input text could already have contained U+FFFD as a perfectly valid Unicode code point. Of course, one could count the occurrences of U+FFFD in the buffer and in the string and compare the two, but that seems needlessly complicated and inefficient.
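
To illustrate the ambiguity, here is a minimal sketch (the variable names are only for illustration). A buffer that holds a correctly encoded U+FFFD and a buffer that holds an invalid byte decode to the same string:

```javascript
// U+FFFD encoded as well-formed UTF-8 is the three bytes EF BF BD.
const validInput = Buffer.from([0xef, 0xbf, 0xbd]);
// A lone 0xFF byte can never occur in well-formed UTF-8.
const invalidInput = Buffer.from([0xff]);

const fromValid = validInput.toString("utf8");
const fromInvalid = invalidInput.toString("utf8");

console.log(fromValid === "\ufffd");    // true
console.log(fromInvalid === "\ufffd");  // true
console.log(fromValid === fromInvalid); // true: the results are indistinguishable
```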

Is there a better way?

Upvotes: 4

Views: 2603

Answers (2)

import NodeBuffer, {Buffer} from "node:buffer";

NodeBuffer.isUtf8(input)

  • Added in: version 19.4.0, version 18.14.0.
  • input (<Buffer> | <ArrayBuffer> | <TypedArray>)

This function returns true if input contains only valid UTF-8-encoded data, including the case in which input is empty.

Throws if the input is a detached array buffer.
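
A minimal sketch of how this could be combined with `toString` to also get the decoded text, assuming a Node.js version that has `isUtf8`. The helper name `toUtf8OrNull` is made up for this example, and CommonJS `require` is used here in place of the `import` form shown above:

```javascript
const { isUtf8 } = require("node:buffer");

// Hypothetical helper: returns the decoded string,
// or null if the bytes are not valid UTF-8.
function toUtf8OrNull(buf) {
  return isUtf8(buf) ? buf.toString("utf8") : null;
}

console.log(toUtf8OrNull(Buffer.from([72, 195, 182])));    // 'Hö'
console.log(toUtf8OrNull(Buffer.from([1, 2, 255, 3, 5]))); // null
```

Note that this walks the buffer twice, once to validate and once to decode.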

Upvotes: 5

Lukas Kalbertodt

Reputation: 88946

You can use `TextDecoder` from `util` (it is also available as a global). To get an exception on invalid input, set the `fatal` flag to `true`.

new TextDecoder("utf8", { fatal: true }).decode(buffer)

For example:

> new TextDecoder("utf8", { fatal: true }).decode(Buffer.from([72, 195, 182, 240, 159, 146, 154, 215, 169, 214, 184, 215, 129]))
'Hö💚שָׁ'

> new TextDecoder("utf8", { fatal: true }).decode(Buffer.from([1, 2, 255, 3, 5]))
Uncaught:
TypeError [ERR_ENCODING_INVALID_ENCODED_DATA]: The encoded data was not valid for encoding utf-8
    at __node_internal_captureLargerStackTrace (node:internal/errors:478:5)
    at new NodeError (node:internal/errors:387:5)
    at TextDecoder.decode (node:internal/encoding:433:15) {
  errno: 12,
  code: 'ERR_ENCODING_INVALID_ENCODED_DATA'
}
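
If a boolean-style result is preferred over an exception, the call can be wrapped in a small helper. This is only a sketch: the name `decodeUtf8Strict` is made up, it relies on the global `TextDecoder`, and it treats any error from `decode` as "not valid UTF-8":

```javascript
// Hypothetical helper: returns the decoded string,
// or null if decoding fails.
function decodeUtf8Strict(buf) {
  try {
    // fatal: true makes decode() throw instead of inserting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(buf);
  } catch {
    // Assumption: with a Buffer as input, the only expected failure
    // is malformed UTF-8, so every error is mapped to null here.
    return null;
  }
}

console.log(decodeUtf8Strict(Buffer.from([72, 195, 182, 240, 159, 146, 154]))); // 'Hö💚'
console.log(decodeUtf8Strict(Buffer.from([1, 2, 255, 3, 5])));                   // null
```

Validation and decoding happen in a single pass, so the result can be used directly when it is not `null`.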

Upvotes: 3
