zavg

Reputation: 11081

Check if a byte sequence is a valid UTF-8 sequence in JavaScript

Is there a simple way to check whether a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from an external API, and sometimes (very rarely, but it happens) it returns data containing invalid UTF-8 sequences. Trying to insert that data into PostgreSQL results in an appropriate error.

Upvotes: 5

Views: 5885

Answers (1)

Raffaele

Reputation: 20885

UTF-8 is in fact a simple encoding, but still what you are asking can't be done with a one-liner. You have to:

  1. Override the Content-Type of the response so that your script receives a byte array and the browser/library does not interpret the response itself.
  2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, and that's why some sequences are invalid.
  3. If an invalid octet is found, skip it.
  4. If needed, deserialize the JSON/XML/whatever string to a JavaScript object, possibly handling failures.
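
Steps 2 and 3 could look roughly like the sketch below. It assumes the bytes are already available as an array or `Uint8Array`; the function name `decodeUtf8Skipping` is made up for illustration, and dropping one offending byte at a time before resynchronizing is only one possible policy.

```javascript
// Decode UTF-8 bytes into a string, silently skipping octets that do not
// belong to a well-formed sequence. `bytes` is an array-like of 0..255.
// (Hypothetical helper name; not part of any library.)
function decodeUtf8Skipping(bytes) {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let needed; // continuation bytes expected after the lead byte
    let min;    // smallest code point allowed for this length (rejects overlong forms)
    let cp;     // code point being assembled

    if (b0 <= 0x7f) {                  // 0xxxxxxx: ASCII
      needed = 0; min = 0; cp = b0;
    } else if ((b0 & 0xe0) === 0xc0) { // 110xxxxx
      needed = 1; min = 0x80; cp = b0 & 0x1f;
    } else if ((b0 & 0xf0) === 0xe0) { // 1110xxxx
      needed = 2; min = 0x800; cp = b0 & 0x0f;
    } else if ((b0 & 0xf8) === 0xf0) { // 11110xxx
      needed = 3; min = 0x10000; cp = b0 & 0x07;
    } else {
      i += 1;                          // stray continuation byte or 0xF8..0xFF: skip it
      continue;
    }

    // The sequence must not run past the end of the input.
    let ok = i + needed < bytes.length;
    for (let k = 1; ok && k <= needed; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) !== 0x80) ok = false;  // not 10xxxxxx
      else cp = (cp << 6) | (b & 0x3f);
    }
    // Reject overlong encodings, values above U+10FFFF, and surrogates.
    if (ok && (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))) {
      ok = false;
    }
    if (!ok) {
      i += 1;                          // drop the lead byte and resynchronize
      continue;
    }
    out += String.fromCodePoint(cp);
    i += needed + 1;
  }
  return out;
}
```

For step 1, with a fetch-style client the raw bytes could be obtained with something like `new Uint8Array(await response.arrayBuffer())`; which call applies depends on the HTTP library in use.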

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again it's not a one-line thing.
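
As a rough illustration of that "bunch of if statements", here is a minimal validity check. It is a sketch rather than a reference implementation: `isValidUtf8` is a hypothetical name, the input is assumed to be an array or `Uint8Array` of byte values, and the rules applied are the usual well-formedness ones (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```javascript
// Returns true if `bytes` (array-like of 0..255) is well-formed UTF-8.
// (Hypothetical helper name; not part of any library.)
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let needed; // continuation bytes expected after the lead byte
    let min;    // smallest code point allowed for this length
    let cp;     // code point being assembled

    if (b0 <= 0x7f) {                  // 0xxxxxxx: ASCII, always fine
      i += 1;
      continue;
    } else if ((b0 & 0xe0) === 0xc0) { // 110xxxxx
      needed = 1; min = 0x80; cp = b0 & 0x1f;
    } else if ((b0 & 0xf0) === 0xe0) { // 1110xxxx
      needed = 2; min = 0x800; cp = b0 & 0x0f;
    } else if ((b0 & 0xf8) === 0xf0) { // 11110xxx
      needed = 3; min = 0x10000; cp = b0 & 0x07;
    } else {
      return false;                    // stray continuation byte or 0xF8..0xFF
    }

    if (i + needed >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= needed; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) !== 0x80) return false;      // not 10xxxxxx
      cp = (cp << 6) | (b & 0x3f);
    }

    if (cp < min) return false;                     // overlong encoding
    if (cp > 0x10ffff) return false;                // beyond Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // UTF-16 surrogate

    i += needed + 1;
  }
  return true;
}
```

A caller could run this on the raw response bytes before inserting into the database and reject or repair the payload when it returns `false`.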

Upvotes: 5
