Alosyius

Reputation: 9111

Nodejs convert string into UTF-8

From my DB I'm getting the following string:

Johan Ã–bert

What it should say is:

Johan Öbert

I've tried to convert it into utf-8 like so:

nameString.toString("utf8");

But still same problem.

Any ideas?

Upvotes: 78

Views: 317004

Answers (8)

Lord Elrond

Reputation: 16002

I'd recommend using Buffer:

Buffer.from('someString', '<input-encoding>').toString('utf-8')

This avoids any unnecessary dependencies that other answers require, since Buffer is included with node.js, and is already defined in the global scope.
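
For the question's exact symptom, a minimal sketch, assuming the DB bytes are UTF-8 that were mis-decoded as latin1 (the \u0096 escape is the invisible control character latin1 produces for the byte 0x96):

const garbled = 'Johan Ã\u0096bert'; // what the DB appears to return
const fixed = Buffer.from(garbled, 'latin1').toString('utf-8');
console.log(fixed); // 'Johan Öbert'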

Upvotes: 93

Rich Remer

Reputation: 2603

The other answers here are either somewhat incorrect or misleading, which might confuse people who don't understand the details of what's actually going on.

If you have a string in Node.js, it is already Unicode internally. Logically, Node.js handles strings as Unicode strings; whether the engine stores them as UTF-8, UTF-16, UTF-32, or whatever just doesn't matter. It does not make sense to "convert" a string to UTF-8.

If you have a Unicode string in Node.js, and you want the bytes that make up that string in a particular encoding, you use:

const string = "Johan Öbert";
const utf8_buffer = Buffer.from(string, "utf8");
const utf16_buffer = Buffer.from(string, "utf16le");

As you can see from that example, the string has no encoding of any kind. But the bytes you would use to make up that string in a particular encoding can be easily calculated using Buffer.from.
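
For instance, the two buffers above don't even have the same length:

console.log(utf8_buffer.length);  // 12: 'Ö' takes two bytes in UTF-8
console.log(utf16_buffer.length); // 22: every code unit takes two bytes in UTF-16LE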

Per the Node.js Buffer documentation, the following encodings are supported: utf8/utf-8, utf16le/utf-16le, latin1, base64, base64url, hex, and the legacy encodings of ascii, binary (alias for latin1), and ucs2/ucs-2 (alias of utf16le).

Regarding the legacy encodings: ascii is misleading as it acts like latin1 for writing and when reading it serves to sanitize junk data; binary has no more relation to binary data than any other encoding and acts like latin1; and ucs2/ucs-2 are imprecise and act like utf16le, rather than actual UCS-2.
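
A minimal sketch of that read-side sanitizing (the byte 0xD6 is 'Ö' in latin1):

console.log(Buffer.from([0xd6]).toString('latin1')); // 'Ö'
console.log(Buffer.from([0xd6]).toString('ascii'));  // 'V', because the high bit is unset before decoding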

There are some edge cases where you might want ascii, but none of the other legacy encodings have any value, as they are just aliases of non-legacy encodings. Using the non-legacy names makes your code clearer and should be preferred.

If you have a string that appears to be encoded wrong, there are a few things to keep in mind.

  1. you must know the original encoding of the data that was improperly decoded
  2. not all bad encodings are recoverable
  3. many default legacy encodings (like windows-1252) are irrecoverable
  4. you almost certainly have an upstream data problem; don't begin by trying to use Node.js to juggle buffers in different encodings.

So, first, if you have junk decodings coming from your DB, you need to fix it in your DB. This might be a flag when creating the table schema to set the encoding of a table or column. This might be a connection setting from the app that writes to the DB. This might be a connection setting from your Node.js app that reads from the DB. Start here and identify where the breakdown is. Make sure the table is storing your data in something useful. Make sure all connections are using/expecting the same encoding.
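
For example, a sketch of setting the connection charset up front with the npm mysql driver (the connection details here are hypothetical):

const mysql = require('mysql');
const connection = mysql.createConnection({
  host: 'localhost',  // hypothetical connection details
  user: 'app',
  database: 'mydb',
  charset: 'utf8mb4', // have the driver negotiate UTF-8 on the wire
});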

Second, if you have a mismatch in your DB connections/table, you might have data that is junk. The general rule is always (with NO exceptions) use UTF-8 everywhere. No other strategy makes any kind of sense for any scenario where UTF-8 is available. When you mis-use a connection encoding or table/column encoding, you often have data loss. If you don't have data loss, you still might have to re-encode all the data in your DB before you can use it effectively.

The only common scenarios I know of that don't support UTF-8 are TEXT columns in MS SQL Server and CHAR/VARCHAR columns in MS SQL Server pre-2019. I think Oracle also has some limitations, only supporting a single encoding for the entire DB which applies to all connections.

Third, if you need to fix junk data (i.e., the problem is something other than the connection encoding used when reading from the DB), you can probably use one of two strategies to fix the incorrect data.

If you have something supported by Buffer, like base64, you are in luck. This is pretty simple:

// note: the encoding in the actual question is not one Buffer supports
const corrected = Buffer.from("Sm9oYW4gw5ZiZXJ0", "base64").toString();

If your bad data is in an encoding not supported by Buffer, but is one of the encodings supported by TextDecoder, you can do something like the following:

// start with badly encoded string
const string = "Johan Ã–bert";
// get UTF-8 bytes that make up this string
const bytes = Buffer.from(string, "utf8");
// re-decode the bytes using the correct decoder
// NOTE: actual bad data is NOT windows-1252 (q.v. #1 above)
const corrected = new TextDecoder("windows-1252").decode(bytes);

Upvotes: 4

Ronnie Smith

Reputation: 18545

TextEncoder (available since Node.js v11) and Node's buffer module both do this.

TextEncoder

const encoder = new TextEncoder();
const bytes = encoder.encode('Johan Öbert');
const decoder = new TextDecoder('utf-8');
console.log(decoder.decode(bytes));

Node.js Buffer

In terms of Node's buffer module, UTF-8 is the default .toString encoding.

When converting between Buffers and strings, a character encoding may be specified. If no character encoding is specified, UTF-8 will be used as the default. (source: Node.js Buffer documentation)

Buffer.from('Johan Öbert').toString();

Note: neither of these actually changes the string "Johan Ã–bert" to "Johan Öbert".

Upvotes: 4

Bitdom8

Reputation: 1452

Just add <?xml version="1.0" encoding="UTF-8"?> at the top of the document and it will be treated as UTF-8. For instance, an RSS feed will handle any character after adding this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
    ...

Also add <meta charset="utf-8" /> to your parent layout or main app.html:

<!DOCTYPE html>
<html lang="en" class="overflowhere">
    <head>
        <meta charset="utf-8" />
    </head>
</html>

Upvotes: -2

Matías Moreno

Reputation: 59

You should set the database connection's charset instead of fighting it inside Node.js:

SET NAMES 'utf8';

(works at least in MySQL and PostgreSQL)

Keep in mind you need to run that for every connection. If you're using a connection pool, do it with an event handler, e.g.:

mysqlPool.on('connection', function (connection) {
  connection.query("SET NAMES 'utf8'")
});
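
A similar sketch for PostgreSQL with the npm pg package (assuming a pg Pool, which emits a connect event for each new client):

const { Pool } = require('pg');
const pool = new Pool(); // connection details omitted
pool.on('connect', (client) => {
  client.query("SET NAMES 'utf8'");
});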

https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration
https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7
https://www.npmjs.com/package/mysql#connection

Upvotes: 2

Jayram

Reputation: 19578

Use the utf8 module from npm to encode/decode the string.

Installation:

npm install utf8

In a browser:

<script src="utf8.js"></script>

In Node.js:

const utf8 = require('utf8');

API:

Encode:

utf8.encode(string)

Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)

// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'

Decode:

utf8.decode(byteString)

Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)

utf8.decode('\xC2\xA9');
// → '\xA9'

utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
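
Putting the two together for the question's string, a small round-trip sketch:

const utf8 = require('utf8');
const encoded = utf8.encode('Johan Öbert'); // byte string: one char per UTF-8 byte
console.log(utf8.decode(encoded));          // 'Johan Öbert'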


Upvotes: 59

Tobias Nickel

Reputation: 492

I had the same problem when I loaded a text file via fs.readFile(). I tried setting the encoding to UTF-8, but it stayed the same. My solution now is this:

myString = JSON.parse( JSON.stringify( myString ) )

After this, an Ã– is really interpreted as an Ö.

Upvotes: 20

paaat

Reputation: 545

When you want to change the encoding you always go from one into another. So you might go from Mac Roman to UTF-8 or from ASCII to UTF-8.

It's as important to know the desired output encoding as the current source encoding. For example, if you have Mac Roman text and you decode it as if it were UTF-16, you'll just make it garbled.

If you want to know more about encoding this article goes into a lot of details:

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

The npm package encoding, which uses node-iconv or iconv-lite, should allow you to easily specify which source and output encoding you want:

// encoding.convert(text, toCharset, fromCharset): here nameString goes from UTF-8 to ASCII
var resultBuffer = encoding.convert(nameString, 'ASCII', 'UTF-8');
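
If you already have the raw bytes, a minimal sketch using iconv-lite directly (rawBuffer is a hypothetical Buffer holding the source bytes, assumed here to be windows-1252):

const iconv = require('iconv-lite');
// decode the source bytes into a proper JS string, then re-encode as needed
const text = iconv.decode(rawBuffer, 'win1252');
const utf8Bytes = iconv.encode(text, 'utf8');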

Upvotes: 7
