Reputation: 9111
From my DB I'm getting the following string:
Johan Öbert
What it should say is:
Johan Öbert
I've tried to convert it into utf-8 like so:
nameString.toString("utf8");
But I still have the same problem.
Any ideas?
Upvotes: 78
Views: 317004
Reputation: 16002
I'd recommend using Buffer:
Buffer.from('someString', '<input-encoding>').toString('utf-8')
This avoids any unnecessary dependencies that other answers require, since Buffer is included with Node.js and is already defined in the global scope.
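For example, if the bytes coming back were UTF-8 but got wrongly decoded as latin1 somewhere along the way, a sketch of the round trip (the input string and encoding here are assumptions, not necessarily the asker's exact case) could look like:
const fixed = Buffer.from('JosÃ©', 'latin1').toString('utf-8');
console.log(fixed); // 'José'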
Upvotes: 93
Reputation: 2603
The other answers here are either somewhat incorrect or misleading, which might confuse people who don't understand the details of what's actually going on.
If you have a string in Node.js, it is already a Unicode string internally. Logically, Node.js handles strings as Unicode strings, and whether the engine stores them as UTF-8, UTF-16, UTF-32, or whatever just doesn't matter. It does not make sense to "convert" a string to UTF-8.
If you have a Unicode string in Node.js, and you want the bytes that make up that string in a particular encoding, you use:
const string = "Johan Öbert";
const utf8_buffer = Buffer.from(string, "utf8");
const utf16_buffer = Buffer.from(string, "utf16le");
As you can see from that example, the string has no encoding of any kind. But the bytes you would use to make up that string in a particular encoding can be easily calculated using Buffer.from.
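To make that difference concrete, here is a small sketch (expected output shown in the comments) of the bytes each encoding produces for the same character:
console.log(Buffer.from("Ö", "utf8"));    // <Buffer c3 96>
console.log(Buffer.from("Ö", "utf16le")); // <Buffer d6 00>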
As you can see here, the following encodings are supported: utf8/utf-8, utf16le/utf-16le, latin1, base64, base64url, hex, and the legacy encodings of ascii, binary (alias for latin1), and ucs2/ucs-2 (alias of utf16le).
Regarding the legacy encodings: ascii is misleading, as it acts like latin1 for writing, and when reading it serves to sanitize junk data; binary has no more relation to binary data than any other encoding and acts like latin1; and ucs2/ucs-2 are imprecise and act like utf16le, rather than actual UCS-2.
There are some edge cases for when you might want ascii, but none of the other legacy encodings have any value, as they are just aliases of non-legacy encodings. Using the non-legacy encoding makes your code clearer, and should be preferred.
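As a quick sketch of the ascii quirk described above (the byte 0xd6 is just an illustration; expected output in the comments), decoding with ascii strips the high bit of each byte before treating it as latin1:
console.log(Buffer.from([0xd6]).toString("latin1")); // 'Ö'
console.log(Buffer.from([0xd6]).toString("ascii"));  // 'V' (0xd6 & 0x7f === 0x56)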
If you have a string that appears to be encoded wrong, there are a few things to keep in mind.
So, first, if you have junk decodings coming from your DB, you need to fix it in your DB. This might be a flag when creating the table schema to set the encoding of a table or column. This might be a connection setting from the app that writes to the DB. This might be a connection setting from your Node.js app that reads from the DB. Start here and identify where the breakdown is. Make sure the table is storing your data in something useful. Make sure all connections are using/expecting the same encoding.
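For instance, with the npm mysql package (an assumption here, since the question doesn't name a driver; the host, user, and database values are placeholders), the connection encoding can be made explicit:
const mysql = require("mysql");
const connection = mysql.createConnection({
  host: "localhost",  // placeholder
  user: "app",        // placeholder
  database: "mydb",   // placeholder
  charset: "utf8mb4", // make the connection encoding explicit
});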
Second, if you have a mismatch in your DB connections/table, you might have data that is junk. The general rule is: always (with NO exceptions) use UTF-8 everywhere. No other strategy makes any kind of sense for any scenario where UTF-8 is available. When you misuse a connection encoding or table/column encoding, you often have data loss. If you don't have data loss, you still might have to re-encode all the data in your DB before you can use it effectively.
The only common scenarios I know of that don't support UTF-8 are TEXT columns in MS SQL Server and CHAR/VARCHAR columns in MS SQL Server pre-2019. I think Oracle also has some limitations, only supporting a single encoding for the entire DB which applies to all connections.
Third, if you need to fix junk data (i.e., the problem is something other than the connection encoding used when reading from the DB), you can probably use one of two strategies to fix the incorrect data.
If you have something supported by Buffer, like base64, you are in luck. This is pretty simple:
// note: the actual question's junk data is not base64, and can't be fixed by Buffer alone
const corrected = Buffer.from("Sm9oYW4gw5ZiZXJ0", "base64").toString();
If your bad data is in an encoding not supported by Buffer, but is one of the encodings supported by TextDecoder, you can do something like the following:
// start with a string that came out of a bad decode (illustrative)
const string = "Johan Öbert";
// recover the raw bytes; latin1 maps code points 0-255 back to bytes one-to-one
const bytes = Buffer.from(string, "latin1");
// re-decode the bytes using the correct decoder
// NOTE: the actual bad data is NOT windows-1252 (q.v. #1 above)
const corrected = new TextDecoder("windows-1252").decode(bytes);
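For the question's actual junk data (UTF-8 bytes that were decoded as windows-1252), Buffer cannot re-encode to windows-1252, but a library such as iconv-lite can. A sketch under that assumption (iconv-lite is not part of the answer above):
const iconv = require("iconv-lite");
const bad = "Johan Öbert";
// re-encode with the encoding that was wrongly used to decode,
// then decode the recovered bytes as UTF-8
const fixed = iconv.decode(iconv.encode(bad, "windows-1252"), "utf8");
console.log(fixed); // "Johan Öbert"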
Upvotes: 4
Reputation: 18545
TextEncoder (available since Node.js v11) and Node's buffer module both do this.
const encoder = new TextEncoder();
const bytes = encoder.encode('Johan Öbert');
const decoder = new TextDecoder('utf-8');
console.log(decoder.decode(bytes));
Buffer
In terms of Node's buffer module, UTF-8 is the default .toString encoding.
When converting between Buffers and strings, a character encoding may be specified. If no character encoding is specified, UTF-8 will be used as the default. source
Buffer.from('Johan Öbert').toString();
Note: neither of these actually changes the string "Johan Öbert" to "Johan Öbert".
Upvotes: 4
Reputation: 1452
Just add <?xml version="1.0" encoding="UTF-8"?> and the output will be encoded as UTF-8. For instance, an RSS feed will handle any character once this declaration is added:
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">....
Also add <meta charset="utf-8" /> to your parent layout or main app.html:
<!DOCTYPE html>
<html lang="en" class="overflowhere">
<head>
<meta charset="utf-8" />
</head>
</html>
Upvotes: -2
Reputation: 59
You should be setting the database connection's charset instead of fighting it inside Node.js:
SET NAMES 'utf8';
(works at least in MySQL and PostgreSQL)
Keep in mind you need to run that for every connection. If you're using a connection pool, do it with an event handler, e.g.:
mysqlPool.on('connection', function (connection) {
  connection.query("SET NAMES 'utf8'");
});
https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration
https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7
https://www.npmjs.com/package/mysql#connection
Upvotes: 2
Reputation: 19578
Use the utf8 module from npm to encode/decode the string.
Installation:
npm install utf8
In a browser:
<script src="utf8.js"></script>
In Node.js:
const utf8 = require('utf8');
API:
Encode:
utf8.encode(string)
Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)
// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'
Decode:
utf8.decode(byteString)
Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)
utf8.decode('\xC2\xA9');
// → '\xA9'
utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
Upvotes: 59
Reputation: 492
I had the same problem when I loaded a text file via fs.readFile(). I tried to set the encoding to UTF-8, but it stayed the same. My solution now is this:
myString = JSON.parse( JSON.stringify( myString ) )
After this, an Ö is really interpreted as an Ö.
Upvotes: 20
Reputation: 545
When you want to change the encoding you always go from one into another. So you might go from Mac Roman to UTF-8, or from ASCII to UTF-8.
It's as important to know the desired output encoding as the current source encoding. For example, if you have Mac Roman and you decode it from UTF-16 to UTF-8, you'll just make it garbled.
If you want to know more about encoding, this article goes into a lot of detail:
The npm package encoding, which uses node-iconv or iconv-lite, should allow you to easily specify which source and output encoding you want:
var resultBuffer = encoding.convert(nameString, 'UTF-8', 'ASCII'); // signature: convert(text, toCharset, fromCharset)
Upvotes: 7