Christopher

Reputation: 534

Loading EUC-JP and other Japanese text encodings in Node.js

I'm trying to scrape some Japanese websites for a personal project. Sites that serve UTF-8 text work perfectly fine, as you'd expect, but I can't get any readable text out of sites that specify other encodings, specifically EUC-JP. Node also seems to interpret and modify the text rather than pass it through raw: I've tried setting the response encoding to both ascii and binary and then switching my terminal application to EUC-JP, but after a console.log(), neither results in the actual text.

I've had a scan through the Node documentation, and it seems to support only two main text encodings (apart from binary and base64).

I'm using the inbuilt http client, and specifying the encoding through the response.setEncoding method, e.g. response.setEncoding('utf8');

How are other people working with international text in Node, especially when the original data is not in UTF-8? Are binary buffers the only way?

While I've done a bit of research, I'm not hugely knowledgeable when it comes to character encoding, so simple answers would be appreciated. Thanks!

Upvotes: 2

Views: 3038

Answers (1)

一二三

Reputation: 21249

There is a module that adds iconv bindings to Node.js. If you grab the response as a binary Buffer (that is, without calling response.setEncoding), you can use Iconv.convert to convert it from EUC-JP to UTF-8 (take a look at the module's README for an example).
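As a rough illustration of the same idea (collect raw bytes first, convert afterwards), here is a minimal sketch. Note that it swaps in Node's built-in TextDecoder rather than the iconv module mentioned above, and it assumes a Node build that ships full ICU data; on a build without it, requesting 'euc-jp' throws a RangeError. The helper name decodeBody is hypothetical and only there for the example.

```javascript
// Sketch: decode raw response bytes with Node's built-in TextDecoder.
// Assumption: this Node build includes full ICU data, so 'euc-jp' is available.

// Hypothetical helper: joins raw chunks and decodes them in one pass.
function decodeBody(chunks, encoding) {
  // Concatenate first so a multi-byte character split across two chunks
  // is decoded as a whole instead of being mangled at the boundary.
  const raw = Buffer.concat(chunks);
  return new TextDecoder(encoding).decode(raw);
}

// The bytes of "日本語" in EUC-JP, split mid-character to mimic network chunking.
const chunks = [
  Buffer.from([0xc6, 0xfc, 0xcb]),
  Buffer.from([0xdc, 0xb8, 0xec]),
];

const text = decodeBody(chunks, 'euc-jp');
console.log(text);
```

With the inbuilt http client, the chunks would come from the response's 'data' events: leave response.setEncoding uncalled so each chunk arrives as a Buffer, push the chunks into an array, and call the helper from the 'end' handler. The result is an ordinary JavaScript string, so it prints correctly on a UTF-8 terminal.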

Upvotes: 2
