Julien Fouilhé

Reputation: 2658

How to convert a UTF16 file into a UTF8 file in nodejs

I have an XML file encoded in UTF-16, and I would like to convert it to UTF-8 in order to process it. If I use this command:

iconv -f UTF-16 -t UTF-8 file.xml > converted_file.xml

The file is converted correctly and I'm able to process it. I want to do the same in Node.js.

Currently I have a buffer of my file. I've tried everything I could think of and everything I could find on the internet, without success.

Here are some examples of what I've tried so far:

content = new Buffer((new Buffer(content, 'ucs2')).toString('utf8'));

I've also tried using these functions:

http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
https://stackoverflow.com/a/14601808/1405208

The first one doesn't change anything, and the functions from the links only give me Chinese characters.

Upvotes: 3

Views: 6856

Answers (2)

NatanS

Reputation: 86

While the answer above is the best answer to the question asked, I'm hoping this answer will help people who need to read a file as a binary string:

const reader = new FileReader();
reader.readAsBinaryString(this.fileToImport);

In my case the file was in utf-16 and I tried to read it into XLSX:

const wb = XLSX.read(bstr, { type: "binary" });

Combining both links from the question, I first removed the first two characters, which signal that the file is UTF-16 (the byte order mark, bytes 0xFF 0xFE), then used this link to build the right number from each pair of bytes (though I think it actually produces UTF-7 encoding): https://stackoverflow.com/a/14601808/1405208

Lastly, I applied the second link to get the right set of UTF-8 numbers: https://stackoverflow.com/a/14601808/1405208

The code I ended up with:

decodeUTF16LE(binaryStr) {
      if (binaryStr.charCodeAt(0) != 255 || binaryStr.charCodeAt(1) != 254) {
        return binaryStr;
      }
      const utf8 = [];
      for (var i = 2; i < binaryStr.length; i += 2) {
        let charcode = binaryStr.charCodeAt(i) | (binaryStr.charCodeAt(i + 1) << 8);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
          utf8.push(0xc0 | (charcode >> 6), 0x80 | (charcode & 0x3f));
        } else if (charcode < 0xd800 || charcode >= 0xe000) {
          utf8.push(0xe0 | (charcode >> 12), 0x80 | ((charcode >> 6) & 0x3f), 0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
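          // the low surrogate is the next 16-bit unit (two more bytes)
          i += 2;
          const low = binaryStr.charCodeAt(i) | (binaryStr.charCodeAt(i + 1) << 8);
          // UTF-16 encodes 0x10000-0x10FFFF by
          // subtracting 0x10000 and splitting the
          // 20 bits of 0x0-0xFFFFF into two halves
          charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (low & 0x3ff));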
          utf8.push(
            0xf0 | (charcode >> 18),
            0x80 | ((charcode >> 12) & 0x3f),
            0x80 | ((charcode >> 6) & 0x3f),
            0x80 | (charcode & 0x3f)
          );
        }
      }
      return String.fromCharCode.apply(String, utf8);
},
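For comparison, the same conversion can probably be done with the built-in TextDecoder and TextEncoder instead of hand-written bit shifting. This is only a sketch: `utf16leBinaryToUtf8Binary` is a hypothetical name, and it assumes the input is a binary string (one character per byte) holding little-endian UTF-16, with or without a byte order mark.

```javascript
// Hypothetical helper: converts a UTF-16LE binary string (one char per byte)
// into a binary string of UTF-8 bytes.
function utf16leBinaryToUtf8Binary(binaryStr) {
  // Copy each char code (0-255) into a byte array
  const bytes = new Uint8Array(binaryStr.length);
  for (let i = 0; i < binaryStr.length; i++) {
    bytes[i] = binaryStr.charCodeAt(i) & 0xff;
  }
  // Decode the bytes as UTF-16LE; a leading byte order mark is dropped by default
  const text = new TextDecoder('utf-16le').decode(bytes);
  // Re-encode the text as UTF-8 bytes
  const utf8 = new TextEncoder().encode(text);
  // Build the result one byte at a time to avoid passing a huge argument list
  let out = '';
  for (let i = 0; i < utf8.length; i++) {
    out += String.fromCharCode(utf8[i]);
  }
  return out;
}
```

Unlike the function above, this sketch does not return the input unchanged when the byte order mark is missing; it always treats the input as UTF-16LE.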

Upvotes: 3

Arnaud Gueras

Reputation: 2062

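var fs = require('fs');
// 'ucs2' is Node's alias for 'utf16le'; note that this overwrites the original file
var content = fs.readFileSync('myfile.xml', {encoding:'ucs2'});
fs.writeFileSync('myfile.xml', content, {encoding:'utf8'});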
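
Since the question starts from a buffer rather than a file path, the same idea can be sketched directly on a Buffer. This is a minimal sketch, not a definitive implementation: `utf16leBufferToUtf8` is a hypothetical helper name, and it assumes the data is UTF-16 with an optional byte order mark, treated as little-endian when no mark is present.

```javascript
// Hypothetical helper: takes a Buffer of UTF-16 bytes, returns a Buffer of UTF-8 bytes
function utf16leBufferToUtf8(buf) {
  // A big-endian byte order mark (FE FF) means each byte pair must be swapped first.
  // Buffer.from(buf) copies, so the caller's buffer is left untouched.
  // swap16() throws if the length is not a multiple of 2.
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    buf = Buffer.from(buf).swap16();
  }
  // Decode the little-endian UTF-16 bytes into a JavaScript string
  let text = buf.toString('utf16le');
  // Drop the leading byte order mark (U+FEFF) if there is one
  if (text.charCodeAt(0) === 0xfeff) {
    text = text.slice(1);
  }
  // Encode the string as UTF-8
  return Buffer.from(text, 'utf8');
}
```

With the buffer from the question, something like fs.writeFileSync('converted_file.xml', utf16leBufferToUtf8(content)) would then write the UTF-8 bytes. If the XML declaration still says encoding="UTF-16", a parser may need that attribute updated as well.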

Upvotes: 5
