d512

Reputation: 34083

Strange unicode characters when reading in file in node.js app

I am attempting to write a node app that reads in a set of files, splits them into lines, and puts the lines into an array. Pretty simple. It works on quite a few files except some SQL files that I am working with. For some reason I seem to be getting some kind of unicode output when I split the lines up. The app looks something like this:

var fs = require("fs");
var data = fs.readFileSync("test.sql", "utf8");
console.log(data);
var lines = data.split("\n");
console.log(lines);

The input file looks something like this:

use whatever
go

The output looks like this:

��use whatever
go

[ '��u\u0000s\u0000e\u0000 \u0000w\u0000h\u0000a\u0000t\u0000e\u0000v\u0000e\u0000r\u0000',
  '\u0000g\u0000o\u0000',
  '\u0000' ]

As you can see there is some kind of unrecognized character at the beginning of the file. After reading the data in and directly outputting it, it looks okay except for this character. However, if I then attempt to split it up into lines, I get all these unicode-like characters. Basically it's all the actual characters with "\u0000" at the beginning of each one.

I have no idea what's going on here but it appears to have something to do with the characters in the file itself. If I copy and paste the text of the file into another new file and run the app on the new file, it works fine. I assume that whatever is causing this issue is being stripped out during the copy and paste process.
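The garbled output can be reproduced and diagnosed without the original file. A minimal sketch: the buffer below simulates the SQL file's raw bytes on the assumption that `test.sql` starts with the bytes `ff fe` (a UTF-16LE byte order mark), which the two leading replacement characters and the interleaved `\u0000`s strongly suggest:

```javascript
// Simulate the file contents: "use whatever\ngo\n" in UTF-16LE, preceded by a BOM.
const raw = Buffer.concat([
  Buffer.from([0xff, 0xfe]),                    // UTF-16LE byte order mark
  Buffer.from("use whatever\ngo\n", "utf16le"),
]);

// Decoding those bytes as UTF-8 reproduces the garbage from the question:
// the BOM becomes two U+FFFD replacement characters, and the high byte of
// every UTF-16 code unit shows up as "\u0000".
console.log(JSON.stringify(raw.toString("utf8")));

// Inspecting the first two raw bytes reveals the real encoding:
console.log(raw[0].toString(16), raw[1].toString(16)); // "ff fe" => UTF-16LE
```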

Upvotes: 16

Views: 31124

Answers (6)

Alexander77

Reputation: 132

Here is a simple function to read a Unicode text file by sniffing its BOM:

function readFile(filename)
{
  const fs = require("fs");
  let body = "";
  const buffer = fs.readFileSync(filename);
  const ch1 = buffer[0];
  const ch2 = buffer[1];
  if (ch1 == 0xff && ch2 == 0xfe)
  {
    // UTF-16 little-endian BOM
    body = buffer.toString('utf16le');
  }
  else if (ch1 == 0xfe && ch2 == 0xff)
  {
    // UTF-16 big-endian BOM. Node has no 'utf16be' decoder ('ucs2' is
    // just an alias for 'utf16le'), so swap each byte pair first.
    body = buffer.swap16().toString('utf16le');
  }
  else
  {
    const ch3 = buffer[2];
    if (ch1 == 0xef && ch2 == 0xbb && ch3 == 0xbf)
    {
      // UTF-8 BOM
      body = buffer.toString('utf8');
    }
    else
    {
      // No BOM: fall back to ASCII
      body = buffer.toString('ascii');
    }
  }

  // A decoded BOM survives as U+FEFF at position 0; strip it.
  if (body.charCodeAt(0) == 0xfeff)
  {
    body = body.slice(1);
  }

  return body;
}

Upvotes: 0

Deepak Yadav

Reputation: 1772

var fs = require('fs');
var data = fs.readFileSync('filename.xml', 'utf8');
data = data.replaceAll('\ufffd', '');

This worked for me on Node v19.9.0.

Upvotes: 0

Chong Lip Phang

Reputation: 9279

I did the following in Windows command prompt to convert the endianness:

type file.txt > file2.txt

Upvotes: 1

Vikas

Reputation: 24322

Use iconv-lite, the pure-JavaScript lite version of node-iconv:

var fs = require('fs');
var iconv = require('iconv-lite');
var result = "";
var stream = fs.createReadStream(sourcefile)
    .on("error",function(err){
        //handle error
    })
    .pipe(iconv.decodeStream('win1251'))
    .on("error",function(err){
        //handle error
    })
    .on("data",function(data){
        result += data;
    })
    .on("end",function(){
       //use result
    });

Upvotes: 0

Esailija

Reputation: 140210

Your file is in UTF-16 Little Endian, not UTF-8.

var data = fs.readFileSync("test.sql", "utf16le"); //Not sure if this eats the BOM

Unfortunately, node.js natively supports only UTF-16 Little Endian / UTF-16LE (I can't be sure from reading the docs which; there is a slight difference between them, namely that UTF-16LE does not use a BOM), so for anything else you have to use iconv or convert the file to UTF-8 some other way.

Example:

var Iconv  = require('iconv').Iconv,
    fs = require("fs");

var buffer = fs.readFileSync("test.sql"),
    iconv = new Iconv( "UTF-16", "UTF-8");

var result = iconv.convert(buffer).toString("utf8");

Upvotes: 29

Halcyon

Reputation: 57709

Is this perhaps the BOM (Byte Order Mark)? Make sure you save your files without a BOM, or include code to strip it.

The BOM is usually invisible in text editors.

I know Notepad++ has a feature where you can easily strip a BOM from a file. Encoding > Encode in UTF-8 without BOM.
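
If re-saving the file isn't an option, the BOM can also be stripped in code after decoding. A minimal sketch (the helper name `stripBom` is mine):

```javascript
// A decoded BOM shows up as the single character U+FEFF at position 0.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripBom("\ufeffuse whatever")); // "use whatever"
```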

Upvotes: 2
