Mike Marshall

Reputation: 7850

Parse UTF-8 XML in javascript

I'm trying to load and parse a simple UTF-8-encoded XML file in JavaScript using Node.js with the xpath and xmldom packages. No XML namespaces are used, and the same XML parses correctly when converted to ASCII. In the VS Code debugger I can see that the loaded string has what look like spaces between every character (surely because the UTF-8 file is being loaded incorrectly), but I can't find a way to load and parse the file properly.

Code:

var xpath = require('xpath')
  , dom = require('xmldom').DOMParser;

const fs = require('fs');

var myXml = "path_to_my_file.xml";

var xmlContents = fs.readFileSync(myXml, 'utf8').toString();

// this line causes errors parsing every single tag as the tag names have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);

console.log(cvNode.textContent);

The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.

Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.

Upvotes: 1

Views: 3035

Answers (1)

NineBerry

Reputation: 28499

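When you see what looks like an extra space between every letter, it suggests that the file isn't actually encoded as UTF-8 but in a 16-bit Unicode encoding such as UTF-16. In UTF-16, each ASCII character takes two bytes, one of which is zero; decoded as UTF-8, those zero bytes become NUL characters, which a debugger may display as spaces.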
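The effect can be reproduced without any file. This is a small sketch using only Node's built-in Buffer; the tag name is borrowed from the question and the variable names are made up for illustration.

```javascript
// Encode a short tag as UTF-16LE, then (wrongly) decode the same bytes as UTF-8.
const utf16Bytes = Buffer.from('<MyTag>', 'utf16le');
const misdecoded = utf16Bytes.toString('utf8');

// Two bytes per character in UTF-16LE, so 7 characters occupy 14 bytes.
console.log(utf16Bytes.length);

// JSON.stringify makes the NUL characters between the letters visible as escapes.
console.log(JSON.stringify(misdecoded));

// Decoding with the matching encoding recovers the original text.
console.log(utf16Bytes.toString('utf16le'));
```

If the string in the debugger looks like the second line of output, the file is very likely UTF-16 rather than UTF-8.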

Try passing 'utf16le' instead of 'utf8' as the encoding argument to fs.readFileSync.

For a list of supported encodings, see "Buffers and Character Encodings" in the Node.js Buffer documentation.
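If you don't know in advance which encoding a file uses, one option is to read it as a raw Buffer and look at the byte order mark (BOM) before decoding. The following is only a sketch: decodeXml is a hypothetical helper, not part of Node or xmldom, and it assumes the file starts with a BOM when it is UTF-16. Files without a BOM fall back to UTF-8.

```javascript
// Hypothetical helper: choose a decoding based on the byte order mark (BOM).
function decodeXml(buf) {
  // FF FE: UTF-16 little-endian. Skip the 2-byte BOM while decoding.
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    return buf.toString('utf16le', 2);
  }
  // FE FF: UTF-16 big-endian. Node has no built-in 'utf16be' decoder, so
  // swap each byte pair on a copy and decode the result as little-endian.
  // (swap16 throws if the remaining length is not a multiple of 2.)
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    return Buffer.from(buf.subarray(2)).swap16().toString('utf16le');
  }
  // EF BB BF: UTF-8 with a BOM. Skip the 3-byte BOM while decoding.
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return buf.toString('utf8', 3);
  }
  // No BOM found: assume UTF-8 (an assumption, not a detection).
  return buf.toString('utf8');
}

// Usage with the code from the question (omit the encoding so that
// readFileSync returns a Buffer instead of a string):
//   const xmlContents = decodeXml(fs.readFileSync(myXml));
//   var doc = new dom().parseFromString(xmlContents, 'application/xml');
```

Stripping the BOM before parsing also avoids handing the parser a stray U+FEFF character at the start of the document.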

Upvotes: 1
