Den

Reputation: 16826

Using XPath in node.js

I am building a small document parser in node.js. For testing, I have a raw HTML file; when the application actually runs, this file is normally downloaded from the real website.

I want to extract the first code example from each section of the Console.WriteLine documentation that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which are in this Gist.

In my node.js application, I am using xmldom and xpath to try to extract that exact same information:

var xpath = require('xpath');
var dom = require('xmldom').DOMParser;

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup,doc);

This does not return anything, however.

What might be going on here?

Upvotes: 4

Views: 12203

Answers (2)

Hieu Van

Reputation: 57

There is a library, xpath-html, that lets you query HTML with XPath using minimal effort and only a few lines of code.

const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");

const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");

console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);

Upvotes: 3

Daniel Haley

Reputation: 52888

This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).

Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
var sampleNodes = select(exampleLookup, doc); // use the namespace-aware select returned by useNamespaces

Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.

Example...

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
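Since the local-name() form is long and easy to mistype, here is a minimal sketch of building it from its steps. Note that localNameStep is a hypothetical helper written for this illustration, not something provided by the xpath package; it only assembles the expression string, which you would then pass to xpath.select as before.

```javascript
// Hypothetical helper (NOT part of the xpath package): turns an element name
// into a namespace-agnostic XPath step, with an optional trailing predicate.
function localNameStep(name, predicate = "") {
  return `*[local-name()='${name}']${predicate}`;
}

// The id from the question.
const id = "System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_";

// Assemble the same expression as above, one location step per array entry.
const exampleLookup = [
  `//*[@id='${id}']`,
  `parent::${localNameStep("div")}`,
  `following-sibling::${localNameStep("div")}`,
  localNameStep("pre", "[position()>1]"),
  localNameStep("code", "[contains(@class,'lang-csharp')]"),
].join("/");

console.log(exampleLookup);
```

This only makes the expression easier to maintain; binding the namespace to a prefix is still the approach I'd prefer.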

Upvotes: 5
