Brissy

Reputation: 359

Headless Chrome (Puppeteer) - how to get access to the document node element?

I've been using PhantomJS to parse page content and extract information from it (the maximum image size on a page, for example), and I've decided to move to Puppeteer. I've run into a problem: my functions that ran under PhantomJS worked directly with the document node element. As far as I understand, in Puppeteer it's impossible to return a node element from page.evaluate and similar functions. Is there a way around this, or do I have to use another library? Thank you!

Upvotes: 12

Views: 14584

Answers (2)

sjjhsjjh

Reputation: 333

The answer from Grant Miller discusses some methods and gives documentation links, but doesn't have any code. Here is some demonstration code that shows:

  • Getting an ElementHandle for the document body by calling the page.$ method.
  • Adding a class to the body by calling its classList.add method in the context of a page.evaluate method call.
  • Printing a PDF of the example.com web page with an orange background, to show that the class was added.

Code:

const parameters = {
    "launchParameters": { "args": [] },
    "gotoURI": "https://example.com",
    "marginSpecification": {"top": "0", "right": "0", "bottom": "0", "left": "0"},
    "pdfPath": "example.pdf",
    "styleTag":
        'body.orangey, body.orangey div {background-color: orange;}',
    "addBodyClass": "orangey",
    "footerTemplate": "<div></div>",
    "headerTemplate": "<div></div>",
};
console.log("Node version: " + process.version);

const puppeteer = require("puppeteer");

(async () => {
    console.log("await puppeteer.launch");
    const browser = await puppeteer.launch(parameters.launchParameters);

    console.log("await browser.newPage");
    const page = await browser.newPage();

    console.log("await page.goto");
    await page.goto(parameters.gotoURI, {waitUntil: 'networkidle2'});

    console.log("await page.addStyleTag");
    await page.addStyleTag({
        "content": parameters.styleTag
    });

    if (!!parameters.addBodyClass) {
        console.log("await page dollar.");
        const bodyHandle = await page.$('body');
        console.log("Body handle", (!!bodyHandle) ? "OK." : "no.");
        console.log(`await add class "${parameters.addBodyClass}"`);
        await page.evaluate(
            (body, addBodyClass) => body.classList.add(addBodyClass),
            bodyHandle, parameters.addBodyClass)
        .catch(error => console.log(error));
        console.log("await body handle dispose.");
        await bodyHandle.dispose();
    }

    const pdfOptions = {
        path: parameters.pdfPath,
        format: 'A4',
        margin: parameters.marginSpecification,
        displayHeaderFooter: true,
        printBackground: true,
        footerTemplate: parameters.footerTemplate,
        headerTemplate: parameters.headerTemplate
    };

    console.log("await page.pdf");
    await page.pdf(pdfOptions);

    console.log("await browser.close");
    await browser.close();

})();

Reference documentation for classList can be found here, for example: https://developer.mozilla.org/en-US/docs/Web/API/Element/classList

Upvotes: 4

Grant Miller

Reputation: 29037

There are two environments to consider when using Puppeteer:

  1. Node.js Environment
  2. Page DOM Environment

The Node.js environment is built upon Google's Chrome V8 JavaScript engine.

Chrome V8 describes its relation to the DOM:

JavaScript is most commonly used for client-side scripting in a browser, being used to manipulate Document Object Model (DOM) objects for example. The DOM is not, however, typically provided by the JavaScript engine but instead by a browser. The same is true of V8—Google Chrome provides the DOM. V8 does however provide all the data types, operators, objects and functions specified in the ECMA standard.

In other words, the DOM is not provided by default to Node.js.

This means that Node.js does not have the capability to interpret DOM elements on its own.

This is where Puppeteer comes in.

The Puppeteer function page.evaluate() allows you to evaluate an expression in the current Page DOM context using Chrome or Chromium.

The Puppeteer documentation describes what happens when you attempt to return a non-serializable value, like a DOM element:

If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined.

Again, this is because Node.js does not know how to interpret DOM elements without help.
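
One way around this, sketched here under some assumptions, is to do the DOM work inside page.evaluate() and return only plain data. The helper name findLargestImage is hypothetical, and `page` is assumed to be a Puppeteer Page that has already navigated to the target URL. "Largest" is taken to mean the greatest natural width times height, which may not match what the original PhantomJS code measured.

```javascript
// Hypothetical helper, not part of Puppeteer. `page` is assumed to be a
// Puppeteer Page that has already loaded the document of interest.
async function findLargestImage(page) {
    // The callback runs inside the page, where `document` is the real DOM.
    // It returns a plain object, which can be serialized back to Node.js.
    return page.evaluate(() => {
        let largest = null;
        for (const img of Array.from(document.images)) {
            const area = img.naturalWidth * img.naturalHeight;
            if (largest === null || area > largest.area) {
                largest = {
                    src: img.src,
                    width: img.naturalWidth,
                    height: img.naturalHeight,
                    area: area
                };
            }
        }
        return largest; // null when the page has no images
    });
}
```

The element itself never leaves the page; only the numbers and the URL string do, so nothing non-serializable has to cross into Node.js.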

As a result, Puppeteer has implemented an ElementHandle class which represents an in-page DOM element.

You can use elementHandle.$(), elementHandle.$$(), or elementHandle.$x() to return ElementHandles back to Node.js.

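An ElementHandle is not a serialized copy of the element. It is a reference that Node.js holds to an element that still lives in the page, which is why it can be passed back into page.evaluate() as an argument and used there as the real element.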
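
As a rough illustration of working with ElementHandles, the sketch below collects handles with page.$$() and passes each one back into page.evaluate() to read a property. The helper name listImageSources and the 'img' selector are assumptions made for the example, and `page` is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper, not part of Puppeteer. `page` is assumed to be a
// Puppeteer Page that has already loaded the document of interest.
async function listImageSources(page) {
    // page.$$ resolves to an array of ElementHandles (empty if nothing matches).
    const imageHandles = await page.$$('img');
    const sources = [];
    for (const handle of imageHandles) {
        // Inside the callback, the handle argument is the real <img> element.
        sources.push(await page.evaluate(img => img.src, handle));
        // Release the handle once it is no longer needed.
        await handle.dispose();
    }
    return sources;
}
```

The handles stay on the Node.js side only as references; each property read still happens in the page.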

Therefore, if you need to manipulate an element directly, you can do so inside page.evaluate(). If you need to access a representation of an element, use page.$() or one of its related functions.
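
For the document node specifically, one possible approach is page.evaluateHandle(), which returns a handle to an in-page object instead of trying to serialize it. The sketch below assumes `page` is a Puppeteer Page; the helper name readDocumentInfo and the fields it returns are made up for illustration.

```javascript
// Hypothetical helper, not part of Puppeteer. `page` is assumed to be a
// Puppeteer Page that has already loaded the document of interest.
async function readDocumentInfo(page) {
    // evaluateHandle resolves to a handle that points at the page's document.
    const documentHandle = await page.evaluateHandle(() => document);
    try {
        // Passed back into the page, the handle becomes the real document.
        return await page.evaluate(
            doc => ({ title: doc.title, imageCount: doc.images.length }),
            documentHandle
        );
    } finally {
        // Dispose of the handle even if the evaluation throws.
        await documentHandle.dispose();
    }
}
```

Existing functions that expect a document argument could be called this way, as long as they run inside the page and return serializable results.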

Upvotes: 9
