robots.txt

Reputation: 149

Can't extract next page link using xpath within puppeteer

I'm trying to scrape the next page link from a webpage using xpath within puppeteer. When I run the script, it produces a gibberish result even though the xpath is correct. How can I fix it?

const puppeteer = require("puppeteer");
const base = "https://www.timesbusinessdirectory.com";
let url = "https://www.timesbusinessdirectory.com/company-listings";

(async () => {
    const browser = await puppeteer.launch({headless:false});
    const [page] = await browser.pages();
    await page.goto(url,{waitUntil: 'networkidle2'});
    page.waitForSelector(".company-listing");
    const nextPageLink = await page.$x("//a[@aria-label='Next'][./span[@aria-hidden='true'][contains(.,'Next')]]", item => item.getAttribute("href"));
    url = base.concat(nextPageLink);
    console.log("========================>",url)
    await browser.close();
})();

Current output:

https://www.timesbusinessdirectory.comJSHandle@node

Expected output:

https://www.timesbusinessdirectory.com/company-listings?page=2

Upvotes: 1

Views: 504

Answers (1)

ggorlen

Reputation: 57394

First of all, there's a missing await on page.waitForSelector(".company-listing");. Not awaiting this defeats the point of the call entirely, but it could be that it incidentally works since the very strict waitUntil: "networkidle2" covers the selector you're interested in anyway, or the xpath is statically present (I didn't bother to check).
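
To see why the missing await matters, here is a minimal sketch in plain Node with no browser involved. fakeWaitForSelector is a hypothetical stand-in for page.waitForSelector that resolves after a short timer; only the ordering of events is the point.

```javascript
// Hypothetical stand-in for page.waitForSelector: resolves after a short delay.
const fakeWaitForSelector = (log) =>
  new Promise((resolve) =>
    setTimeout(() => {
      log.push("selector ready");
      resolve();
    }, 20)
  );

// The promise is created but never awaited, so execution continues immediately.
async function withoutAwait() {
  const log = [];
  fakeWaitForSelector(log);
  log.push("query ran");
  return [...log]; // snapshot at the moment the query would run
}

// Awaiting pauses this function until the wait resolves.
async function withAwait() {
  const log = [];
  await fakeWaitForSelector(log);
  log.push("query ran");
  return [...log];
}

(async () => {
  console.log(await withoutAwait()); // [ 'query ran' ]
  console.log(await withAwait()); // [ 'selector ready', 'query ran' ]
})();
```

Without the await, the query runs before the wait has finished, which is the situation in the original script.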

Generally speaking, if you're using waitForSelector right after a page.goto, waitUntil: "networkidle2" only slows you down. Only keep it if there's some content you need on the page other than the waitForSelector target, otherwise you're waiting for irrelevant requests that are pulling down images, scripts and data potentially unrelated to your primary target. If it's a slow-loading page, then increasing the timeout on your waitFor... is the typical next step.
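
As a rough sketch of that advice, the helper below navigates without networkidle2 and puts the waiting budget on the selector instead. The name gotoAndWait and the timeoutMs parameter are made up for illustration, and "domcontentloaded" is just one reasonable choice of waitUntil; leaving the option off entirely also works.

```javascript
// Sketch: navigate, then wait only for the element we actually need.
// `page` is a Puppeteer Page; `gotoAndWait` and `timeoutMs` are
// hypothetical names used for illustration.
async function gotoAndWait(page, url, selector, timeoutMs = 60000) {
  // "domcontentloaded" resolves earlier than "networkidle2".
  await page.goto(url, { waitUntil: "domcontentloaded" });
  // For slow pages, raise this timeout rather than waiting for network idle.
  return page.waitForSelector(selector, { timeout: timeoutMs });
}
```

It would be called as await gotoAndWait(page, url, ".company-listing") inside the async function that owns the page.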

Another note: it's a bit odd to waitForSelector on some CSS target and then immediately select by xpath. It seems more precise to waitForXPath and then call $x with the exact same xpath pattern, so the element you wait for is the element you select.

Next, let's look at the docs for page.$x:

page.$x(expression)

expression <string> Expression to evaluate.

returns: <Promise<Array<ElementHandle>>>

The method evaluates the XPath expression relative to the page document as its context node. If there are no such elements, the method resolves to an empty array.

Shortcut for page.mainFrame().$x(expression)

So, unlike evaluate, $eval and $$eval, $x takes 1 parameter and resolves to an elementHandle array. Your second parameter callback doesn't get you the href like you think -- this only works on eval-family functions.

In addition to consulting the docs, you can also console.log the returned value to confirm the behavior. The JSHandle@node you're seeing in the URL isn't gibberish, it's the stringified form of the JSHandle object and provides information you can cross-check against the docs.
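
The output in the question can be reproduced without a browser. In this sketch FakeHandle is a hypothetical stand-in whose only job is to stringify the way the handle in the question's output does; the rest is ordinary string coercion.

```javascript
const base = "https://www.timesbusinessdirectory.com";

// Hypothetical stand-in for an element handle; only toString() matters here.
class FakeHandle {
  toString() {
    return "JSHandle@node";
  }
}

// page.$x resolves to an array of handles, here with a single match.
const nextPageLink = [new FakeHandle()];

// concat() coerces the array to a string, which joins the string forms of
// its elements with commas; with one element, no comma appears.
const url = base.concat(nextPageLink);
console.log(url); // https://www.timesbusinessdirectory.comJSHandle@node
```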

The solution is to grab the first elementHandle from the array returned by the function and then evaluate on that handle using your original callback:

const puppeteer = require("puppeteer");

const url = "https://www.timesbusinessdirectory.com/company-listings";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.goto(url);
  const xp = `//a[@aria-label='Next']
    [./span[@aria-hidden='true'][contains(.,'Next')]]`;
  await page.waitForXPath(xp);
  const [nextPageLink] = await page.$x(xp);
  const href = await nextPageLink.evaluate(el => el.getAttribute("href"));
  console.log(href); // => /company-listings?page=2
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
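
One variation worth considering when the xpath might not match (for example, on a last page with no Next link): $x resolves to an empty array, so the destructured handle is undefined. The helper below is a sketch of a guard for that case; hrefFromXPath is a hypothetical name, not part of Puppeteer.

```javascript
// Sketch: resolve to the href attribute of the first XPath match,
// or null when nothing matches. `page` is a Puppeteer Page.
async function hrefFromXPath(page, xp) {
  const [first] = await page.$x(xp); // undefined when the array is empty
  if (!first) {
    return null;
  }
  return first.evaluate((el) => el.getAttribute("href"));
}
```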

As an aside, there's also el => el.href, which reads the href property rather than the attribute. .href includes the base URL here, so you won't need to concatenate. In general, the two differ in more ways than absolute vs relative paths, so it's good to know about both options.
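
If you do keep the relative value from getAttribute, the standard URL class in Node can resolve it against the base, which is roughly what the browser does for .href. A small sketch using the values from the question:

```javascript
const base = "https://www.timesbusinessdirectory.com";
const relative = "/company-listings?page=2"; // value from getAttribute("href")

// Resolve the relative reference against the base URL.
const absolute = new URL(relative, base).href;
console.log(absolute);
// https://www.timesbusinessdirectory.com/company-listings?page=2
```

Unlike string concatenation, this also copes with an href that is already absolute.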

Upvotes: 1
