Reputation: 23
I have a project to scrape the products purchased by certain customers from an internal CRM. This CRM uses a lot of dynamically loaded tiles, so there are few consistent class names (many have a random ID appended on each page load). There are also many different reports/elements on a page that share the same class name, so I can't query the whole page with an element selector.
I have identified the "parent" element that I want via XPath. I then want to drill down and get the innerText of only the children that match the query selector. (Most threads I see run the query selector on the whole page, which would also return results from menus I don't want.)
I can do this in regular JavaScript in the browser console; I just can't figure out how to do it in Node/Puppeteer. Here's what I have so far:
//Getting xpath of the "box" that contains all of the product tiles that a customer has
const productsBox = await page.$x("/html/body/blah/blah/blah");
This is where it breaks down. I'm not very familiar with some of the syntax, and I have trouble following Puppeteer's documentation, but I've tried a few different methods. (I'm also not comfortable enough with functions to use the => format. The Puppeteer documentation has an example of what I'm trying to do, but when I tried the same structure it also returned nothing.)
//Tried using the elementHandle.$$eval approach on the zero index of my xpath results,
//but doesn't return anything when I console.log(productsList)
const productsList = await productsBox[0].$$eval('.title-heading', function parseAndText (products) {
    productsList=[];
    for (i=0; i<products.length; i++) {
        productsList.push(products[i].innerText.trim());
    }
    return productsList;
}
);
//Tried doing the page.$$eval approach with selector, passing in the zero index of my xpath
const productsList = await page.$$eval('.title-heading', function parseAndText (products) {
    productsList=[];
    for (i=0; i<products.length; i++) {
        productsList.push(products[i].innerText.trim());
    }
    return productsList;
}, productsBox[0]);
//Tried the page.evaluate and then page.evaluateHandle approach on the zero index of my xpath,
//doing the query selection inside the evaluation and then doing something with that.
let productsList= await page.evaluateHandle(function parseAndText(productsBoxZero) {
    productsInnerList = productsBoxZero.querySelectorAll(".title-heading");
    productsList=[];
    for (i=0; i<productsInnerList.length; i++) {
        productsList.push(productsInnerList[i].innerText.trim());
        //Threw a console log here to see if it does anything,
        //But nothing is logged
        console.log("Pushed product " + i + " into the product list");
    }
    return productsList;
}, productsBox[0]);
In terms of output, I've console logged some of the variables and I get this:
productsBox is JSHandle@node
productsBox[0] is JSHandle@node
productList is
For comparison, I ran the same steps in JavaScript in the browser console to make sure I'm stepping through the logic correctly, and there I get what I expect:
>productsBox=$x("/html/body/blah/blah/blah");
>productsInnerList=productsBox[0].querySelectorAll(".title-heading");
>productsInnerList.length;
//2, and this customer has 2 products
>productsList=[];
>for (i=0; i<productsInnerList.length; i++) {
productsList.push(productsInnerList[i].innerText.trim());
};
>console.log(productsList)
>["Product 1", "Product 2"]
Thanks for reading this far and I appreciate your help!
[Edit]
After some additional research, I tried page.evaluateHandle and logged my variables again:
productsBox is JSHandle@node
productsBox[0] is JSHandle@node
productList is JSHandle@array
Which is progress. I tried to do:
let productsText=await productsList.jsonValue();
But when I try to output it, I get nothing:
await console.log("productsText is " + productsText);
productsBox is JSHandle@node
productsBox[0] is JSHandle@node
productList is JSHandle@array
productsText is
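One thing that may be worth ruling out with output like this: joining an empty array onto a string prints nothing at all, so a blank after "productsText is" can mean the value is an empty array rather than missing. A minimal sketch in plain Node, where the showValue helper and the empty stand-in value are made up for illustration:

```javascript
// Hypothetical helper: format a value so that an empty array stays visible.
function showValue(label, value) {
  return label + " is " + JSON.stringify(value);
}

// Stand-in for a result that turned out to be an empty array (an assumption).
const productsText = [];

// String concatenation calls Array.prototype.toString, which yields "" here.
console.log("productsText is " + productsText);

// JSON.stringify keeps the brackets, so the empty array can be told apart.
console.log(showValue("productsText", productsText));
```

Logging with JSON.stringify (or passing the value as a separate argument to console.log) makes it easier to tell "found no elements" apart from "returned nothing".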
Upvotes: 1
Views: 4471
Reputation: 3033
I'd suggest reading the docs carefully before trying every function.
$$eval evaluates on the selector, so passing the element as an extra argument is pointless in this case. evaluateHandle is for returning in-page objects; since you're returning an array of text, which is serializable, you don't need it. All you need is to pass the element to page.evaluate, or do everything in the Puppeteer context.
To see in-page console.log output in Node, you need to register a listener:
page.on('console', msg => console.log(msg.text()));
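For anyone more comfortable with the function keyword than with =>, the same listener can be sketched as a small named function and registered once before any evaluate call. The forwardPageConsole name and the "[page]" prefix are assumptions for illustration; page.on('console', ...) and msg.text() are the calls from the line above.

```javascript
// Hypothetical helper: relay console messages from the page to Node.
// `page` is expected to be a Puppeteer Page; `log` defaults to console.log.
function forwardPageConsole(page, log) {
  const write = log || console.log;
  page.on('console', function onPageConsole(msg) {
    // msg.text() is the text the page passed to console.log
    write('[page] ' + msg.text());
  });
}

// Usage, before calling page.evaluate:
//   forwardPageConsole(page);
```

The prefix just makes it easier to tell page-side messages from your own Node logging.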
page.evaluate
let productsList= await page.evaluate((element) => {
    const productsInnerList = element.querySelectorAll(".title-heading");
    const productsList=[];
    for (const el of productsInnerList) {
        productsList.push(el.innerText.trim());
        console.log("Pushed product " + el.innerText.trim() + " into the product list");
    }
    return productsList;
}, productsBox[0]);
elementHandle.$$
const productList = [];
const productsInnerList = await productsBox[0].$$('.title-heading');
for (const element of productsInnerList){
    const innerText = await (await element.getProperty('innerText')).jsonValue();
    productList.push(innerText);
}
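The same loop can be wrapped in a reusable function. This is only a sketch: textsWithin is a made-up name, and it assumes the parent argument is an ElementHandle such as productsBox[0]. It relies only on $$, getProperty, and jsonValue from the snippet above, plus a trim() to match the page.evaluate version.

```javascript
// Hypothetical helper: collect the trimmed innerText of every descendant of
// `parent` that matches `selector`. The search is scoped to `parent`, so
// matching elements elsewhere on the page (menus, other reports) are ignored.
async function textsWithin(parent, selector) {
  const texts = [];
  const children = await parent.$$(selector);
  for (const child of children) {
    const property = await child.getProperty('innerText');
    const text = await property.jsonValue();
    texts.push(String(text).trim());
  }
  return texts;
}

// Usage, assuming productsBox comes from the XPath query in the question:
//   const productList = await textsWithin(productsBox[0], '.title-heading');
```

Each getProperty/jsonValue pair is a separate round trip to the browser, so for long lists the page.evaluate version is likely to be faster.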
Upvotes: 1
Reputation: 23
Based on @mbit's answer I was able to get it to work. I first tested on another site that was similar in structure to mine. I then copied the code over to my original site and it still wasn't working; I only got a null output. It turns out that while I had an await page.$x(full/xpath) for the parent element, the child elements that contained the innerText still hadn't loaded. So I did two things:
1) Added another await page.$x(full/xpath) for the first element in the list that was one of my targets
2) Implemented the page.evaluate approach provided by mbit
2a) Explicitly wrote out the function (still wrapping my head around the => structure)
Final code below (some variable names changed as a result of testing):
let productsTextList= await page.evaluate(function list(box) {
    //Runs inside the page; box is the DOM element behind productsBox[0]
    const productsInnerList = box.querySelectorAll(".title-heading");
    //Declared with const/let so nothing leaks into the page's global scope
    const productsTextList = [];
    for (let n=0; n<productsInnerList.length; n++) {
        const product=productsInnerList[n].innerText.trim();
        productsTextList.push(product);
    }
    return productsTextList;
}, productsBox[0]);
console.log(productsTextList);
I chose the page.evaluate approach because it more closely matched what I was doing in the browser console, which made it easy to test. The trick with the elementHandle.$$ approach was, as mbit mentioned, using await element.getProperty('innerText') rather than .innerText. Throughout troubleshooting and learning, I also stumbled across this thread on GitHub, which also talks about how to extract it (same as mbit's approach above). For anyone running into similar issues, you aren't alone!
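Since the root cause described above was child elements that had not rendered yet, one way to make the wait explicit is to poll the parent until the selector matches something. This is a sketch and not what my final code does: waitForChildren, timeoutMs, and intervalMs are invented for illustration, and only elementHandle.$$ is taken from Puppeteer. Puppeteer also ships its own wait helpers, which may be a better fit; check the documentation for the version you have installed.

```javascript
// Hypothetical helper: resolve with the matching child handles once at least
// one exists under `parent`, or throw after roughly `timeoutMs` milliseconds.
async function waitForChildren(parent, selector, timeoutMs = 5000, intervalMs = 100) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const children = await parent.$$(selector);
    if (children.length > 0) {
      return children;
    }
    if (Date.now() >= deadline) {
      throw new Error('Timed out waiting for "' + selector + '"');
    }
    // Pause before the next attempt.
    await new Promise(function (resolve) {
      setTimeout(resolve, intervalMs);
    });
  }
}

// Usage, before calling page.evaluate on productsBox[0]:
//   await waitForChildren(productsBox[0], '.title-heading');
```

Throwing on timeout means a page that never loads its tiles fails loudly instead of quietly returning an empty list.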
Upvotes: 0