Reputation: 284
I have the following HTML excerpt from a webpage:
<li class="PaEvOc tv5olb wbTnP gws-horizon-textlists__li-ed">
  <!-- random div/element stuff inside here -->
</li>
<li class="PaEvOc tv5olb gws-horizon-textlists__li-ed">
  <!-- random div/element stuff inside here as well -->
</li>
I'm not sure how to properly copy the HTML, but if you search for "events near location" on Google in Chrome, these are the elements I'm looking at and trying to scrape data from:
https://i.sstatic.net/fv4a4.png
To start, I'm just trying to figure out how to properly select these elements in Puppeteer:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  const page = await browser.newPage();
  page.once('load', () => console.log('Page loaded!'));
  await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');
  console.log('Hit wait for selector');
  // Wait for the first event list item to appear
  const test = await page.waitForSelector(".PaEvOc");
  console.log('finished waiting for selector');
  const seeMoreEventsButton = await page.$(".PaEvOc");
  console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
  console.log('test is ' + test);
  await browser.close();
})();
What exactly is the problem here? Any and all help much appreciated, thank you!
Upvotes: 1
Views: 847
Reputation: 4847
I suggest reading this: https://intoli.com/blog/not-possible-to-block-chrome-headless/
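Basically, the website detects that the request comes from an automated (headless) browser and treats it as scraping, but you can work around it by setting a regular browser user agent.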
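To illustrate the detection point: headless Chrome has typically advertised itself with a `HeadlessChrome` token in its default user agent, which is one of the easier signals for a site to check. The following is only a minimal sketch, assuming that token is present in your build; `toRegularUserAgent` and `useRegularUserAgent` are hypothetical helper names, not part of Puppeteer, while `browser.userAgent()` and `page.setUserAgent()` are Puppeteer calls.

```javascript
// Hypothetical helper (not part of Puppeteer): rewrites the headless marker
// so the user agent string looks like a regular Chrome build.
// Assumption: the default user agent contains the token "HeadlessChrome".
function toRegularUserAgent(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Sketch of applying it: read the browser's default user agent,
// strip the headless marker, and set the result on the page
// before navigating anywhere.
async function useRegularUserAgent(browser, page) {
  const defaultUserAgent = await browser.userAgent();
  await page.setUserAgent(toRegularUserAgent(defaultUserAgent));
}
```

Compared with hard-coding a user agent string, this keeps the reported Chrome version in line with the browser actually running, though a site may still rely on other signals.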
Here is what I did to make your console logs print something useful:
const puppeteer = require('puppeteer');

(async () => {
  // Present a regular desktop Chrome user agent instead of the headless default
  const preparePageForTests = async (page) => {
    // Note the space after the closing parenthesis so the two parts join correctly
    const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) ' +
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
    await page.setUserAgent(userAgent);
  };

  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  const page = await browser.newPage();
  await preparePageForTests(page);
  page.once('load', () => console.log('Page loaded!'));
  await page.goto('https://www.google.com/search?q=events+near+poughkeepsie+today&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail');
  console.log('Hit wait for selector');
  const test = await page.waitForSelector(".PaEvOc");
  console.log('finished waiting for selector');
  const seeMoreEventsButton = await page.$(".PaEvOc");
  console.log('seeMoreEventsButton is ' + seeMoreEventsButton);
  console.log('test is ' + test);
  await browser.close();
})();
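Once the selector resolves, you will probably want the data from every matching `<li>` rather than a single element handle. Here is a rough sketch using `page.$$eval`, which runs a function in the page over all elements matching a selector. `extractEventText` and `scrapeEvents` are hypothetical names, and the sketch assumes the `.PaEvOc` class from the question is still what Google renders, which may well change.

```javascript
// Runs inside the page via page.$$eval and receives the matched <li> elements.
// Returns the trimmed text of each item, dropping items with no text.
const extractEventText = (items) =>
  items
    .map((li) => li.textContent.trim())
    .filter((text) => text.length > 0);

// Assumes `page` is a Puppeteer page that has already navigated to the results.
async function scrapeEvents(page) {
  await page.waitForSelector('.PaEvOc');
  return page.$$eval('.PaEvOc', extractEventText);
}
```

You could call `const events = await scrapeEvents(page);` right after `page.goto(...)` and log the resulting array of strings. From there, narrower selectors inside each `<li>` would let you pull out individual fields.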
Upvotes: 1