Alator

Reputation: 508

Puppeteer is unable to get the complete source code

I'm creating a simple scraping application with Node.js and Puppeteer. The page I'm trying to scrape is the one at the URL in the code below, which is what I'm using right now.

import { promises as fs } from 'fs';

// `page` is assumed to be a Puppeteer Page created earlier (e.g. via browser.newPage()).
const url = `https://www.betrebels.gr/el/sports?catids=122,40,87,28,45,2&champids=423,274616,1496978,1484069,1484383,465990,465991,91,71,287,488038,488076,488075,1483480,201,2,367,38,1481454,18,226,440,441,442,443,444,445,446,447,448,449,451,452,453,456,457,458,459,460,278261&datefilter=TodayTomorrow&page=prelive`;
await page.goto(url, {waitUntil: 'networkidle2'});
const content: string = await page.content();
await page.screenshot({path: 'page.png', fullPage: true});
await fs.writeFile('temp.html', content);
// ...Analyze the HTML and other stuff.

The screenshot I get shows the fully rendered page, which is what I expect.

The page content, on the other hand, is minimal and doesn't contain the data visible in the screenshot.

Am I doing something wrong? Am I not waiting properly for the JavaScript to finish?

Upvotes: 2

Views: 1005

Answers (1)

Thomas Dondorf

Reputation: 25280

The page is using frames. page.content() only returns the HTML of the main document, without the content of the frames embedded in it. To also get the content of a frame, you first need to find the frame element (e.g. via page.$) and then get its frame via elementHandle.contentFrame(). You can then call frame.content() to get the content of the frame.

Simple Example

// '#selector iframe' is a placeholder; use a selector that matches the target iframe.
const frameElementHandle = await page.$('#selector iframe');
const frame = await frameElementHandle.contentFrame();
const frameContent = await frame.content();

Depending on the structure of the page, you may need to do this for multiple frames to get all contents, or even for a frame inside another frame (which seems to be the case for the given page).
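
For the frame-inside-a-frame case, the same two calls can be chained, one selector per nesting level. The following is only a sketch: getNestedFrame is a hypothetical helper name, and the selectors in the usage line are placeholders that would have to match the actual page.

```javascript
// Hypothetical helper: follows a chain of iframe selectors, one per nesting level.
// Resolves to the innermost frame, or null if any level cannot be found.
async function getNestedFrame(page, selectors) {
  let current = page;
  for (const selector of selectors) {
    // Look up the iframe element inside the current page or frame.
    const handle = await current.$(selector);
    if (!handle) return null;
    // Switch into the document loaded by that iframe.
    current = await handle.contentFrame();
    if (!current) return null;
  }
  return current;
}
```

Usage would look like `const inner = await getNestedFrame(page, ['#outer iframe', '#inner iframe']);`, followed by `await inner.content()` once the result has been checked for null.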

Example to read all frame contents

Below is an example that recursively reads the contents of all frames on the page.

const contents = [];
async function extractFrameContents(pageOrFrame) {
  const frameElements = await pageOrFrame.$$('iframe');
  for (const frameElement of frameElements) {
    const frame = await frameElement.contentFrame();
    // contentFrame() may resolve to null (e.g. the frame detached); skip it
    if (!frame) continue;
    const frameContent = await frame.content();

    // do something with the content, for example:
    contents.push(frameContent);

    // recurse into frames nested inside this one
    await extractFrameContents(frame);
  }
}
await extractFrameContents(page);
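
As a possible alternative to walking the iframe elements, Puppeteer's page.frames() returns every frame currently attached to the page (the main frame and all nested ones), so a flat loop can collect the same content. This is a minimal sketch, assuming the page has already finished loading; collectFrameContents is a hypothetical helper name, and frames that fail while being read are simply skipped.

```javascript
// Hypothetical helper: collects the URL and HTML of every frame attached to the page.
// page.frames() includes the main frame as well as all nested child frames.
async function collectFrameContents(page) {
  const results = [];
  for (const frame of page.frames()) {
    try {
      results.push({ url: frame.url(), html: await frame.content() });
    } catch (err) {
      // A frame can detach or navigate while it is being read; skip it.
    }
  }
  return results;
}
```

Usage would look like `const frameContents = await collectFrameContents(page);`. Keeping the URL next to each HTML string makes it easier to tell which frame actually holds the data.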

Upvotes: 2
