anon20010813

Reputation: 155

Puppeteer not scraping full information from website

I have a Puppeteer script that scrapes YouTube for the thumbnail image URLs of videos, but it only prints 4 URLs; the remaining entries are empty strings. To check whether the problem is specific to the image source, I added code to scrape the video titles as well, and that prints every title with no empty strings. What causes this, and how can I get all the image URLs?

One explanation I thought of is that YouTube shows 4 thumbnails per row and Puppeteer is somehow only reading the first row, then printing empty strings for the others. But the title scrape returns all the titles, which seems to disprove that hypothesis. Any help is appreciated. Thanks in advance.

const puppeteer = require('puppeteer');

async function scrape(url) {

    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, {timeout: 0});

    const selector1 = 'ytd-thumbnail > a > yt-img-shadow > #img'
    const src1 = await page.$$eval(selector1, elems => elems.map(el => el.src))

    const selector2 = 'h3 > a > #video-title'
    const src2 = await page.$$eval(selector2,  elems => elems.map(el => el.textContent))

    await browser.close();
    console.log({src1, src2})
}

scrape("http://www.youtube.com")

Upvotes: 0

Views: 399

Answers (1)

theDavidBarton

Reputation: 8861

This is caused by YouTube's infinite scrolling: the client browser only fetches items once the user has scrolled them into view. You can open the Elements tab in DevTools and inspect the last ytd-rich-item-renderer (ytd-rich-item-renderer:nth-child(n)). Inside it you will see a yt-img-shadow whose <img> has no src yet:

<yt-img-shadow 
  ftl-eligible="" 
  class="style-scope ytd-thumbnail no-transition empty" 
  style="background-color: transparent;">
  <!--css-build:shady-->
  <img id="img" class="style-scope yt-img-shadow" alt="" width="9999">
</yt-img-shadow>

Once you scroll down far enough to bring the element into view, the inner <img> gets a src attribute:

<yt-img-shadow 
  ftl-eligible="" 
  class="style-scope ytd-thumbnail no-transition" 
  style="background-color: transparent;" 
  loaded="">
  <!--css-build:shady-->
  <img id="img" class="style-scope yt-img-shadow" alt="" width="9999" src="https://i.ytimg.com/vi/_{id}/hqdefault.jpg?sqp={parameter}">
</yt-img-shadow>

There are many answers on Stack Overflow about how to deal with infinite scrolling in Puppeteer.

Most probably you will need to run vanilla JS (e.g. scrollTo) inside page.evaluate to scroll as far as you need.
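
As a rough sketch of that scrolling approach, the question's script could scroll down a few viewports before reading the thumbnails. The helper name scrollToLoad, the step count, and the delay are assumptions chosen for illustration, not values taken from YouTube; the selector is the one from the question and may not match YouTube's current markup.

```javascript
// Hypothetical helper: scroll the page in steps so lazily loaded
// thumbnails come into view. steps and delayMs are arbitrary guesses.
async function scrollToLoad(page, { steps = 10, delayMs = 500 } = {}) {
  for (let i = 0; i < steps; i++) {
    // Runs in the browser context: scroll down by one viewport height.
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    // Pause so the page has time to fetch the newly visible items.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}

async function scrape(url) {
  // Required here rather than at the top so the helper above
  // can be reused or tested without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout: 0 });

    await scrollToLoad(page);

    // Same selector as in the question; drop entries that never got a src.
    const selector = 'ytd-thumbnail > a > yt-img-shadow > #img';
    const srcs = await page.$$eval(selector, elems =>
      elems.map(el => el.src).filter(Boolean)
    );
    console.log(srcs);
  } finally {
    await browser.close();
  }
}
```

Calling scrape('https://www.youtube.com') would then print only the thumbnails that had loaded by the time scrolling stopped; more steps or a longer delay may be needed on a slow connection.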

Upvotes: 1
