shekwo

Reputation: 1447

Scraping paginated data loaded with JavaScript

I am trying to use Selenium and BeautifulSoup to scrape videos from a website. The videos are loaded when the 'videos' tab is clicked (via JS, I guess). Once they have loaded, there is also pagination, and the videos on each page are loaded on click (again via JS, I guess).

[Screenshot: how the videos tab and its pagination look]

[Screenshot: the markup shown when inspecting the element]

My issue is that I can't get the videos across all pages; I only get the first page. Here is my code:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup

import random
import time

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument('--headless')
seconds = 5 + (random.random() * 5)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.implicitly_wait(30)

driver.get("https://")
time.sleep(seconds)
time.sleep(seconds)

for i in range(1):
    element = driver.find_element_by_id("tab-videos")
    driver.execute_script("arguments[0].click();", element)
    time.sleep(seconds)
    time.sleep(seconds)
html = driver.page_source
page_soup = soup(html, "html.parser")

containers = page_soup.findAll("div", {"id": "tabVideos"})
for videos in containers:
    main_videos = videos.find_all("div", {"class":"thumb-block tbm-init-ok"})
print(main_videos)
driver.quit()

What am I missing here?

Upvotes: 3

Views: 967

Answers (1)

Andrej Kesely

Reputation: 195408

The content is loaded from the URL 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}', where page starts at 0.

This script will print all video URLs:

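import requests
from bs4 import BeautifulSoup


# Paginated endpoint that serves the videos tab; {page} starts at 0
url = 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}'

page = 0
while True:
    # Fetch one page of results and parse it
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, 'html.parser')

    # Each video sits in a div whose id starts with "video_"
    for video in soup.select('div[id^="video_"] .title a'):
        # Keep only the last two path segments of the link
        u = video['href'].rsplit('/', maxsplit=2)
        print('https://www.x***s.com/video' + u[-2] + '/' + u[-1])

    # Stop when the page has no "next page" link
    next_page = soup.select_one('a.next-page')
    if not next_page:
        break

    page += 1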
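
As a rough illustration of the two ideas in the script above (stop when there is no next-page link, and rebuild each URL from the last two path segments), here is a sketch that separates fetching from parsing so the logic can be tried without touching the network. It is not the script above: it uses only the standard library instead of BeautifulSoup, and the names `has_next_page`, `iter_pages`, `rebuild_video_url`, the `max_pages` cap, and the sample hrefs and HTML are all hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser
from typing import Callable, Iterator


class NextPageFinder(HTMLParser):
    """Sets `found` when an <a> tag carrying the class "next-page" is seen."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute arrives as None, hence the `or ""`
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "next-page" in classes:
            self.found = True


def has_next_page(html: str) -> bool:
    """Rough stand-in for soup.select_one('a.next-page')."""
    finder = NextPageFinder()
    finder.feed(html)
    return finder.found


def iter_pages(fetch: Callable[[int], str], max_pages: int = 1000) -> Iterator[str]:
    """Yield the HTML of pages 0, 1, 2, ... until one has no next-page link.

    `fetch` is any callable mapping a page number to HTML (assumption);
    `max_pages` is a safety cap that the original script does not have.
    """
    for page in range(max_pages):
        html = fetch(page)
        yield html
        if not has_next_page(html):
            break


def rebuild_video_url(href: str, base: str = "https://www.x***s.com/video") -> str:
    """Mirror the rsplit step: keep only the last two path segments."""
    parts = href.rsplit("/", maxsplit=2)
    return base + parts[-2] + "/" + parts[-1]


# Offline demo with made-up pages standing in for real responses
fake_pages = {
    0: '<a class="next-page" href="#1">Next</a>',
    1: '<a class="btn next-page" href="#2">Next</a>',
    2: "<p>last page</p>",
}
print(len(list(iter_pages(fake_pages.__getitem__))))  # 3
print(rebuild_video_url("/prefix/12345/some_title"))
# https://www.x***s.com/video12345/some_title
```

To use it against the real site, `fetch` could be something like `lambda page: requests.get(url.format(page=page)).text`, with `url` as defined in the script above.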

Upvotes: 3
