Reputation: 1447
I am trying to use Selenium and BeautifulSoup to scrape videos from a website. The videos are loaded when the 'Videos' tab is clicked (via JS, I guess), and the results are paginated, with each page of videos also loaded on click (via JS, I guess).
Here is how it looks:
When I inspect the element, here is what I get:
My issue is that I can't seem to get all the videos across all pages; I can only get the first page. Here is my code:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup
import random
import time

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument('--headless')

seconds = 5 + (random.random() * 5)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.implicitly_wait(30)
driver.get("https://")
time.sleep(seconds)
time.sleep(seconds)

for i in range(1):
    element = driver.find_element_by_id("tab-videos")
    driver.execute_script("arguments[0].click();", element)
    time.sleep(seconds)
    time.sleep(seconds)

    html = driver.page_source
    page_soup = soup(html, "html.parser")
    containers = page_soup.findAll("div", {"id": "tabVideos"})

    for videos in containers:
        main_videos = videos.find_all("div", {"class": "thumb-block tbm-init-ok"})
        print(main_videos)

driver.quit()
Please, what am I missing here?
Upvotes: 3
Views: 967
Reputation: 195408
The content is loaded from the URL 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}', where page starts from 0.
This script will print all video URLs:
import requests
from bs4 import BeautifulSoup

url = 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}'

page = 0
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    # print the full URL of every video on the current page
    for video in soup.select('div[id^="video_"] .title a'):
        u = video['href'].rsplit('/', maxsplit=2)
        print('https://www.x***s.com/video' + u[-2] + '/' + u[-1])

    # stop when there is no "next page" link anymore
    next_page = soup.select_one('a.next-page')
    if not next_page:
        break
    page += 1
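If you would rather keep the Selenium approach from the question, the same idea can be applied in the browser: keep clicking the pagination link until it disappears and parse each rendered page. The following is only a rough sketch, not tested against the real page; it assumes the "tab-videos" id from the question, that the pagination link matches the a.next-page selector used above, and uses fixed sleeps as placeholders for proper waits.

# Rough Selenium-only sketch (untested against the live page); the page URL
# is redacted here just like in the question.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://")  # URL redacted, as in the question

# open the videos tab via a JS click, like the original code does
tab = driver.find_element(By.ID, "tab-videos")
driver.execute_script("arguments[0].click();", tab)
time.sleep(5)

video_links = []
while True:
    # collect the video anchors currently rendered in the page source
    page_soup = BeautifulSoup(driver.page_source, "html.parser")
    video_links.extend(a["href"] for a in page_soup.select('div[id^="video_"] .title a'))

    # click the "next page" link if there is one, otherwise stop
    next_page = driver.find_elements(By.CSS_SELECTOR, "a.next-page")
    if not next_page:
        break
    driver.execute_script("arguments[0].click();", next_page[0])
    time.sleep(5)

print(video_links)
driver.quit()

The requests-based version above is still the simpler choice here, since the data is served from plain paginated URLs and no JavaScript needs to run.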
Upvotes: 3