Reputation: 439
The script is supposed to collect all the links from base_url,
which displays a subset of results; as you scroll down, more results are appended to that subset until the list is exhausted. The issue is that I only retrieve the few links that load when the page first appears, before any scrolling. The response should update as the web driver scrolls, but it doesn't. This is my code so far.
import re
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome(r"E:\chromedriver.exe")
base_url = "https://genius.com/search?q=" + "drake"
myheader = {'User-Agent': ''}
mybrowser.get(base_url)
t_end = time.time() + 60 * 1
while time.time() < t_end:
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    response = requests.get(base_url, headers=myheader)
    soup = BeautifulSoup(response.content, "lxml")
    pattern = re.compile(r"\S+-lyrics$")
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']):
            print(link['href'])
This only prints the first few links; the links that load as Selenium scrolls the page are never retrieved.
Upvotes: 1
Views: 306
Reputation: 3898
You need to parse the HTML from Selenium itself (it changes as Selenium scrolls the page), rather than using requests to download the page — requests fetches a fresh, unscrolled copy of the URL each time, so it never sees the results the scrolling added.
Change:
response = requests.get(base_url, headers = myheader)
soup = BeautifulSoup(response.content, "lxml")
to:
html = mybrowser.page_source
soup = BeautifulSoup(html, "lxml")
And it should work just fine.
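Putting it together, here is a minimal sketch of the corrected loop. The link-filtering step is factored into a stdlib-only helper (plain regex over the raw HTML, standing in for the original BeautifulSoup `find_all('a', href=True)` step) so it can be shown on its own; the function and variable names, the one-second pause after each scroll, and the de-duplication `set` are illustrative additions, not part of the original code.

```python
import re
import time

# Matches Genius song URLs, e.g. "https://genius.com/Drake-gods-plan-lyrics"
LYRICS_PATTERN = re.compile(r"\S+-lyrics$")


def extract_lyric_links(html):
    """Pull href values out of raw HTML and keep only those ending in -lyrics.

    Stdlib-only stand-in for soup.find_all('a', href=True) plus the
    pattern.match filter from the question.
    """
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [h for h in hrefs if LYRICS_PATTERN.match(h)]


def scrape_with_scrolling(driver, url, seconds=60):
    """Scroll for `seconds`, re-parsing driver.page_source after each pass."""
    driver.get(url)
    seen = set()
    t_end = time.time() + seconds
    while time.time() < t_end:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # give the page a moment to append new results
        # page_source reflects the live DOM, including freshly loaded links
        for link in extract_lyric_links(driver.page_source):
            if link not in seen:
                seen.add(link)
                print(link)
    return seen
```

Because `page_source` is read inside the loop, each iteration sees whatever the scroll has loaded so far, and the `set` keeps already-printed links from repeating.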
Upvotes: 2