Reputation: 230
I am working on a web scraping project. The URL of the website I am scraping is https://www.beliani.de/sofas/ledersofa/
I am trying to scrape the links of all products listed on this page. I tried getting the links with both Requests-HTML and Selenium, but I get only 57 and 24 links respectively, while there are more than 150 products listed on the page. Below are the code blocks I am using.
Using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")
# path to chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
url = 'https://www.beliani.de/sofas/ledersofa/'
driver.get(url)
sleep(20)
links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a)
    links.append(a)
print(len(links))
Using Requests-HTML:
from requests_html import HTMLSession
url = 'https://www.beliani.de/sofas/ledersofa/'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep = 20)
products = r.html.xpath('//*[@id="offers_div"]', first = True)
#Getting 57 links using below block:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)
print(len(links))
I cannot figure out which step I am doing wrong or what is missing.
Upvotes: 0
Views: 516
Reputation: 829
I had an issue where not all links were found, and the fix was to make sure all popups were closed (X'd out). Otherwise, elements can be obscured or interfered with while popups are on top of them.
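A minimal sketch of what dismissing such a popup can look like, assuming the driver from the question; the close-button selector here is only a placeholder, so adjust it to whatever popup the site actually shows:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# Wait briefly for a popup close button and click it if one shows up.
# 'button.popup-close' is a hypothetical selector, not the site's real markup.
try:
    WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button.popup-close"))
    ).click()
except TimeoutException:
    pass  # no popup appeared, nothing to close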
Upvotes: 0
Reputation: 193108
To extract the total number of links using Selenium and Python, you need to accept the cookies and induce WebDriverWait for visibility_of_all_elements_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://www.beliani.de/sofas/ledersofa/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()
print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))))
Using XPATH:
driver.get("https://www.beliani.de/sofas/ledersofa/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Akzeptieren']"))).click()
print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='offers_div']/div/div/a[@href]")))))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
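If you want the URLs themselves rather than just the count, a small follow-up sketch using the same cookie click and CSS selector as above would be:
# collect the href attribute of each visible product anchor
links = [e.get_attribute("href") for e in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))]
print(len(links))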
Upvotes: 1
Reputation: 831
You have to scroll through the website and reach the end of the page so that all of the content gets loaded. Just opening the website only loads the part needed to render the visible section, so when you ran your code it could only retrieve data from the sections that had already been loaded.
This one gave me 160 links:
driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)
# gets the whole height of the document
height = driver.execute_script('return document.body.scrollHeight')
# now break the webpage into parts so that each section in the page is scrolled through and loaded
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height / 10)
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)
# I have used the class name locator here; you can use any locator you want once the loop has completed
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1
print(count)
driver.quit()
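If the number of products varies, an alternative to the fixed ten scroll steps is to keep scrolling to the bottom until the document height stops growing; a rough sketch of that idea (it replaces the scrolling loop above, runs before driver.quit(), and reuses the same itemBox class, so treat it as an untested illustration):
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # nothing new was loaded, stop scrolling
    last_height = new_height
links = [a.get_attribute('href') for a in driver.find_elements_by_class_name('itemBox') if a.get_attribute('href') is not None]
print(len(links))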
Upvotes: 1