Filippo Sebastio

Reputation: 1112

Dynamic scraping with Selenium and Python delivers no results

I am trying to scrape the following page using selenium to get the names of all the factories:

https://bangladeshaccord.org/factories

I am using the following code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd


urlpage = "https://bangladeshaccord.org/factories"
print(urlpage)

driver = webdriver.Chrome(executable_path=r"C:\Users\filippo.sebastio\chromedriver.exe")

driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(30)

results = driver.find_elements_by_xpath("//*[@id='factories']/div[3]/div/div/div[2]/div[3]/div[1]/div[2]/div[1]/div[2]/span[2]")
print('Number of results', len(results))

As a result I get:

https://bangladeshaccord.org/factories

Number of results 1

Why do I get only one result? And why can't I even print it?

Thanks!

Upvotes: 0

Views: 89

Answers (3)

dmainz

Reputation: 1025

If you want to get all company entries you can incrementally scroll down to the bottom of the page. As window.scrollTo didn't work here, I set document.getElementById('page-body').scrollTop instead. Doing this, all entries will be loaded.

def scroll_to_bottom(driver):
    # Read the current scroll position of the page body.
    scroll_y = driver.execute_script("return document.getElementById('page-body').scrollTop")
    # Scroll down a bit and read the position again.
    driver.execute_script("document.getElementById('page-body').scrollTop = {};".format(scroll_y + 500))
    new_scroll_y = driver.execute_script("return document.getElementById('page-body').scrollTop")
    # Keep scrolling until the position stops increasing, i.e. we hit the bottom.
    while scroll_y < new_scroll_y:
        driver.execute_script("document.getElementById('page-body').scrollTop = {};".format(new_scroll_y + 500))
        scroll_y = new_scroll_y
        new_scroll_y = driver.execute_script("return document.getElementById('page-body').scrollTop")
        time.sleep(2)  # give the page time to load the next batch of entries

And as stated in another answer, you have to use a different selector. Your code, updated a little, could then look like this (it scrolls down the page and finally prints out the number of companies as well as a list of their names):

urlpage = "https://bangladeshaccord.org/factories"
print(urlpage)

driver = webdriver.Chrome(executable_path=r"C:\Users\filippo.sebastio\chromedriver.exe")
driver.get(urlpage)
time.sleep(5)
scroll_to_bottom(driver)

results = driver.find_elements_by_class_name("sc-ldcLGC")

print('Number of results', len(results))
for res in results:
    company = res.find_element_by_css_selector('h2.sc-cAJUJo')
    print(company.get_attribute("textContent"))

Upvotes: 0

undetected Selenium

Reputation: 193338

To retrieve the number of results you need to induce WebDriverWait for the visibility_of_all_elements_located() and you can use the following Locator Strategies:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://bangladeshaccord.org/factories")
    driver.execute_script("arguments[0].scrollIntoView(true);",WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[contains(., 'Accord Factories ')]"))))
    myLength = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[./span[text()='Remediation Status:']]//preceding::h2[1]"))))
    print(myLength)
    driver.quit()
    

Upvotes: 0

Mario Kirov

Reputation: 351

The reason is that the XPath you are using points to one specific element, which is why you get only a single result. You should use the upper parent div to get all the result boxes, then get their child div tags, and finally the h2 tag with the name. But the problem remains: what are you going to do about the load-on-scroll? Auto-scrolling in Selenium is not a good idea if there is a better approach.

Here's the solution. Checking the website, it makes GET/POST requests to an API to get all the data, so you don't even have to use the UI and Selenium; you can use simple GET/POST requests instead. Here's a sample URL for the factory search with default filters on page 1:

https://accord2.fairfactories.org/api/v1/factories?status=active,inactive,no%20brand,pending%20closure&designation=completed,ontrack,behindschedule,capnotfinalised,notfinalized,initialcompleted&progress=0,1,2,3,4,5,6,7,8,9&language=en&limit=20&format=json&page=1

All the parameters here come from the filters in the UI, so customize them if you want to change the search results. Use the page parameter for the next pages (the UI loads more on scroll).

Now you have simple GET/POST requests and JSON to parse.
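A minimal sketch of that approach using the requests library (the query parameters are copied from the sample URL above; the JSON field names "results" and "name" are assumptions, so inspect one real response to confirm the actual keys):

import requests

# API endpoint taken from the sample URL above.
API_URL = "https://accord2.fairfactories.org/api/v1/factories"

# Query parameters copied from the sample URL (default UI filters).
params = {
    "status": "active,inactive,no brand,pending closure",
    "designation": "completed,ontrack,behindschedule,capnotfinalised,notfinalized,initialcompleted",
    "progress": "0,1,2,3,4,5,6,7,8,9",
    "language": "en",
    "limit": 20,
    "format": "json",
    "page": 1,
}

factory_names = []
while True:
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    data = response.json()
    # ASSUMPTION: the payload holds a "results" list whose items have a
    # "name" field; check an actual response and adjust the keys if needed.
    batch = data.get("results", [])
    if not batch:
        break
    factory_names.extend(item.get("name") for item in batch)
    params["page"] += 1  # next page = next "load more on scroll" batch

print("Number of factories:", len(factory_names))

The loop stops as soon as a page comes back empty, which is the usual end-of-results signal for this kind of paginated API.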

Hope that helps.

Upvotes: 1
