How to get all the data from a webpage manipulating lazy-loading method?

Question

I've written some script in python using selenium to scrape name and price of different products from redmart website. My scraper clicks on a link, goes to its target page, parses data from there. However, the issue I'm facing with this crawler is it scrapes very few items from a page because of the webpage's slow-loading method. How can I get all the data from each page controlling the lazy-loading process? I tried with "execute script" method but i did it wrongly. Here is the script I'm trying with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://redmart.com/bakery")
wait = WebDriverWait(driver, 10)

counter = 0    
while True:

    try:
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li.image-facets-pill")))
        driver.find_elements_by_css_selector('img.image-facets-pill-image')[counter].click()      
        counter += 1    
    except IndexError:
        break 

    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    for elems in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productPreview"))):
        name = elems.find_element_by_css_selector('h4[title] a').text
        price = elems.find_element_by_css_selector('span[class^="ProductPrice__"]').text
        print(name, price)

    driver.back()
driver.quit()

jlaur · Accepted Answer

I guess you could use Selenium for this but if speed is your concern aften @Andersson crafted the code for you in another question on Stackoverflow, well, you should replicate the API calls, that the site uses instead and extract the data from the JSON - like the site does.

If you use Chrome Inspector you'll see that the site for each of those categories that are in your outer while-loop (the try-block in your original code) calls an API, that returns the overall categories of the site. All this data can be retrieved like so:

categories_api = 'https://api.redmart.com/v1.5.8/catalog/search?extent=0&depth=1'
r = requests.get(categories_api).json()

For the next API calls you need to grab the uris concerning the bakery stuff. This can be done like so:

bakery_item = [e for e in r['categories'] if e['title'] == 'Bakery]
children = bakery_item[0]['children']
uris = [c['uri'] for c in children]

Uris will now be a list of strings (['bakery-bread', 'breakfast-treats-212', 'sliced-bread-212', 'wraps-pita-indian-breads', 'rolls-buns-212', 'baked-goods-desserts', 'loaves-artisanal-breads-212', 'frozen-part-bake', 'long-life-bread-toast', 'speciality-212']) that you'll pass on to another API found by Chrome Inspector, and that the site uses to load content.

This API has the following form (default returns a smaller pageSize but I bumped it to 500 to be somewhat sure you get all data in one request):

items_API = 'https://api.redmart.com/v1.5.8/catalog/search?pageSize=500&sort=1024&category={}'

for uri in uris:
    r = requests.get(items_API.format(uri)).json()
    products = r['products']
    for product in products:
        name = product['title']
        # testing for promo_price - if its 0.0 go with the normal price
        price = product['pricing']['promo_price']
        if price == 0.0:
            price = product['pricing']['price']
        print("Name: {}. Price: {}".format(name, price))

Edit: If you want to stick to selenium still, you could insert something like this to hansle the lazy loading. Questions on scrolling has been answered several times before, so yours is actually a duplicate. In the future you should showcase what you tried (you own effort on the execute part) and show the traceback.

check_height = driver.execute_script("return document.body.scrollHeight;") 
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    height = driver.execute_script("return document.body.scrollHeight;") 
    if height == check_height: 
        break 
     check_height = height

How to get all the data from a webpage manipulating lazy-loading method?

Answers (1)

Related Questions