Espuky

Reputation: 33

Web scraping with page limit

I have been trying to scrape the products on this site (https://www.americanas.com.br/hotsite/todas-ofertas-mundo) with BeautifulSoup. I can get all the items on one page, and since the pagination is in the URL, I just move on to the next page with a counter (e.g. page 2 is https://www.americanas.com.br/hotsite/todas-ofertas-mundo/pagina-2, and so on). The problem is that the maximum page is 416; after that number, the page does not show any products. Since each page shows 24 products, I can barely reach 10k products (out of a total of 4 million according to the site).

I attempted to go deeper into the categories, but I hit the same problem (some deeper categories also have more than 10k products). I also tried filtering by "marca", "price" and "loja", with the same issue. So even with the finest filter, I can't get all the products, because I reach the maximum page within it.

I also searched for an API to try to bypass this, but couldn't find any that lets me request the catalog without a product ID. I did find one for getting the different brands and sellers, but it has the same limit on the number of products.

This is an issue that has troubled me on other marketplaces as well, making it really hard to get the ENTIRE catalog of a site; I end up filtering and trying to get as many products as possible, but never all of them. Any suggestions are very welcome. Thank you!
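One generic workaround for this kind of cap (not specific to this site) is to partition the catalog by a numeric filter such as price into slices small enough that each slice stays under the ~10k limit, then scrape each slice separately. Below is a minimal sketch of the partitioning logic; `count_products` is a hypothetical callback you would implement with a request to the filtered page, and the uniform toy catalog is made up for illustration.

```python
# Sketch: recursively bisect a price range until every slice holds at most
# `cap` products (416 pages x 24 items). `count_products(lo, hi)` is a
# hypothetical callback that returns how many products fall in [lo, hi).

def split_ranges(lo, hi, count_products, cap=416 * 24):
    if count_products(lo, hi) <= cap or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_ranges(lo, mid, count_products, cap)
            + split_ranges(mid, hi, count_products, cap))

# Toy example: pretend products are spread uniformly over a 0-1000 price range.
fake_counts = lambda lo, hi: (hi - lo) * 40  # 40 products per price unit
ranges = split_ranges(0, 1000, fake_counts)
print(len(ranges), ranges[:2])  # 8 [(0, 125), (125, 250)]
```

In practice the counts are not uniform, so the recursion naturally produces narrow slices where the catalog is dense and wide ones where it is sparse.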

Here is the code I use to scrape the page:

import requests
import time
import re
import json
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()

def html(url):
    try:
        # Certificate verification is disabled, hence the urllib3 warning
        # suppression above
        response = requests.get(url, verify=False)
        return BeautifulSoup(response.content, 'html.parser', from_encoding="utf-8")
    except Exception as e:
        print(e)
        print("Not loading")
        return None

def product_info(prod):
    quote = {}
    # Skip unavailable products
    if prod.find("span", {"class": re.compile(r"UnavailableTextMessage")}):
        return None
    quote['id'] = prod.find("a").get("href").replace("?", "/", 1).split("/")[2]
    quote['name'] = prod.find("h2").getText()
    quote['price'] = prod.find("span", {"class": re.compile(r"PriceUI-bwhjk3-11")}).getText().replace(".", "").split(" ")[-1]
    quote['full_price'] = quote['price']
    quote['discount'] = ''
    discount = prod.find("span", {"class": re.compile(r"TextUI-xlll2j-3")})
    if discount:
        quote['discount'] = discount.getText().replace("%", "")
        # When there is a discount, the undiscounted price sits in a separate span
        quote['full_price'] = prod.find("span", {"class": re.compile(r"PriceUI-sc-1q8ynzz-0")}).getText().replace(".", "").split(" ")[-1]
    # Flag international (cross-border) offers
    quote['inter'] = 1 if prod.find("span", {"class": re.compile(r"InternationalText")}) else 0
    quote['url'] = "https://americanas.com.br" + prod.find("a").get("href")
    return quote



test = "https://www.americanas.com.br/hotsite/todas-ofertas-mundo"
products = []  # capped at roughly 10k (416 pages x 24 products)
counter = 2  # note: the base URL (page 1) is not scraped by this loop
while True:
    url = test + "/pagina-" + str(counter)
    counter += 1
    soup = html(url)
    print(url)
    if soup is None:
        break
    content = soup.find_all("div", {"class": "product-grid-item"})
    if not content:
        print(counter)
        break
    for cont in content:
        info = product_info(cont)  # call once and reuse the result
        if info:
            products.append(info)

Upvotes: 1

Views: 1495

Answers (1)

Paul M.

Reputation: 10799

You had the right idea with looking for an API. If you log your network traffic, and visit one of the product pages, you'll see requests being made to several APIs.

The first one returns a collection of product ids. Notice the query string parameters offset and limit. In this example, I set the offset to "0" (so that we start at the first product), and the limit to "10", to retrieve the product ids for the first ten products:

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        print(product["id"])

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

158285472
107684121
88842655
88899155
84894032
94728488
107684117
84894015
80349294
84894042

Combining this with another API, you can get specific information for each product, given the product id:

def get_product_info(product_id):

    import requests

    url = "https://restql-server-api-v2-americanas.b2w.io/run-query/catalogo/product-buybox/5"

    params = {
        "c_opn": "",
        "id": product_id,
        "offerLimit": "1",
        "opn": "",
        "tags": "prebf*|SUL_SUDESTE_CENTRO|livros_prevenda"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    info = response.json()

    return info["product"]["result"]["name"], info["installment"]["result"][0][0]["total"]

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        name, price = get_product_info(product["id"])
        print(f"The name is \"{name}\" and the price is {price}.")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

The name is "Smartwatch Esportivo Blitzwolf ® BW-HL1 ip68 e Multi Idiomas" and the price is 197.17.
The name is "Bebe reborn girafinha" and the price is 466.87.
The name is "Boneca Bebe Reborn 45 Cm corpo todo de Silicone Boneca Menina Reborn Realista bebes cabelo e olhos castanhos NPKDOLL" and the price is 400.28.
The name is "Boneca Bebê Reborn 43cm Corpo Todo Silicone - Menina com Cabelo Cacheado e Ursinho de pelúcia KAYDORA" and the price is 397.48.
The name is "Boneca Bebe Reborn Menina com roupa de Pandinha 47 cm NPKDOLL" and the price is 329.28.
The name is "Fones De Ouvido Sem Fio Bluetooth Xiaomi Redmi Airdots" and the price is 256.2.
The name is "Boneca Bebe Reborn Menino Girafinha 48 Cm Menino com Pelucia Girafa Azul NPKDOLL" and the price is 464.63.
The name is "Boneca Bebê Reborn Menina Realista de Silicone e Algodão 48cm e Girafinha NPKDOLL" and the price is 289.96.
The name is "Mini Caixa de Som Portátil Speaker  a Prova D’Água - Xiaomi" and the price is 165.2.
The name is "Boneca Bebe Reborn Menina princesa com casaco de inverno de coelhinho 45 cm NPKDOLL" and the price is 372.4.

You get the idea. I haven't actually tried to set the limit query string parameter to anything other than ten, so you may want to play around with that.
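To walk the whole result set, you would step `offset` in increments of `limit` until the API returns an empty product list. Here is a minimal sketch of that loop; `fetch_page` stands in for the `requests.get` call above (it would set `params["offset"] = str(offset)` and return `response.json()["products"]`), and the toy catalog below is only there to make the sketch runnable. I haven't verified whether the server caps how far `offset` can go.

```python
# Sketch: page through a search API by stepping `offset` in increments of
# `limit`. `fetch_page(offset, limit)` is a stand-in for the real request and
# should return the batch of products at that offset (empty when exhausted).

def iter_products(fetch_page, limit=24):
    offset = 0
    while True:
        batch = fetch_page(offset, limit)
        if not batch:
            break
        yield from batch
        offset += limit

# Toy fetcher simulating a catalog of 100 product ids.
catalog = list(range(100))
fake_fetch = lambda offset, limit: catalog[offset:offset + limit]
ids = list(iter_products(fake_fetch))
print(len(ids))  # 100
```

With the real API you would also want a small delay between requests to stay polite.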

Upvotes: 2
