Espuky

Reputation: 33

Web scraping with page limit

I have been trying to scrape the products on this site (https://www.americanas.com.br/hotsite/todas-ofertas-mundo) with BeautifulSoup. I can get all the items on one page, and since the pagination is in the URL, I just move on to the next page with a counter (e.g. page 2 is https://www.americanas.com.br/hotsite/todas-ofertas-mundo/pagina-2, and so on). The problem is that the maximum page is 416; after that number, the page does not show any products. Since each page shows 24 products, I can barely reach 10k products (out of a total of 4 million according to the site).

I attempted to go deeper into the categories, but I hit the same problem (some deeper categories also have more than 10k products). I also tried filtering by "marca", "price" and "loja", with the same issue. So even with the finest filter, I can't get all the products, because I reach the maximum page within it.

I also searched for an API to try to bypass this, but couldn't find any that lets me request the catalog without a product ID. I did find one for getting the different brands and sellers, but it has the same limit on the number of products.

This is an issue that has troubled me on other marketplaces as well, making it really hard to get the ENTIRE catalog of a site; I end up filtering and trying to get as many products as possible, but never all of them. Any suggestions are very welcome. Thank you!
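One generic workaround for this kind of cap (not specific to this site) is to partition the catalog by a numeric filter such as price into slices small enough that each slice stays under the ~10k limit, then scrape each slice separately. Below is a minimal sketch of the partitioning logic; `count_products` is a hypothetical callback you would implement with a request to the filtered page, and the uniform toy catalog is made up for illustration.

```python
# Sketch: recursively bisect a price range until every slice holds at most
# `cap` products (416 pages x 24 items). `count_products(lo, hi)` is a
# hypothetical callback that returns how many products fall in [lo, hi).

def split_ranges(lo, hi, count_products, cap=416 * 24):
    if count_products(lo, hi) <= cap or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_ranges(lo, mid, count_products, cap)
            + split_ranges(mid, hi, count_products, cap))

# Toy example: pretend products are spread uniformly over a 0-1000 price range.
fake_counts = lambda lo, hi: (hi - lo) * 40  # 40 products per price unit
ranges = split_ranges(0, 1000, fake_counts)
print(len(ranges), ranges[:2])  # 8 [(0, 125), (125, 250)]
```

In practice the counts are not uniform, so the recursion naturally produces narrow slices where the catalog is dense and wide ones where it is sparse.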

Here is the code I use to scrape the page:

import requests
import time
import re
import json
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()

def html(url):
    try:
        # Certificate verification is disabled, hence the urllib3 warning
        # suppression above
        response = requests.get(url, verify=False)
        return BeautifulSoup(response.content, 'html.parser', from_encoding="utf-8")
    except Exception as e:
        print(e)
        print("Not loading")
        return None

def product_info(prod):
    quote = {}
    # Skip unavailable products
    if prod.find("span", {"class": re.compile(r"UnavailableTextMessage")}):
        return None
    quote['id'] = prod.find("a").get("href").replace("?", "/", 1).split("/")[2]
    quote['name'] = prod.find("h2").getText()
    quote['price'] = prod.find("span", {"class": re.compile(r"PriceUI-bwhjk3-11")}).getText().replace(".", "").split(" ")[-1]
    quote['full_price'] = quote['price']
    quote['discount'] = ''
    discount = prod.find("span", {"class": re.compile(r"TextUI-xlll2j-3")})
    if discount:
        quote['discount'] = discount.getText().replace("%", "")
        # When there is a discount, the undiscounted price sits in a separate span
        quote['full_price'] = prod.find("span", {"class": re.compile(r"PriceUI-sc-1q8ynzz-0")}).getText().replace(".", "").split(" ")[-1]
    # Flag international (cross-border) offers
    quote['inter'] = 1 if prod.find("span", {"class": re.compile(r"InternationalText")}) else 0
    quote['url'] = "https://americanas.com.br" + prod.find("a").get("href")
    return quote



test = "https://www.americanas.com.br/hotsite/todas-ofertas-mundo"
products = []  # capped at roughly 10k (416 pages x 24 products)
counter = 2  # note: the base URL (page 1) is not scraped by this loop
while True:
    url = test + "/pagina-" + str(counter)
    counter += 1
    soup = html(url)
    print(url)
    if soup is None:
        break
    content = soup.find_all("div", {"class": "product-grid-item"})
    if not content:
        print(counter)
        break
    for cont in content:
        info = product_info(cont)  # call once and reuse the result
        if info:
            products.append(info)

Upvotes: 1

Views: 1495

Answers (1)

Paul M.

Reputation: 10799

You had the right idea with looking for an API. If you log your network traffic, and visit one of the product pages, you'll see requests being made to several APIs.

The first one returns a collection of product ids. Notice the query string parameters offset and limit. In this example, I set the offset to "0" (so that we start at the first product), and the limit to "10", to retrieve the product ids for the first ten products:

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        print(product["id"])

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

158285472
107684121
88842655
88899155
84894032
94728488
107684117
84894015
80349294
84894042

Combining this with another API, you can get specific information for each product, given the product id:

def get_product_info(product_id):

    import requests

    url = "https://restql-server-api-v2-americanas.b2w.io/run-query/catalogo/product-buybox/5"

    params = {
        "c_opn": "",
        "id": product_id,
        "offerLimit": "1",
        "opn": "",
        "tags": "prebf*|SUL_SUDESTE_CENTRO|livros_prevenda"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    info = response.json()

    return info["product"]["result"]["name"], info["installment"]["result"][0][0]["total"]

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        name, price = get_product_info(product["id"])
        print(f"The name is \"{name}\" and the price is {price}.")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

The name is "Smartwatch Esportivo Blitzwolf ® BW-HL1 ip68 e Multi Idiomas" and the price is 197.17.
The name is "Bebe reborn girafinha" and the price is 466.87.
The name is "Boneca Bebe Reborn 45 Cm corpo todo de Silicone Boneca Menina Reborn Realista bebes cabelo e olhos castanhos NPKDOLL" and the price is 400.28.
The name is "Boneca Bebê Reborn 43cm Corpo Todo Silicone - Menina com Cabelo Cacheado e Ursinho de pelúcia KAYDORA" and the price is 397.48.
The name is "Boneca Bebe Reborn Menina com roupa de Pandinha 47 cm NPKDOLL" and the price is 329.28.
The name is "Fones De Ouvido Sem Fio Bluetooth Xiaomi Redmi Airdots" and the price is 256.2.
The name is "Boneca Bebe Reborn Menino Girafinha 48 Cm Menino com Pelucia Girafa Azul NPKDOLL" and the price is 464.63.
The name is "Boneca Bebê Reborn Menina Realista de Silicone e Algodão 48cm e Girafinha NPKDOLL" and the price is 289.96.
The name is "Mini Caixa de Som Portátil Speaker  a Prova D’Água - Xiaomi" and the price is 165.2.
The name is "Boneca Bebe Reborn Menina princesa com casaco de inverno de coelhinho 45 cm NPKDOLL" and the price is 372.4.

You get the idea. I haven't actually tried to set the limit query string parameter to anything other than ten, so you may want to play around with that.
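To walk the whole result set, you would step `offset` in increments of `limit` until the API returns an empty product list. Here is a minimal sketch of that loop; `fetch_page` stands in for the `requests.get` call above (it would set `params["offset"] = str(offset)` and return `response.json()["products"]`), and the toy catalog below is only there to make the sketch runnable. I haven't verified whether the server caps how far `offset` can go.

```python
# Sketch: page through a search API by stepping `offset` in increments of
# `limit`. `fetch_page(offset, limit)` is a stand-in for the real request and
# should return the batch of products at that offset (empty when exhausted).

def iter_products(fetch_page, limit=24):
    offset = 0
    while True:
        batch = fetch_page(offset, limit)
        if not batch:
            break
        yield from batch
        offset += limit

# Toy fetcher simulating a catalog of 100 product ids.
catalog = list(range(100))
fake_fetch = lambda offset, limit: catalog[offset:offset + limit]
ids = list(iter_products(fake_fetch))
print(len(ids))  # 100
```

With the real API you would also want a small delay between requests to stay polite.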

Upvotes: 2
