renan

Reputation: 130

Python unable to get API with requests (tags: web scraping, requests, API)

I'm trying to scrape a website with Python, but I can't find the right API endpoint with requests, so I can't get the product information.

This is the website; is anyone able to get the API response with product information, like name and price? Note: the site's products load as you scroll down.

https://www.atacadao.com.br/bebidas/

If I'm not able to do it with requests, I'll probably go with Selenium, which I really wanted to avoid because of its poor efficiency for scraping.

Thanks in advance :)

Upvotes: 0

Views: 192

Answers (1)

furas

Reputation: 143032

Using DevTools in Firefox/Chrome (tab: Network, filter: XHR), I found that JavaScript reads the data as JSON from the URL

https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance

So using requests I can run:

import requests

url = 'https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance'

r = requests.get(url)

print(r.text[:1000])   # show only beginning of data
print('------------')

data = r.json()

for item in data['results'][:3]:  # I use `[:3]` to show only first three results
    #print(item.keys())

    #for key, val in item.items():
    #    print(f'{key}: {val}')

    print('name:', item['name'])
    print('price:', item['price'])
    print('url:', item['url'])

    print('---')

to get

{"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}, "results": [{"pk": 4854, "full_display": "Refrigerante lata 350ml - Coca Cola", "name": "Refrigerante", "brand": "Coca Cola", "type": "", "category": "Refrigerantes", "unit": "UN", "cart": {"cart": false, "multiplier": "", "count": "", "distributor_id": null, "distributor_name": null}, "photo_url": ["https://media.cotabest.com.br/media/sku/refrigerante-coca-cola-lata-350ml-coca-cola-un.png"], "price": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "highlight": true, "price_statistics": {"quantity_prices": 20, "discount": 31, "cheaper": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "expensive": "2.99"}, "multipliers": [{"unit_price": "2.05", "multiplier": "6.00", "distributor_id": 84022367}, {"unit_price": "2.05", "multiplier": "6.0
------------
name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---
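As the output shows, `price` is a nested dictionary, and the price itself is a string with a comma as the decimal separator (e.g. `'2,05'`). A minimal sketch of converting it to a float, assuming the format is consistent (the `parse_price` helper is my own, not part of the API):

```python
def parse_price(price_str):
    """Convert a comma-decimal price string like '2,05' to a float.

    Assumes '.' is a thousands separator (if present)
    and ',' is the decimal separator.
    """
    return float(price_str.replace('.', '').replace(',', '.'))

print(parse_price('2,05'))      # 2.05
print(parse_price('1.299,90'))  # 1299.9
```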

The URL has page=1, so I can change that value to load other pages.

But I will use a dictionary of params to make it simpler:

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

payload['page'] = 1  # set to 2, 3, etc. for other pages

r = requests.get(url, params=payload)

Full code

import requests

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

for number in range(1, 6):
    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])

    data = r.json()

    for item in data['results']:  # add `[:3]` to show only the first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')

Result:

=== page: 1 ===

name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---

=== page: 2 ===

name: Whisky
price: {'price': '39,83', 'multiplier': 1.0, 'distributor_name': 'ATACADÃO CD IGARASSU', 'distributor_id': 95849062}
url: /whisky-escoces-passport-garrafa-1litro
---
name: Refrigerante
price: {'price': '1,95', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD MANAUS', 'distributor_id': 84019700}
url: /refrigerante-laranja-fanta-lata-350ml
---
name: Suco Integral
price: {'price': '10,97', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD VILA VELHA', 'distributor_id': 96142380}
url: /suco-integral-sabor-uva-aurora-vidro-15litros
---

BTW:

In JSON you can see

"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}

and you could use a `while True` loop with `number = data['paginator']['next']` to load all pages.

I checked that the last page has an empty string in `next`.

number = 1

while True:

    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])   # show only beginning of data

    data = r.json()

    for item in data['results'][:3]:   # show only first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')
        
    number = data['paginator']['next']
    
    if not number:
        break
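
Instead of only printing, the same loop could collect every product into a list of rows and save them to a CSV file. A minimal sketch of the collecting/saving part (the `flatten` helper and the file name `bebidas.csv` are my own choices; in the loop above you would pass `data['results']` instead of the sample item copied from the JSON):

```python
import csv

def flatten(results):
    """Extract name, price string, and url from the API's 'results' list."""
    return [
        {'name': item['name'],
         'price': item['price']['price'],
         'url': item['url']}
        for item in results
    ]

# sample item copied from the JSON shown earlier
sample = [{
    'name': 'Refrigerante',
    'price': {'price': '2,05', 'multiplier': 6.0,
              'distributor_name': 'ATACADÃO CD BELÉM',
              'distributor_id': 84022367},
    'url': '/refrigerante-coca-cola-lata-350ml',
}]

rows = flatten(sample)

with open('bebidas.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price', 'url'])
    writer.writeheader()
    writer.writerows(rows)
```

In the real loop you would append `flatten(data['results'])` for every page and write the file once at the end; a short `time.sleep()` between requests would also be polite to the server.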

I put the code from my answer on GitHub in python-examples, in the folder __scraping__.

Upvotes: 2
