Reputation: 130
I'm trying to scrape a website with Python, but I can't find the right API request to make with requests, so I can't get the product information.
This is the website. Is anyone able to get the API response with the product information, such as name and price? Note: the site loads more products as you scroll down.
https://www.atacadao.com.br/bebidas/
If I can't do it with requests, I'll probably fall back to Selenium, which I'd really like to avoid because it is inefficient for scraping.
Thanks in advance :)
Upvotes: 0
Views: 192
Reputation: 143032
Using DevTools in Firefox/Chrome (tab: Network, filter: xhr), I found that JavaScript reads the data as JSON from a URL.
So with requests I can run
import requests
url = 'https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance'
r = requests.get(url)
print(r.text[:1000])  # show only beginning of data
print('------------')
data = r.json()
for item in data['results'][:3]:  # I use `[:3]` to show only first three results
    #print(item.keys())
    #for key, val in item.items():
    #    print(f'{key}: {val}')
    print('name:', item['name'])
    print('price:', item['price'])
    print('url:', item['url'])
    print('---')
to get
{"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}, "results": [{"pk": 4854, "full_display": "Refrigerante lata 350ml - Coca Cola", "name": "Refrigerante", "brand": "Coca Cola", "type": "", "category": "Refrigerantes", "unit": "UN", "cart": {"cart": false, "multiplier": "", "count": "", "distributor_id": null, "distributor_name": null}, "photo_url": ["https://media.cotabest.com.br/media/sku/refrigerante-coca-cola-lata-350ml-coca-cola-un.png"], "price": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "highlight": true, "price_statistics": {"quantity_prices": 20, "discount": 31, "cheaper": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "expensive": "2.99"}, "multipliers": [{"unit_price": "2.05", "multiplier": "6.00", "distributor_id": 84022367}, {"unit_price": "2.05", "multiplier": "6.0
------------
name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---
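As the output shows, item['price'] is itself a dictionary, and its 'price' value is a string with a decimal comma ('2,05'), while other fields in the same JSON (such as "expensive": "2.99") use a dot. A minimal sketch of turning such strings into numbers could look like this. The name parse_price is made up for this example, and treating '.' as a thousands separator whenever a comma is present is an assumption, since no such value appears in the sample.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '2,05' or '2.99' into a Decimal."""
    if ',' in text:
        # assumption: Brazilian format, '.' groups thousands and ',' marks decimals
        text = text.replace('.', '').replace(',', '.')
    return Decimal(text)


# price dictionary copied from the first result shown above
price_info = {'price': '2,05', 'multiplier': 6.0,
              'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}

print(parse_price(price_info['price']))  # 2.05
```

Decimal is used instead of float so that values like 2,05 stay exact when they are summed or compared.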
The URL has page=1, so I can use it with different values to load other pages.
But I will use a dictionary with params to make it simpler
url = 'https://www.atacadao.com.br/catalogo/search/'
payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}
payload['page'] = 1  # 2, 3, etc.
r = requests.get(url, params=payload)
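To see what query string such a dictionary turns into without sending any request, here is a small sketch using urllib.parse.urlencode from the standard library. requests does its own encoding internally, so treat this as an illustration of the idea rather than its exact code path. build_query is a name made up for this example.

```python
from urllib.parse import urlencode


def build_query(page):
    """Return the query string for the given page number."""
    payload = {
        'q': '',
        'category_id': 'null',
        'category[]': 'bebidas',
        'page': page,
        'order_by': '-relevance',
    }
    return urlencode(payload)


print(build_query(2))
# q=&category_id=null&category%5B%5D=bebidas&page=2&order_by=-relevance
```

The brackets of category[] come out percent-encoded as %5B%5D, which a server normally decodes back to category[].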
Full code
import requests
url = 'https://www.atacadao.com.br/catalogo/search/'
payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}
for number in range(1, 6):
    print('\n=== page:', number, '===\n')
    payload['page'] = number
    r = requests.get(url, params=payload)
    #print(r.text[:1000])
    data = r.json()
    for item in data['results']:  #[:3]:  # I use `[:3]` to show only first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')
Result:
=== page: 1 ===
name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---
=== page: 2 ===
name: Whisky
price: {'price': '39,83', 'multiplier': 1.0, 'distributor_name': 'ATACADÃO CD IGARASSU', 'distributor_id': 95849062}
url: /whisky-escoces-passport-garrafa-1litro
---
name: Refrigerante
price: {'price': '1,95', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD MANAUS', 'distributor_id': 84019700}
url: /refrigerante-laranja-fanta-lata-350ml
---
name: Suco Integral
price: {'price': '10,97', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD VILA VELHA', 'distributor_id': 96142380}
url: /suco-integral-sabor-uva-aurora-vidro-15litros
---
BTW: In the JSON you can see
"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}
and you could use a while True loop with number = data["paginator"]["next"] to load all pages.
I checked that the last page has an empty string in next.
number = 1  # uses `requests`, `url` and `payload` from the full code above
while True:
    print('\n=== page:', number, '===\n')
    payload['page'] = number
    r = requests.get(url, params=payload)
    #print(r.text[:1000])  # show only beginning of data
    data = r.json()
    for item in data['results'][:3]:  # show only first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')
    number = data['paginator']['next']  # empty string on the last page
    if not number:
        break
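The same "follow next until it is empty" idea can also be wrapped in a generator, which keeps the paging logic separate from the printing. This is only a sketch: iter_items, fetch_page and max_pages are names made up for this example, and the fake pages below are invented data that mimic the "paginator" and "results" keys shown above so the code runs without network access.

```python
def iter_items(fetch_page, start=1, max_pages=None):
    """Yield every item, following data['paginator']['next'] until it is empty."""
    number = start
    pages_read = 0
    while number:
        data = fetch_page(number)
        yield from data['results']
        pages_read += 1
        if max_pages is not None and pages_read >= max_pages:
            break  # optional safety limit while testing
        number = data['paginator']['next']  # empty string on the last page


# Stand-in for the real site (made-up data, same shape as the real JSON).
FAKE_PAGES = {
    1: {'paginator': {'next': 2}, 'results': [{'name': 'A'}, {'name': 'B'}]},
    2: {'paginator': {'next': ''}, 'results': [{'name': 'C'}]},
}


def fake_fetch(number):
    """Return the parsed JSON for one page of the fake catalogue."""
    return FAKE_PAGES[number]


names = [item['name'] for item in iter_items(fake_fetch)]
print(names)  # ['A', 'B', 'C']
```

Against the real site, fetch_page could be a small function that sets payload['page'], calls requests.get(url, params=payload, timeout=10) and returns r.json(). With 157 pages reported in "last", passing max_pages while experimenting avoids hammering the server.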
I put the code from my answer on GitHub, in python-examples, in the folder __scraping__
Upvotes: 2