Reputation: 78
I'm fairly new to BeautifulSoup. I want to buy 2 new RAM sticks, so I went to the website I use to compare component prices across many stores. I want to extract the price information for all the products and, if 2 or more of them are sold in the same store, save it to a txt file.
This is the page I use: https://www.solotodo.cl. When I append the product path to the URL and load it in my code, the table tag doesn't appear when I use the find_all() function:
for i in hrefs:
    respond = requests.get(url_base + i)
    sopa = BeautifulSoup(respond.text, 'html.parser')
    tabla = sopa.find_all('table', {"class": "table table-sm mb-0"})
    print(tabla)
When it is printed to the console, only an empty list [] is shown.
So I wanted to check whether everything was in order. I went up to the parent div to see if I could work from there, but when I printed it to the console, the table part still did not appear:
for i in hrefs:
    respond = requests.get(link_base + i)
    sopa = BeautifulSoup(respond.text, 'html.parser')
    tabla = sopa.find("div", {"class": "content-card", "id": "product-prices-table"})
    print(tabla)
This is what I see:
<div class="content-card" id="product-prices-table"><div class="d-flex justify-content-end flex-wrap"><div class="mt-2"><butt...
The problem is that between the first two tags there should be the table tag, something like this:
<div class="content-card" ...><table class="table table-sm mb-0"><div ...>
So my question is: how can I get the table with the prices?
Upvotes: 1
Views: 332
Reputation: 84465
Explanation:
The page makes a number of dynamic requests to build the content of that table. You can recreate this by working out what the browser is doing.
You can extract the preferred stores (ids and names) from a script tag at the initial endpoint:
r = s.get('https://www.solotodo.cl/products/66417-kingston-hyperx-fury-black-hx426c16fb38-1-x-8gb-dimm-ddr4-2666')
data = json.loads(re.search(r'(\{"props.*?)<', r.text).group(1))
store_dict = {i['id']:i['name'] for i in data['props']['pageProps']['preferredCountryStores']}
Also extract the product id for later use:
product_id = data['props']['pageProps']['product']['id']
Using those store ids you can then request the ratings from an API and store them as a dict of store id:rating pairs:
ratings_url = f"https://publicapi.solotodo.com/stores/average_ratings/?{''.join([f'&ids={k}' for k in store_dict.keys()])}".replace('?&ids','?ids')
ratings_dict = {i['store'].split('/')[-2]:i['rating'] for i in requests.get(ratings_url).json()}
You can then use the store ids and product id from earlier to retrieve the prices from API:
prices_url = f'https://publicapi.solotodo.com/products/available_entities/?ids={product_id}' + \
''.join([f'&stores={k}' for k in store_dict.keys()])
r = s.get(prices_url).json()
From the response json you can extract the required price info.
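As a sketch of that extraction step in isolation (the response shape below is a mock, inferred from the fields used in the full script; real values will differ):

```python
# Mocked response mimicking the shape of the available_entities JSON
# (structure assumed from the fields used in the full script; values made up).
mock = {
    'results': [{
        'entities': [
            {'store': 'https://publicapi.solotodo.com/stores/30/',
             'active_registry': {'normal_price': '54990.0', 'offer_price': '49990.0'}},
        ]
    }]
}

for entity in mock['results'][0]['entities']:
    # The store URL ends in a trailing slash, so the id is the second-to-last segment
    store_id = entity['store'].split('/')[-2]
    normal = float(entity['active_registry']['normal_price'])
    offer = float(entity['active_registry']['offer_price'])
    print(store_id, normal, offer)
```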
You can then recreate the table, with some formatting and handling of the case where a store has no rating:
import requests, re, json
import pandas as pd

with requests.Session() as s:
    r = s.get('https://www.solotodo.cl/products/66417-kingston-hyperx-fury-black-hx426c16fb38-1-x-8gb-dimm-ddr4-2666')
    data = json.loads(re.search(r'(\{"props.*?)<', r.text).group(1))
    store_dict = {i['id']: i['name'] for i in data['props']['pageProps']['preferredCountryStores']}
    product_id = data['props']['pageProps']['product']['id']
    ratings_url = f"https://publicapi.solotodo.com/stores/average_ratings/?{''.join([f'&ids={k}' for k in store_dict.keys()])}".replace('?&ids', '?ids')
    ratings_dict = {i['store'].split('/')[-2]: i['rating'] for i in s.get(ratings_url).json()}
    prices_url = f'https://publicapi.solotodo.com/products/available_entities/?ids={product_id}' + \
                 ''.join([f'&stores={k}' for k in store_dict.keys()])
    r = s.get(prices_url).json()

tiendas = []
ratings = []
ofertas = []
normales = []

for i in r['results'][0]['entities']:
    tiendas.append(store_dict[int(i['store'].split('/')[-2])])
    try:
        ratings.append(ratings_dict[i['store'].split('/')[-2]])
    except KeyError:
        ratings.append('No rating')
    normales.append('{:.3f}'.format(float(i['active_registry']['normal_price']) / 1000))
    ofertas.append('{:.3f}'.format(float(i['active_registry']['offer_price']) / 1000))

df = pd.DataFrame([tiendas, ratings, ofertas, normales]).T
df.columns = ['Tienda', 'Rating', 'P.Oferta', 'P.Normal']
print(df.head(5))
Sample output:
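Since the original goal was to save the results to a txt file, the DataFrame built above can be written out with to_csv. A minimal self-contained sketch (the rows and the filename precios.txt are made up for illustration):

```python
import pandas as pd

# Toy rows standing in for the scraped data (store names and prices are made up)
df = pd.DataFrame(
    [['Winpy', 4.5, '49.990', '54.990'],
     ['SpDigital', 4.2, '51.990', '55.990']],
    columns=['Tienda', 'Rating', 'P.Oferta', 'P.Normal'])

# Tab-separated plain-text file, one store per row
df.to_csv('precios.txt', sep='\t', index=False)

print(open('precios.txt').read())
```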
Using Chrome dev tools:
You can press F12 to open dev tools in Chrome, press F5 to refresh your chosen webpage, then inspect the recorded network activity in the Network tab.
With this recorded activity you can hunt for target values and determine how the browser, in this particular case, is obtaining content.
Example:
For more info on using dev tools in this way see 1 and 2
Regex:
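The regex used above grabs the JSON state blob embedded in a script tag of the initial page: it captures everything from {"props up to the next <. A minimal demonstration on a mocked fragment of page source (the script tag and its contents here are illustrative, not the real page):

```python
import re, json

# Mocked fragment of page source: the JSON state sits inside a <script> tag,
# so a non-greedy match from '{"props' up to the next '<' captures exactly the JSON.
html = ('<script type="application/json">'
        '{"props": {"pageProps": {"product": {"id": 66417}}}}'
        '</script>')

data = json.loads(re.search(r'(\{"props.*?)<', html).group(1))
print(data['props']['pageProps']['product']['id'])  # the product id embedded in the page
```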
Upvotes: 1