Reputation: 371
Thanks to SO and @QHarr, the following code works fine with URLs such as
https://www.amazon.com/dp/B00FSCBQV2
But it doesn't work with a URL such as this one:
https://www.amazon.com/dp/B01N1ZD912/
My result is:
'R1_NO' :'.zg_hrsr { margin: 0; padding: 0; list-style-type:
none;}\n.zg_hrsr_item { margin: 0 0 0 10px; }\n.zg_hrsr_rank {
display:inline-block; width: 80px; text-align: right; }'}'
It should return:
R1_NO = 42553
R1_CAT = Baby Care Products
R2_NO = 6452
R2_CAT = Baby Bathing Products (Health & Household)
This is because the ranking data is not on the first line for this product. What needs to be done to get the desired results? Also, can this script be condensed or made more efficient?
I've tried handling it with bs4's select_one and with getting the text and stripping it, but nothing I do works. Please help me!
fields = ['Amazon Best Sellers Rank']
temp_dict = {}
for field in fields:
    element = soup.select_one('li:contains("' + field + '")')
    if element is None:
        temp_dict[field] = 'NA'
    else:
        if field == 'Amazon Best Sellers Rank':
            item='NA'
            item = [re.sub('#|\(','', string).strip() for string in soup.select_one('li:contains("' + field + '")').stripped_strings][1].split(' in ')
            temp_dict[field] = item
        else:
            item = [string for string in element.stripped_strings][1]
            temp_dict[field] = item.replace('(', '').strip()
ranks = soup.select('.zg_hrsr_rank')
ladders = soup.select('.zg_hrsr_ladder')
if ranks:
    cat_nos = [item.text.split('#')[1] for item in ranks]
else:
    cat_nos = ['NA']
if ladders:
    cats = [item.text.split('\xa0')[1] for item in soup.select('.zg_hrsr_ladder')]
else:
    cats = ['NA']
rankings = dict(zip(cat_nos, cats))
map_dict = {'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']}
final_dict = {}
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'
for k,v in temp_dict.items():
    if k == 'Amazon Best Sellers Rank' and v!= 'NA':
        item = dict(zip(map_dict[k],v))
        final_dict = {**final_dict, **item}
    elif k == 'Amazon Best Sellers Rank' and v == 'NA':
        item = dict(zip(map_dict[k], [v, v]))
        final_dict = {**final_dict, **item}
    else:
        final_dict[map_dict[k]] = v
for k,v in enumerate(rankings):
    #print(k + 1, v, rankings[v])
    prefix = 'R' + str(k + 2) + '_'
    final_dict[prefix + 'NO'] = v
    final_dict[prefix + 'CAT'] = rankings[v]
I expect it to handle and return values for both URLs posted in the question.
Upvotes: 1
Views: 174
Reputation: 84465
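Because the HTML layout differs between the two pages, picking the second stripped string returns the inline CSS instead of the rank on the second page. You could shorten the code and match the ranks with a regex on the element's text instead. The regex could be tightened, but I will wait and see whether you find failing cases first.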
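To try the regex idea without requesting any page, here is a minimal offline sketch. It uses the same pattern as the full script below. The sample text is made up to imitate a page where inline CSS precedes the ranks, and `parse_ranks` is a hypothetical helper name, not part of the original code.

```python
import re

# Same pattern as in the full script below.
RANK_PATTERN = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')


def parse_ranks(text):
    """Map each '#<number> in <category>' match to R<i>_NO / R<i>_CAT keys."""
    result = {}
    for i, (number, category) in enumerate(RANK_PATTERN.findall(text), start=1):
        result[f'R{i}_NO'] = number
        result[f'R{i}_CAT'] = category.strip()
    return result


# Made-up text (an assumption, not real page content): inline CSS comes
# first, then two ranking lines.
sample = (
    "Amazon Best Sellers Rank: "
    ".zg_hrsr { margin: 0; padding: 0; list-style-type: none; }\n"
    "#42,553 in Baby Care Products\n"
    "#6,452 in Baby Bathing Products (Health & Household)\n"
)

print(parse_ranks(sample))
# {'R1_NO': '42,553', 'R1_CAT': 'Baby Care Products',
#  'R2_NO': '6,452', 'R2_CAT': 'Baby Bathing Products'}
```

The CSS is skipped because the pattern only starts matching at a `#` followed by digits. The parenthesised part of a category is dropped because `(` is not in the pattern's character class, and the rank keeps its thousands separator; `number.replace(',', '')` would remove it if plain digits are wanted.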
import requests
from bs4 import BeautifulSoup as bs
import re
links = ['https://www.amazon.com/dp/B00FSCBQV2?th=1','https://www.amazon.com/dp/B01N1ZD912/','https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']
map_dict = {'Product Dimensions': 'dimensions', 'Shipping Weight': 'weight', 'Item model number': 'Item_No', 'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']}
# This handles when a ranking is from 1 to x,xxx,xxx
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')
with requests.Session() as s:
    for link in links:
        # The separator in the User-Agent must be a forward slash: '\5' is an octal escape
        r = s.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
        final_dict = {}
        for field in fields:
            # Newer soupsieve releases prefer ':-soup-contains(...)' over ':contains(...)'
            element = soup.select_one('li:contains("' + field + '")')
            if element is None:
                # Field not on the page: fill in placeholders
                if field == 'Amazon Best Sellers Rank':
                    item = dict(zip(map_dict[field], ['N/A','N/A']))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[field]] = 'N/A'
            else:
                if field == 'Amazon Best Sellers Rank':
                    # Match every '#<number> in <category>' pair in the element's text
                    text = element.text
                    i = 1
                    for x,y in p.findall(text):
                        prefix = 'R' + str(i) + '_'
                        final_dict[prefix + 'NO'] = x
                        final_dict[prefix + 'CAT'] = y.strip()
                        i+=1
                else:
                    item = [string for string in element.stripped_strings][1]
                    final_dict[map_dict[field]] = item.replace('(', '').strip()
        print(final_dict)
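The code in the question pre-fills R2 to R4 with 'NA', which the script above only does for R1 when the field is missing. One possible way to condense that step is sketched below, assuming at most four ranks are wanted. `with_defaults` and `MAX_RANKS` are hypothetical names, and the `scraped` dict is made-up input.

```python
MAX_RANKS = 4  # assumption: the question only tracks R1..R4


def with_defaults(found, max_ranks=MAX_RANKS, missing='N/A'):
    """Return a dict holding every R<i>_NO / R<i>_CAT key, padding gaps with `missing`."""
    defaults = {
        f'R{i}_{suffix}': missing
        for i in range(1, max_ranks + 1)
        for suffix in ('NO', 'CAT')
    }
    # Values that were actually scraped override the placeholders.
    return {**defaults, **found}


# Made-up input: only one rank was found on the page.
scraped = {'R1_NO': '42,553', 'R1_CAT': 'Baby Care Products'}
print(with_defaults(scraped))
```

Merging the scraped values over a dict of placeholders replaces the six separate `final_dict[...] = 'NA'` assignments and keeps the key order stable from R1 to R4.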
Upvotes: 1