workin 4weekend

Reputation: 371

Wrong data returned when scraping specific text from specific table elements

Thanks to SO and @QHarr, the following code works fine with URLs such as

https://www.amazon.com/dp/B00FSCBQV2

But it doesn't work with a URL such as this one:

https://www.amazon.com/dp/B01N1ZD912/

My result is:

'R1_NO' :'.zg_hrsr { margin: 0; padding: 0; list-style-type: none;}\n.zg_hrsr_item { margin: 0 0 0 10px; }\n.zg_hrsr_rank { display:inline-block; width: 80px; text-align: right; }'}'

It should return:

R1_NO = 42553 
R1_CAT = Baby Care Products
R2_NO = 6452
R2_CAT = Baby Bathing Products (Health & Household)

This happens because the ranking data is not on the first line for this page. What needs to be done to get the desired results? Also, can this script be condensed or made more efficient?

I've tried handling it with bs4's select_one and with getting the text and stripping it, but nothing I do works. Please help me!

fields = ['Amazon Best Sellers Rank']

            temp_dict = {}

            for field in fields:
                element = soup.select_one('li:contains("' + field + '")')
                if element is None:
                    temp_dict[field] = 'NA'
                else:
                    if field == 'Amazon Best Sellers Rank':
                        item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                        temp_dict[field] = item
                    else:
                        item = [string for string in element.stripped_strings][1]
                        temp_dict[field] = item.replace('(', '').strip()

            ranks = soup.select('.zg_hrsr_rank')
            ladders = soup.select('.zg_hrsr_ladder')

            if ranks:
                cat_nos = [item.text.split('#')[1] for item in ranks]
            else:
                cat_nos = ['NA']

            if ladders:
                cats = [item.text.split('\xa0')[1] for item in ladders]
            else:
                cats = ['NA']

            rankings = dict(zip(cat_nos, cats))

            map_dict = {'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']}

            final_dict = {}

            final_dict['R2_NO'] = 'NA'
            final_dict['R2_CAT'] = 'NA'
            final_dict['R3_NO'] = 'NA'
            final_dict['R3_CAT'] = 'NA'
            final_dict['R4_NO'] = 'NA'
            final_dict['R4_CAT'] = 'NA'

            for k,v in temp_dict.items():
                if k == 'Amazon Best Sellers Rank' and v!= 'NA':
                    item = dict(zip(map_dict[k],v))
                    final_dict = {**final_dict, **item}
                elif k == 'Amazon Best Sellers Rank' and v == 'NA':
                    item = dict(zip(map_dict[k], [v, v]))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[k]] = v

            for k,v in enumerate(rankings):
                #print(k + 1, v, rankings[v])
                prefix = 'R' + str(k + 2) + '_'
                final_dict[prefix + 'NO'] = v
                final_dict[prefix + 'CAT'] = rankings[v]

I expect it to handle both URLs posted in the question and return values for each.

Upvotes: 1

Views: 174

Answers (1)

QHarr

Reputation: 84465

Due to the difference in HTML layout, using stripped_strings leads to the inline CSS being returned. You could try shortening the code and applying a regex to the element's text instead. The regex could be tightened, but I will wait and see if you find failing cases first.

import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/dp/B00FSCBQV2?th=1','https://www.amazon.com/dp/B01N1ZD912/','https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']
map_dict = {'Product Dimensions': 'dimensions', 'Shipping Weight': 'weight', 'Item model number': 'Item_No', 'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']}

# Matches "#<rank> in <category>", where the rank runs from 1 to x,xxx,xxx
p = re.compile(r'#([0-9][0-9,]*)\s+in\s+([A-Za-z&\s]+)')

with requests.Session() as s:
    for link in links:
        r = s.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
        final_dict = {}

        for field in fields:
            element = soup.select_one('li:contains("' + field + '")')
            if element is None:
                if field == 'Amazon Best Sellers Rank':
                    item = dict(zip(map_dict[field], ['N/A','N/A']))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[field]] = 'N/A'
            else:
                if field == 'Amazon Best Sellers Rank':      
                    text = element.text
                    i = 1
                    for x,y in p.findall(text):
                        prefix = 'R' + str(i) + '_'
                        final_dict[prefix + 'NO'] = x  
                        final_dict[prefix + 'CAT'] = y.strip()
                        i+=1
                else:
                    item = [string for string in element.stripped_strings][1]
                    final_dict[map_dict[field]] = item.replace('(', '').strip()
        print(final_dict)
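
To check the regex idea without hitting Amazon, here is a minimal offline sketch. It assumes element.text looks roughly like the made-up sample_text below (inline CSS first, then the rank lines); the sample values and the helper name extract_ranks are illustrative only, not taken from a real page. It uses a pattern of the same shape as the one in the script and additionally strips the thousands separators, on the assumption that plain digits are wanted as in the expected output in the question.

```python
import re

# Pattern of the same shape as in the script: "#<rank> in <category>".
RANK_PATTERN = re.compile(r'#([0-9][0-9,]*)\s+in\s+([A-Za-z&\s]+)')

# Made-up stand-in for element.text on a page where inline CSS comes first.
sample_text = (
    "Amazon Best Sellers Rank:\n"
    ".zg_hrsr { margin: 0; padding: 0; list-style-type: none; }\n"
    ".zg_hrsr_item { margin: 0 0 0 10px; }\n"
    "#42,553 in Health & Household (See Top 100 in Health & Household)\n"
    "#6,452 in Baby Bathing Products\n"
)


def extract_ranks(text):
    """Return a dict with R<n>_NO / R<n>_CAT keys for every rank found in text."""
    ranks = {}
    for i, (number, category) in enumerate(RANK_PATTERN.findall(text), start=1):
        prefix = 'R' + str(i) + '_'
        ranks[prefix + 'NO'] = number.replace(',', '')  # drop thousands separators
        ranks[prefix + 'CAT'] = category.strip()
    return ranks


print(extract_ranks(sample_text))
```

Because the pattern anchors on the leading "#", the CSS rules are skipped regardless of where they appear in the text. Category names containing characters outside letters, "&" and whitespace (for example commas or apostrophes) would be cut short, which is one of the cases where the regex may need tightening.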

Upvotes: 1
