Lalit Joshi

Reputation: 137

BeautifulSoup: How to extract one or two values from li tags under span

I am trying to scrape this url: https://www.amazon.in/Sparx-SM-687-Forest-Golden-SX0687GGFGO_0009/dp/B098BC48PZ/ref=sr_1_73?keywords=mens%2Bshoes&qid=1674804026&sr=8-73&th=1&psc=1

I need a few details about each product. The screenshot below shows the target: I want the Manufacturer and the Item model number.

I tried several of the approaches suggested in:

Extracting a specific list item using Beautiful Soup

BeautifulSoup - search by text inside a tag

I wrote the code below, but it returns every value under the ul > li tags.

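import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
from tqdm import tqdm

# Browser-like user agent, defined up front because every request below passes it
header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

r3 = requests.get('https://www.amazon.in/Sparx-SM-687-Forest-Golden-SX0687GGFGO_0009/dp/B098B9JVBK/ref=sr_1_73?keywords=mens+shoes&qid=1674804026&sr=8-73',headers=header)
soup = BeautifulSoup(r3.text,'lxml')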

Code for getting all the URLs of the products.

# Named prod_urls_kitchen because the scraping loop further down reads from this list
prod_urls_kitchen = []
def get_prod_url(cat):
    # Walk search result pages 1 to 7 and collect the product links
    for page in range(1,8):
        url = f'https://www.amazon.in/s?k={cat}'
        r = requests.get(url+f'&page={page}', headers=header)
        soup3 = BeautifulSoup(r.text, 'lxml')
        if 'kitchen' in url:
            for i in soup3.find_all('h2',{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-4'}):
                for j in i.find_all('a'):
                    prod_urls_kitchen.append(j['href'])
        else:
            for i in soup3.find_all('h2',{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-2'}):
                for j in i.find_all('a'):
                    prod_urls_kitchen.append(j['href'])
    # Return the collected links, not the function object itself
    return prod_urls_kitchen

get_prod_url("men's+shoes")

This gives me the product URLs from every results page of the given category. Code for getting the required information from each product page:

data = []
for i in tqdm(prod_urls_kitchen[0:50]):
    header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    url = 'https://www.amazon.in'+i
    r = requests.get(url,headers=header)
    soup = BeautifulSoup(r.text,'lxml')
    
    try:
        prod_name = soup.find('span',{'id':'productTitle'}).text.strip()
    except:
        prod_name = None
        
    try:
        rating = soup.find('span',{'id':'acrPopover'}).text.strip().replace(' out of 5 stars','')
    except:
        rating = None
        
    try:
        color = soup.find(lambda tag: len(tag.find_all()) == 0 and " Colour: " in tag.text).find_next('span').text.strip()
    except:
        color = None
        
    try:
        table = soup.find('table', {'class':'a-normal a-spacing-micro'})
    except:
        table = None
        
    if table != None:
        try:
            model = table.find('tr',{'class':'a-spacing-small po-model_name'}).find('td',{'class':'a-span9'}).text.strip()
        except:
            model = None
            
        try:
            brand = table.find('tr',{'class':'a-spacing-small po-brand'}).find('td',{'class':'a-span9'}).text.strip()
        except:
            brand = None
    else:
        model = None
        brand = None
    
    try:
        for i in soup.find('ul',{'class':'a-unordered-list a-horizontal a-size-small'}).find_all('li'):
            category = i.find('span',{'class':'a-list-item'}).text.strip()
    except:
        category = None
        
    try:
        asin = soup.find('div',{'id':'title_feature_div'})['data-csa-c-asin']
    except:
        asin = None
        
    try:
        img_url = soup.find('div',{'id':'imgTagWrapperId'}).find('img')['src']
    except:
        img_url = None
        
    try:
        price = soup.find('span',{'class':'a-price-whole'}).text.replace('.','')
    except:
        price = None
        
    try:
        total_rating = soup.find('span',{'id':'acrCustomerReviewText'}).text.strip().replace(' ratings','')
    except:
        total_rating = None
        
    try:
        for i in soup.find('ul',{'class':'a-unordered-list a-vertical a-spacing-mini'}).find_all('li'):
            desc = i.find('span',{'class':'a-list-item'}).text
    except:
        desc = None
        
        
    data.append({'Product_Name':prod_name,
                 'Product_Review':rating,
                 'Product_Color':color,
                 'Product_model':model,
                 'Product_brand':brand,
                 'Product_Category':category,
                 'Product_Description':desc,
                 'Asin_ID':asin,
                 'Product_img_URL':img_url,
                 'Product_Total_Rating':total_rating,
                 'Product_Price':price})

But the above code returns no values for these fields, because those tags are not in the page. So, I tried to rewrite the code using the below loop.

for i in soup.find('div',{'id':'detailBullets_feature_div'}).find_all('li'):
            print(i.select_one('span', string='''Manufacturer
                                            ‏
                                                :
                                            ‎
                                        '''))

Target Data

Can anyone suggest how to scrape this data?

Upvotes: 0

Views: 172

Answers (1)

HedgeHog

Reputation: 25196

Based on this part of the question:

So, I tried to rewrite the code using the below loop.

To get the value of a single bullet, select it with a CSS selector that combines the pseudo-class :-soup-contains() with the adjacent sibling combinator (+):

soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text
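
Note that select_one() returns None when no bullet matches, so calling .text directly raises an AttributeError on products that lack the bullet. A minimal sketch of a guard follows; the helper name get_bullet and the trimmed-down markup are made up for illustration, and the label is assumed to contain no double quotes:

```python
from bs4 import BeautifulSoup

def get_bullet(soup, label):
    """Return the value next to the first bullet whose label contains `label`, or None."""
    tag = soup.select_one(
        f'#detailBulletsWrapper_feature_div span:-soup-contains("{label}") + span'
    )
    # select_one() gives None when nothing matches, so guard before reading the text
    return tag.get_text(strip=True) if tag else None

# Hypothetical, shortened markup with a single bullet
html = '''<div id="detailBulletsWrapper_feature_div"><ul>
<li><span class="a-list-item"> <span class="a-text-bold">Item model number
 :
</span> <span>SX0687G</span> </span></li>
</ul></div>'''
soup = BeautifulSoup(html, 'html.parser')

print(get_bullet(soup, 'Item model number'))  # SX0687G
print(get_bullet(soup, 'Manufacturer'))       # None
```

This would replace a try/except around each field with a plain None check.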

To get a dict of the bullets and their values, use a dict comprehension, which lets you pick or filter based on the available keys:

{e.text.split('\n')[0]:e.find_next_sibling('span').text for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold')}

Be aware that if there are duplicated bullets, this has to be adjusted to fit your needs, because keys in a dict must be unique; in the comprehension above, a later duplicate overwrites the earlier one.

A possible solution where the first value wins:

details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    if not details.get(e.text.split('\n')[0]):
        details.update({e.text.split('\n')[0]:e.find_next_sibling('span').text} )
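
If every duplicate should be kept instead, one alternative is to collect the values in a list per key. This is only a sketch under assumptions: the markup is a shortened, hypothetical version of the bullets, and the name all_details is made up here.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with a repeated "Manufacturer" bullet
html = '''<div id="detailBulletsWrapper_feature_div"><ul>
<li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
 :
</span> <span>RELAXO FOOTWEARS LIMITED</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
 :
</span> <span>RELAXO FOOTWEARS LIMITED, Delhi - 110085</span> </span></li>
</ul></div>'''
soup = BeautifulSoup(html, 'html.parser')

all_details = defaultdict(list)
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    key = e.text.split('\n')[0].strip()
    value = e.find_next_sibling('span')
    if value is not None:
        # Append instead of overwrite, so repeated bullets are all preserved
        all_details[key].append(value.text.strip())

print(dict(all_details))
```

Each key then maps to a list, so the caller can decide later whether to take the first, the last, or the longest entry.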

Example

from bs4 import BeautifulSoup

html = '''
<div id="detailBulletsWrapper_feature_div" data-feature-name="detailBullets" data-template-name="detailBullets" class="a-section feature detail-bullets-wrapper bucket" data-cel-widget="detailBulletsWrapper_feature_div"> <hr aria-hidden="true" class="a-divider-normal bucketDivider"> <h2>Product details</h2>
    <div id="detailBullets_feature_div">
             <ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">        <li><span class="a-list-item"> <span class="a-text-bold">Product Dimensions
                                    ‏
                                        :
                                    ‎
                                </span> <span>33 x 23 x 12 cm; 600 Grams</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Date First Available
                                    ‏
                                        :
                                    ‎
                                </span> <span>30 June 2021</span> </span></li>                                  <li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
                                    ‏
                                        :
                                    ‎
                                </span> <span>RELAXO FOOTWEARS LIMITED</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">ASIN
                                    ‏
                                        :
                                    ‎
                                </span> <span>B098BC48PZ</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Item model number
                                    ‏
                                        :
                                    ‎
                                </span> <span>SX0687G</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Country of Origin
                                    ‏
                                        :
                                    ‎
                                </span> <span>India</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Department
                                    ‏
                                        :
                                    ‎
                                </span> <span>Mens</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
                                    ‏
                                        :
                                    ‎
                                </span> <span>RELAXO FOOTWEARS LIMITED, RELAXO FOOTWEARS LIMITED, Aggarwal City Square, Plot No 10, Mangalam Palace. District Center, Rohini Sector-3, Delhi - 110085</span> </span></li>          <li><span class="a-list-item"> <span class="a-text-bold">Packer
                                    ‏
                                        :
                                    ‎
                                </span> <span>VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507</span> </span></li>  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    if not details.get(e.text.split('\n')[0]):
        details.update({e.text.split('\n')[0]:e.find_next_sibling('span').text} )

print(soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text)

print(details)

Outputs

RELAXO FOOTWEARS LIMITED

and

{'Product Dimensions': '33 x 23 x 12 cm; 600 Grams',
 'Date First Available': '30 June 2021',
 'Manufacturer': 'RELAXO FOOTWEARS LIMITED',
 'ASIN': 'B098BC48PZ',
 'Item model number': 'SX0687G',
 'Country of Origin': 'India',
 'Department': 'Mens',
 'Packer': 'VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507'}
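
Since the question only asks for one or two values, the dict can be narrowed down afterwards. The sketch below uses a hand-copied subset of the dict printed above; the names wanted and row, and the 'Generic Name' key, are assumptions added only to show what happens when a bullet is missing.

```python
# Hand-copied subset of the details dict; on a live page, build it from the soup instead
details = {
    'Manufacturer': 'RELAXO FOOTWEARS LIMITED',
    'ASIN': 'B098BC48PZ',
    'Item model number': 'SX0687G',
}

# 'Generic Name' is a hypothetical key that this product does not have
wanted = ('Manufacturer', 'Item model number', 'Generic Name')

# dict.get() returns None for absent keys, so no try/except is needed
row = {key: details.get(key) for key in wanted}

print(row)
```

A row built this way could be appended to the data list in the question's loop.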

Upvotes: 2
