Reputation: 137
I am trying to scrape this url: https://www.amazon.in/Sparx-SM-687-Forest-Golden-SX0687GGFGO_0009/dp/B098BC48PZ/ref=sr_1_73?keywords=mens%2Bshoes&qid=1674804026&sr=8-73&th=1&psc=1
I need to get a few details about the product. The screenshot below shows the target: I want the Manufacturer and the Item model number from the product details section.
I tried several options from these questions:
Extracting a specific list item using Beautiful Soup
BeautifulSoup - search by text inside a tag
I wrote the code below, but it gives me all the values present under the ul > li tags.
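import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
from tqdm import tqdm

# Browser-like user agent; must be defined before the first request uses it.
header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
r3 = requests.get('https://www.amazon.in/Sparx-SM-687-Forest-Golden-SX0687GGFGO_0009/dp/B098B9JVBK/ref=sr_1_73?keywords=mens+shoes&qid=1674804026&sr=8-73',headers=header)
soup = BeautifulSoup(r3.text,'lxml')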
Code for getting all the URLs of the products.
prod_urls_kitchen = []

def get_prod_url(cat):
    # Walk result pages 1 to 7 of the search for this category.
    for page in range(1,8):
        url = f'https://www.amazon.in/s?k={cat}'
        r = requests.get(url+f'&page={page}', headers=header)
        soup3 = BeautifulSoup(r.text, 'lxml')
        # Kitchen result titles use a 4-line clamp class, other categories a 2-line one.
        if 'kitchen' in url:
            for i in soup3.find_all('h2',{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-4'}):
                for j in i.find_all('a'):
                    prod_urls_kitchen.append(j['href'])
        else:
            for i in soup3.find_all('h2',{'class':'a-size-mini a-spacing-none a-color-base s-line-clamp-2'}):
                for j in i.find_all('a'):
                    prod_urls_kitchen.append(j['href'])
    # Return the collected links, not the function itself.
    return prod_urls_kitchen

get_prod_url("men's+shoes")
This gives me the URLs of all the products found on the result pages of the given category. Code for getting the required information from each product page:
data = []
for i in tqdm(prod_urls_kitchen[0:50]):
    header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    url = 'https://www.amazon.in'+i
    r = requests.get(url,headers=header)
    soup = BeautifulSoup(r.text,'lxml')
    try:
        prod_name = soup.find('span',{'id':'productTitle'}).text.strip()
    except Exception:
        prod_name = None
    try:
        rating = soup.find('span',{'id':'acrPopover'}).text.strip().replace(' out of 5 stars','')
    except Exception:
        rating = None
    try:
        color = soup.find(lambda tag: len(tag.find_all()) == 0 and " Colour: " in tag.text).find_next('span').text.strip()
    except Exception:
        color = None
    # find() returns None when the table is missing, so no try/except is needed here.
    table = soup.find('table', {'class':'a-normal a-spacing-micro'})
    if table is not None:
        try:
            model = table.find('tr',{'class':'a-spacing-small po-model_name'}).find('td',{'class':'a-span9'}).text.strip()
        except Exception:
            model = None
        try:
            brand = table.find('tr',{'class':'a-spacing-small po-brand'}).find('td',{'class':'a-span9'}).text.strip()
        except Exception:
            brand = None
    else:
        model = None
        brand = None
    # Default first, so an empty list cannot leave the name unset.
    category = None
    try:
        # Each pass overwrites the value, so the last list item is kept.
        for li in soup.find('ul',{'class':'a-unordered-list a-horizontal a-size-small'}).find_all('li'):
            category = li.find('span',{'class':'a-list-item'}).text.strip()
    except Exception:
        category = None
    try:
        asin = soup.find('div',{'id':'title_feature_div'})['data-csa-c-asin']
    except Exception:
        asin = None
    try:
        img_url = soup.find('div',{'id':'imgTagWrapperId'}).find('img')['src']
    except Exception:
        img_url = None
    try:
        price = soup.find('span',{'class':'a-price-whole'}).text.replace('.','')
    except Exception:
        price = None
    try:
        total_rating = soup.find('span',{'id':'acrCustomerReviewText'}).text.strip().replace(' ratings','')
    except Exception:
        total_rating = None
    desc = None
    try:
        # As with category, only the last list item survives the loop.
        for li in soup.find('ul',{'class':'a-unordered-list a-vertical a-spacing-mini'}).find_all('li'):
            desc = li.find('span',{'class':'a-list-item'}).text
    except Exception:
        desc = None
    data.append({'Product_Name':prod_name,
                 'Product_Review':rating,
                 'Product_Color':color,
                 'Product_model':model,
                 'Product_brand':brand,
                 'Product_Category':category,
                 'Product_Description':desc,
                 'Asin_ID':asin,
                 'Product_img_URL':img_url,
                 'Product_Total_Rating':total_rating,
                 'Product_Price':price})
But the above code does not return the Manufacturer and Item model number, because those tags are not on this page. So, I tried to rewrite the code using the below loop.
for i in soup.find('div',{'id':'detailBullets_feature_div'}).find_all('li'):
    print(i.select_one('span', string='''Manufacturer
:
'''))
Can anyone suggest how to scrape this data?
Upvotes: 0
Views: 172
Reputation: 25196
Based on this part of the question:
So, I tried to rewrite the code using the below loop.
To get a single bullet value, select it with a CSS selector that combines the pseudo-class :-soup-contains() with the adjacent sibling combinator (+):
soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text
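On pages where a bullet is missing, `select_one` returns `None` and the `.text` access raises `AttributeError`. A minimal sketch of a guarded lookup follows; `get_bullet` and the trimmed `SAMPLE_HTML` are hypothetical names introduced here for illustration, and the markup is assumed to match the product details block of the page in question.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample of the "Product details" markup.
SAMPLE_HTML = """
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span class="a-list-item"><span class="a-text-bold">Manufacturer
    :
    </span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
    <li><span class="a-list-item"><span class="a-text-bold">Item model number
    :
    </span> <span>SX0687G</span></span></li>
  </ul>
</div>
"""


def get_bullet(soup, label):
    """Return the value next to the bold label, or None if it is absent."""
    selector = (
        '#detailBulletsWrapper_feature_div '
        f'span.a-text-bold:-soup-contains("{label}") + span'
    )
    node = soup.select_one(selector)
    # select_one gives None when nothing matches, so check before reading text.
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
manufacturer = get_bullet(soup, "Manufacturer")
model_number = get_bullet(soup, "Item model number")
missing = get_bullet(soup, "Packer")  # not present in this sample

print(manufacturer, model_number, missing)
```

Restricting the left-hand side to `span.a-text-bold` is a design choice in this sketch: it keeps the wrapping `a-list-item` span, whose text also contains the label, from being considered.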
To get a dict of the bullets and their values, use a dict comprehension, which lets you pick or filter based on the available keys:
{e.text.split('\n')[0]:e.find_next_sibling('span').text for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold')}
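The labels on a live page may carry trailing whitespace, a colon, or invisible direction marks that would leak into the dict keys. That is an assumption about the served HTML, not something the sample in this answer shows. A hedged sketch of normalising the keys, with `clean_label` as a hypothetical helper and the escaped marks added to the sample on purpose:

```python
from bs4 import BeautifulSoup

# Hypothetical sample; \u200f and \u200e are direction marks inserted
# deliberately to demonstrate the clean-up.
SAMPLE_HTML = """
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span class="a-list-item"><span class="a-text-bold">Manufacturer \u200f
    :
    \u200e</span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
    <li><span class="a-list-item"><span class="a-text-bold">Item model number \u200f
    :
    \u200e</span> <span>SX0687G</span></span></li>
  </ul>
</div>
"""


def clean_label(text):
    """Strip direction marks, the trailing colon, and surrounding whitespace."""
    for mark in ("\u200e", "\u200f"):
        text = text.replace(mark, "")
    return text.strip().rstrip(":").strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
details = {
    clean_label(e.get_text()): e.find_next_sibling("span").get_text(strip=True)
    for e in soup.select("#detailBulletsWrapper_feature_div li .a-text-bold")
}

print(details)
```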
Be aware that dict keys must be unique: if there are duplicated bullets (Manufacturer appears twice on this page), a later value overwrites an earlier one in the comprehension, so adjust it in whatever way fits your needs best.
A possible solution where the first value wins:
details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    if not details.get(e.text.split('\n')[0]):
        details.update({e.text.split('\n')[0]:e.find_next_sibling('span').text})
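If every value of a repeated bullet should be kept rather than only the first, one alternative is to collect the values in lists. This is a sketch of that variant, not part of the approach above; the trimmed `SAMPLE_HTML` and the shortened address in it are hypothetical.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample with the "Manufacturer" bullet appearing twice.
SAMPLE_HTML = """
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span class="a-list-item"><span class="a-text-bold">Manufacturer
    :
    </span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
    <li><span class="a-list-item"><span class="a-text-bold">ASIN
    :
    </span> <span>B098BC48PZ</span></span></li>
    <li><span class="a-list-item"><span class="a-text-bold">Manufacturer
    :
    </span> <span>RELAXO FOOTWEARS LIMITED, Delhi - 110085</span></span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Map each label to the list of all values found for it, in page order.
all_details = defaultdict(list)
for e in soup.select("#detailBulletsWrapper_feature_div li .a-text-bold"):
    key = e.get_text().split("\n")[0].strip()
    value = e.find_next_sibling("span")
    if value is not None:
        all_details[key].append(value.get_text(strip=True))

print(dict(all_details))
```

Taking `all_details[key][0]` gives the same first-value-wins result, while the full list stays available when the longer address is wanted.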
from bs4 import BeautifulSoup
html = '''
<div id="detailBulletsWrapper_feature_div" data-feature-name="detailBullets" data-template-name="detailBullets" class="a-section feature detail-bullets-wrapper bucket" data-cel-widget="detailBulletsWrapper_feature_div"> <hr aria-hidden="true" class="a-divider-normal bucketDivider"> <h2>Product details</h2>
<div id="detailBullets_feature_div">
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list"> <li><span class="a-list-item"> <span class="a-text-bold">Product Dimensions
:
</span> <span>33 x 23 x 12 cm; 600 Grams</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Date First Available
:
</span> <span>30 June 2021</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">ASIN
:
</span> <span>B098BC48PZ</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Item model number
:
</span> <span>SX0687G</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Country of Origin
:
</span> <span>India</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Department
:
</span> <span>Mens</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Manufacturer
:
</span> <span>RELAXO FOOTWEARS LIMITED, RELAXO FOOTWEARS LIMITED, Aggarwal City Square, Plot No 10, Mangalam Palace. District Center, Rohini Sector-3, Delhi - 110085</span> </span></li> <li><span class="a-list-item"> <span class="a-text-bold">Packer
:
</span> <span>VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507</span> </span></li> </div>
</div>
'''
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')
details = {}
for e in soup.select('#detailBulletsWrapper_feature_div li .a-text-bold'):
    # First value wins: only store a key that has not been seen yet.
    if not details.get(e.text.split('\n')[0]):
        details.update({e.text.split('\n')[0]:e.find_next_sibling('span').text})
print(soup.select_one('#detailBulletsWrapper_feature_div span:-soup-contains("Manufacturer") + span').text)
print(details)
RELAXO FOOTWEARS LIMITED
and
{'Product Dimensions': '33 x 23 x 12 cm; 600 Grams',
'Date First Available': '30 June 2021',
'Manufacturer': 'RELAXO FOOTWEARS LIMITED',
'ASIN': 'B098BC48PZ',
'Item model number': 'SX0687G',
'Country of Origin': 'India',
'Department': 'Mens',
'Packer': 'VIRAJ ENTERPRISES, Killa No. 31/18/1/2(2-4), Surya Nagar, Gali No. 1, Near Parle Factory, Jhajjar, Bahadurgarh, 124507'}
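To feed the two requested fields into a per-product record like the one built in the question, the parsed dict can be read with `dict.get`, which yields `None` when a bullet is absent. The sketch below makes some assumptions: `parse_detail_bullets`, `page_html`, and the record keys `Product_Manufacturer` and `Product_Item_Model` are hypothetical names, and `page_html` stands in for the `r.text` of a real response.

```python
from bs4 import BeautifulSoup


def parse_detail_bullets(html):
    """Return {label: value} for the detail bullets; first value wins."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for e in soup.select("#detailBulletsWrapper_feature_div li .a-text-bold"):
        value = e.find_next_sibling("span")
        if value is None:
            continue  # label without a value span, skip it
        key = e.get_text().split("\n")[0].strip()
        # setdefault only stores the key the first time it is seen.
        details.setdefault(key, value.get_text(strip=True))
    return details


# Stand-in for r.text from the product page request.
page_html = """
<div id="detailBulletsWrapper_feature_div">
  <ul>
    <li><span class="a-list-item"><span class="a-text-bold">Manufacturer
    :
    </span> <span>RELAXO FOOTWEARS LIMITED</span></span></li>
    <li><span class="a-list-item"><span class="a-text-bold">Item model number
    :
    </span> <span>SX0687G</span></span></li>
  </ul>
</div>
"""

details = parse_detail_bullets(page_html)
row = {
    "Product_Manufacturer": details.get("Manufacturer"),
    "Product_Item_Model": details.get("Item model number"),
}
print(row)
```

A page without the details block simply produces an empty dict, so both fields come back as `None` and no try/except is needed around the lookup.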
Upvotes: 2