Reputation: 1
I am having trouble with the following code. For every product, I want to extract the title, the URL, the image URL and the product number, and then export the data to an Excel spreadsheet.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://b2b.pmsinternational.com/search/?q=&submit=Search+Product+Name'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('section', attrs={'class': 'products'})
rows = []
columns = ['product title', 'Item Number', 'Product URL','Image URL']
for item in items:
    product_url = item.a['href']
    product_image_url = item.a['src']
    product_title = item.a['title']
    product_number = item.a['ref']
    row = [product_title, product_number, product_url, product_image_url]
    rows.append(row)
df = pd.DataFrame(rows, columns=columns)
df.to_excel('PMS International Products.xlsx', index=False)
print('File Saved')
Error:
Traceback (most recent call last):
  File "C:/Users/hansm/PycharmProjects/Scraping/main.py", line 17, in <module>
    product_image_url = item.a['src']
  File "C:\Users\hansm\PycharmProjects\Scraping\venv\lib\site-packages\bs4\element.py", line 1406, in __getitem__
    return self.attrs[key]
KeyError: 'src'
The key named in the KeyError sometimes changes; it can be href, title, src or ref.
Upvotes: 0
Views: 891
Reputation: 4779
You are getting a KeyError because you are looking up the keys src, title and ref, which are not present on the <a> tags.
- The product URL is in the <a> tag's href attribute. You are getting that right.
- The image URL is in the img tag's src attribute, not in the a tag (there is no src attribute on a tags).
- The product title is inside the span tag with class title, not in the a tag's title attribute (there is no title attribute on a tags).
- The product number is inside the span tag with class ref, not in the a tag (there is no ref attribute on a tags).
Also, your code only gets the data of the first product on the page. I suppose you need the data of all the products.
The code below gets the data of all products.
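import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://b2b.pmsinternational.com/search/?q=&submit=Search+Product+Name'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# The products sit in <ul> lists inside <section class="products">
section_products = soup.find('section', attrs={'class': 'products'})
uls = section_products.find_all('ul')

rows = []
columns = ['product title', 'Item Number', 'Product URL','Image URL']
for ul in uls:
    # Each <li> holds one product
    for item in ul.find_all('li'):
        product_url = item.a['href']
        print(product_url)
        product_image_url = item.img['src']
        print(product_image_url)
        product_title = item.find('span', class_='title').text.strip()
        print(product_title)
        product_number = item.find('span', class_='ref').text.strip()
        print(product_number)
        rows.append([product_title, product_number, product_url, product_image_url])

# Write the collected rows to Excel, as in the question's code
df = pd.DataFrame(rows, columns=columns)
df.to_excel('PMS International Products.xlsx', index=False)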
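To check the selector logic without hitting the site, here is a minimal sketch that runs the same lookups against an inline HTML sample. The markup in SAMPLE_HTML and the helper name extract_rows are assumptions made for illustration, modeled on the structure described above; the real page may differ. Missing elements yield None instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure described in this answer;
# the real page may be laid out differently.
SAMPLE_HTML = """
<section class="products">
  <ul>
    <li>
      <a href="/product/1"><img src="/img/1.jpg" alt=""></a>
      <span class="title"> Toy Car </span>
      <span class="ref">REF-001</span>
    </li>
    <li>
      <a href="/product/2"><img src="/img/2.jpg" alt=""></a>
      <span class="title">Toy Boat</span>
      <span class="ref">REF-002</span>
    </li>
  </ul>
</section>
"""


def extract_rows(html):
    """Return one [title, number, url, image_url] row per <li> product."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.select('section.products ul li'):
        title = item.find('span', class_='title')
        ref = item.find('span', class_='ref')
        rows.append([
            # get_text(strip=True) trims surrounding whitespace
            title.get_text(strip=True) if title else None,
            ref.get_text(strip=True) if ref else None,
            # Tag.get() returns None when the attribute is absent
            item.a.get('href') if item.a else None,
            item.img.get('src') if item.img else None,
        ])
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting rows list has the same column order as the columns list in the question, so it can be passed to pd.DataFrame(rows, columns=columns) unchanged.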
Upvotes: 0
Reputation: 20088
The src attribute is on an img tag nested inside the a tag. You need to find() the img tag first and then access its src attribute.
Instead of:
product_image_url = item.a['src']
use:
product_image_url = item.a.find('img')['src']
The same goes for product_title.
Instead of:
product_title = item.a['title']
use:
product_title = item.a.find('img')["title"]
As for product_number, I don't see a ref attribute on the a tag, so
product_number = item.a['ref']
causes an error.
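If some products may lack an attribute, one possible way to avoid the KeyError is Tag.get(), which returns None for a missing attribute where square brackets raise. The sketch below uses a hypothetical one-line anchor and a hypothetical helper attr_or_none; neither comes from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor for illustration; the real markup may differ.
soup = BeautifulSoup(
    '<a href="/p/1"><img src="/i/1.jpg" title="Toy Car"></a>',
    'html.parser',
)
link = soup.a


def attr_or_none(tag, name):
    """Return the attribute value, or None if the tag or attribute is missing."""
    return tag.get(name) if tag is not None else None


# Square brackets raise KeyError when the attribute is absent on the tag
try:
    link['src']
    raised = False
except KeyError:
    raised = True

href = attr_or_none(link, 'href')                # present on <a>
ref = attr_or_none(link, 'ref')                  # absent, so None
img_src = attr_or_none(link.find('img'), 'src')  # read from the nested <img>
img_title = attr_or_none(link.find('img'), 'title')
print(raised, href, ref, img_src, img_title)
```

Whether a None value should be written to the spreadsheet or the product skipped is a separate choice that depends on how the output will be used.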
Upvotes: 1