Reputation: 23
Could you please help me get the names of the products? You can see the path in the picture; there the product name is Samsung Galaxy...., among others. Here is my code:
import pandas as pd
import requests # Import the library for sending requests to the server
from bs4 import BeautifulSoup # Import the library for webpage parsing
URL='https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
req = requests.get(URL) # GET-request
soup = BeautifulSoup(req.text, 'lxml')
soup.find_all('span', attrs={'class_':'a-size-medium a-color-base a-text-normal'})
I get an empty list. I do not understand why this is the case.
Upvotes: 0
Views: 823
Reputation: 213
Check this; it will work:
from bs4 import BeautifulSoup
import requests
import pandas as pd

products = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
cookies = {'session': '17ab96bd8ffbe8ca58a78657a918558'}

for page in range(1, 5):
    r = requests.get(
        "https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8&page={page}".format(
            page=page
        ),
        headers=headers,
        cookies=cookies,
    )
    soup = BeautifulSoup(r.content, "lxml")
    # Each search result card is marked with data-component-type='s-search-result'
    for d in soup.select(".s-result-item[data-component-type='s-search-result']"):
        name = d.find('h2')  # the product title sits in the card's <h2>
        if name is not None:
            products.append(name.text)
        else:
            products.append("-")

df = pd.DataFrame({'Product Name': products})
print(df)
Output:
Product Name
0 SAMSUNG Galaxy Tab S7 FE 2021 Android Tablet 1...
1 Samsung Galaxy Tab A7 10.4 Wi-Fi 32GB Silver (...
2 Samsung Galaxy Tab A7 10.4 Wi-Fi 32GB Silver (...
3 Samsung Tab A7 Lite 8.7" Gray 32GB (SM-T220NZA...
4 SAMSUNG Galaxy Tab A 8.0-inch Android Tablet 6...
.. ...
83 2020 Samsung Galaxy Tab A7 10.4" (2000x1200) ...
84 Samsung Galaxy Tab S6 Lite 10.4" Touchscreen ...
85 Samsung Galaxy Tab S6 Lite 10.4", 64GB Wi-Fi T...
86 SAMSUNG Galaxy Tab S7 11-inch Android Tablet 1...
87 SAMSUNG Galaxy S20 FE 5G Factory Unlocked Andr...
Upvotes: 1
Reputation: 9377
You should probably adjust your URL for scraping. When I run a curl request using this URL, the returned HTML does not contain the expected <span class="a-size-medium a-color-base a-text-normal">.
curl 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8' | grep "<span class="
It contains only the following spans:
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
You can also test the soup, as HedgeHog commented:
import requests # Import the library for sending requests to the server
from bs4 import BeautifulSoup # Import the library for webpage parsing
url ='https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
response = requests.get(url) # GET-request
soup = BeautifulSoup(response.text, 'html') # adjusted from lxml to html
print(soup) # contains span elements but not expected
elements = soup.find_all('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
print(elements) # empty list, the expected spans were not found in the HTML
You will discover a robot-prevention page, which probably uses a captcha to verify that a human is using a browser:
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
Fun fact: you can copy & paste or write the resulting HTML to a file and open it in your browser (a short sketch follows below). It shows the guard dogs of Amazon. See also: All The Dogs You Can Meet If You're Trying To Get On Amazon But It's Broken.
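For example (a small sketch reusing response from above; blocked.html is just an illustrative file name):

# Write the blocked response to a file so it can be inspected in a browser
with open('blocked.html', 'w', encoding='utf-8') as f:
    f.write(response.text)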
Further research suggested adding 2 headers to the request (which your browser automatically adds, too):

- User-Agent (can simulate a specific browser and OS/platform)
- Accept-Language (is required by most e-commerce pages to localize content)

In requests you can add them as a dictionary like:
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}
response = requests.get(url, headers=HEADERS)
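Putting it together, a sketch under the assumption that these headers get past the robot check (Amazon's class names change frequently, so the selector may need adjusting):

import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}
response = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(response.text, 'lxml')
# Note: when passing attrs as a dict, the key is 'class', not 'class_'
for span in soup.find_all('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'}):
    print(span.get_text())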
Upvotes: 1