default-303

Reputation: 395

Always getting result as None while web scraping using bs4

I'm new to Python and have just started learning web scraping. I decided to scrape Amazon for the names of the products listed. I opened Chrome DevTools, clicked Inspect on an Amazon product name, and noted the class, which in this case is 'a-link-normal'. The problem is that I get None as the result. Here is the code:

import webbrowser
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.amazon.in/s?k=books&ref=nb_sb_noss')
soup = BeautifulSoup(source.text, 'lxml')

name = soup.find('a', class_ = 'a-link-normal')
print(name)

Here is a screenshot of what I'm inspecting: link to image

I'm new to web scraping and am overwhelmed by the complexity of websites, so please share any advice you have.

Any help is appreciated. Thanks!

Upvotes: 2

Views: 3514

Answers (2)

Andrej Kesely

Reputation: 195603

To get a correct response from the Amazon server, send a User-Agent HTTP header with the request:

import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
source = requests.get('https://www.amazon.in/s?k=books&ref=nb_sb_noss', headers=headers)
soup = BeautifulSoup(source.text, 'lxml')

for a in soup.select('a.a-link-normal > span.a-size-medium'):
    print(a.get_text(strip=True))

Prints:

The Power of Your Subconscious Mind (DELUXE HARDBOUND EDITION)
World’s Greatest Books For Personal Growth & Wealth (Set of 4 Books): Perfect Motivational Gift Set
Ikigai: The Japanese secret to a long and happy life
Attitude Is Everything: Change Your Attitude ... Change Your Life!
World’s Greatest Books For Personal Growth & Wealth (Set of 4 Books): Perfect Motivational Gift Set
The Theory of Everything
The Subtle Art of Not Giving a F*ck
The Alchemist
The Monk Who Sold His Ferrari
The Rudest Book Ever
As a Man Thinketh
How to Stop Worrying and Start Living: Time-Tested Methods for Conquering Worry
Help Hungry Henry Deal with Anger : An Interactive Picture Book About Anger Management
The Girl in Room 105
The Blue Umbrella
Wings of Fire: An Autobiography of Abdul Kalam
My First Library: Boxset of 10 Board Books for Kids
Who Will Cry When You Die?
Rich Dad Poor Dad : What The Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!
Rough Book
The Leader Who Had No Title
The Power Of Influence
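
As a side note on why the original code printed None: `soup.find` returns None when nothing in the parsed HTML matches, which is what happens when the server sends back a robot-check page instead of the search results. Here is a minimal offline sketch of that behaviour. The two HTML strings are made-up stand-ins (not real Amazon markup), `extract_titles` is a hypothetical helper name, and I use `html.parser` only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real Amazon page differs and changes often.
SAMPLE_HTML = """
<div>
  <a class="a-link-normal" href="/b1"><span class="a-size-medium">Book One</span></a>
  <a class="a-link-normal" href="/b2"><span class="a-size-medium">Book Two</span></a>
</div>
"""

# Hypothetical stand-in for a page that contains no product links at all.
BLOCKED_HTML = "<html><body><p>Sorry, we just need to make sure you're not a robot.</p></body></html>"


def extract_titles(html):
    """Return the text of every span.a-size-medium that is a direct child of a.a-link-normal."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.select("a.a-link-normal > span.a-size-medium")]


titles = extract_titles(SAMPLE_HTML)
blocked_titles = extract_titles(BLOCKED_HTML)

# find() on a page without the element gives None rather than raising an error.
first_link = BeautifulSoup(BLOCKED_HTML, "html.parser").find("a", class_="a-link-normal")

print(titles)
print(blocked_titles)
print(first_link)
```

So a None (or an empty list from `select`) usually means "the element is not in the HTML you received", and printing `source.text` is a quick way to see what actually came back.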

Upvotes: 0

Pooria_T

Reputation: 156

It seems that Amazon blocks crawling. I checked, and when you run the code for the first time, the content can be extracted. When the code is run again immediately afterwards, the request is blocked. If you print the soup variable, you will see the notification below:

To discuss automated access to Amazon data please contact [email protected]. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detai /main.html/ref=rm_c_ac for advertising use cases.

Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
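
If you want to notice this page in code rather than by eye, one option is to look for its telltale phrases before parsing. This is only a rough sketch: the function name `looks_blocked` and the marker list are my own assumptions, copied from the message quoted above, and Amazon may change the wording at any time.

```python
# Marker phrases taken from the notification quoted above.
# Assumption: the wording is not guaranteed to stay the same.
BLOCK_MARKERS = (
    "make sure you're not a robot",
    "automated access to amazon data",
)


def looks_blocked(html: str) -> bool:
    """Return True if the HTML contains one of the known block-page phrases."""
    # Lowercase the text and normalise curly apostrophes so the comparison is forgiving.
    lowered = html.lower().replace("\u2019", "'")
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked("<p>Sorry, we just need to make sure you're not a robot.</p>"))
print(looks_blocked("<span class='a-size-medium'>The Alchemist</span>"))
```

You could call it on `source.text` right after the request and skip parsing when it returns True.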

I recommend using the Selenium library instead, with some delays in your code to mimic human interaction.

However, if you run the code below only once every several minutes, you can extract the book titles:

import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.amazon.in/s?k=books&ref=nb_sb_noss')
soup = BeautifulSoup(source.content, 'html.parser')
# print(soup)  # uncomment to check whether the robot-check page came back

names = soup.find_all('span', class_="a-size-medium a-color-base a-text-normal")
for name in names:
    print(name.text)
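
The "wait a few minutes and try again" idea can also be written as a small retry loop. This is just a sketch under my own assumptions: `fetch_with_retries`, `fetch`, and `is_blocked` are hypothetical names, the 120 second default is a guess rather than a known limit, and the `sleep` parameter exists only so the loop can be tried out without actually waiting.

```python
import time


def fetch_with_retries(fetch, is_blocked, attempts=3, delay_seconds=120.0, sleep=time.sleep):
    """Call fetch() until is_blocked(html) is False or the attempts run out.

    fetch:      callable with no arguments that returns the page HTML as a string.
    is_blocked: callable that takes the HTML and returns True for a block page.
    Returns the first unblocked HTML, or None if every attempt was blocked.
    """
    for attempt in range(attempts):
        html = fetch()
        if not is_blocked(html):
            return html
        # Wait before the next try, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None


# Offline demonstration with canned pages and a fake sleep that records the waits.
pages = iter(["robot check", "robot check", "<span>Book</span>"])
waits = []
result = fetch_with_retries(
    fetch=lambda: next(pages),
    is_blocked=lambda html: "robot" in html,
    attempts=3,
    delay_seconds=5,
    sleep=waits.append,
)
print(result)
print(waits)
```

With the real site, `fetch` could be something like `lambda: requests.get(url).text`, with the default `time.sleep` left in place.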

Upvotes: 2
