Reputation: 964
I am working on a web scraping project. In this project, I am trying to scrape all the product links on a particular Amazon search results page. This process repeats as many times as required to scrape multiple pages from Amazon.
Here is my code so far:
import requests
from bs4 import BeautifulSoup

def scrape_pages(headers, product, num_of_pages):
    product_links = []
    for page in range(1, num_of_pages + 1):
        url = f'https://www.amazon.com/s?k={product}&page={page}&ref=nb_sb_noss'
        print(url)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, features="lxml")
        # Each product title sits inside an <h2 class="a-size-mini"> tag
        data = soup.findAll('h2', attrs={'class': 'a-size-mini'})
        for i in data:
            links = i.findAll('a')
            for a in links:
                product_links.append(f"https://www.amazon.com{a['href']}")
    print('TOTAL NUMBER OF PRODUCT LINKS SCRAPED: ', len(product_links))
    return product_links
In the above code, I am trying to scrape the links inside all h2 tags on a page. I am using a user-agent header to make the scraping possible.
My problem is that this code does not work all the time. Sometimes it scrapes some of the links, and sometimes it does not scrape any links at all.
Each Amazon page lists around 48 products, so if I scrape 5 pages the product_links list should hold somewhere around 240 links. But across multiple tests it is always less than 200, and sometimes it is 0.
I want to know what I am doing wrong.
FYI, this is the user-agent I am using:
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
Upvotes: 0
Views: 84
Reputation: 392
I had the same problem before. You can use Selenium together with BeautifulSoup:
scroll the page to the end using Selenium, and after that use the BeautifulSoup part to parse the whole source of the page. I tried it with Google Play and got the data I expected.
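For reference, here is a minimal sketch of that approach, assuming Chrome and a recent Selenium are installed; the search URL and the h2 selector are copied from the question and may need adjusting for your case.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.amazon.com/s?k=laptop&page=1&ref=nb_sb_noss')

# Scroll to the bottom so any lazily loaded results get rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # give the page a moment to finish loading

# Hand the fully rendered page source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, features="lxml")
links = [f"https://www.amazon.com{a['href']}"
         for h2 in soup.findAll('h2', attrs={'class': 'a-size-mini'})
         for a in h2.findAll('a')]
print(len(links))
driver.quit()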
Upvotes: 1
Reputation: 168
I am not sure about this, but many online retailers such as Amazon deploy anti-bot software across their websites, which might be stopping your crawler. These retailers will shut down any requests they can tell do not come from a legitimate browser. You can use Selenium instead, or put some constraints in your code, such as
time.sleep(1)
to pause your code for a second between requests so that you are not spamming the website.
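As a rough sketch, the pause (plus a basic status-code check, which is my own addition) could be dropped into the loop from the question; headers, product, and num_of_pages are the same variables the question's function takes.

import time
import requests
from bs4 import BeautifulSoup

for page in range(1, num_of_pages + 1):
    url = f'https://www.amazon.com/s?k={product}&page={page}&ref=nb_sb_noss'
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        # Amazon may be throttling; skip rather than parse an error page
        print(f'Got status {response.status_code} for page {page}')
        continue
    soup = BeautifulSoup(response.content, features="lxml")
    # ... parse the product links as before ...
    time.sleep(1)  # pause between requests so we are not spamming the site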
Upvotes: 3