Reputation: 13
I am trying to parse an Amazon search results page. I want to access the data contained in the <li> tags with id="result_0", id="result_1", id="result_2", etc. The find_all('li') function only returns 4 results (up to result_3), which I thought was odd, since when I view the page in my browser I see 12 results. When I print parsed_html, I see it contains everything up to result_23. Why isn't find_all returning all 24 objects? A snippet of my code is below.
import requests

try:
    from BeautifulSoup import BeautifulSoup as bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Dstripbooks&field-keywords=data+analytics'
response = requests.get(search_url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
parsed_html = bsoup(response.text)

results_tags = parsed_html.find_all('div', attrs={'id': 'atfResults'})
results_html = bsoup(str(results_tags[0]))
results_html.find_all('li')
For what it's worth, the results_tags object also contains only the 4 results, which is why I think the issue is in the find_all step rather than with the BeautifulSoup object itself. If anyone can help me figure out what is happening here and how I can access all of the search results on this page, I would really appreciate it!
Upvotes: 1
Views: 1109
Reputation: 12168
import requests, re

try:
    from BeautifulSoup import BeautifulSoup as bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

# delete the irrelevant part from the URL
search_url = 'https://www.amazon.com/s/?url=search-%20alias%3Dstripbooks&field-keywords=data+analytics'
response = requests.get(search_url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"})  # add an 'Accept' header
parsed_html = bsoup(response.text, 'lxml')
lis = parsed_html.find_all('li', class_='s-result-item')  # use the class to find the li tags
len(lis)
out:
25
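Since the question asked about the id attributes specifically (result_0, result_1, ...), the re module imported above can also be used to match those ids directly. This is a minimal sketch, assuming the page returned with the headers above still uses that id pattern:
result_lis = parsed_html.find_all('li', id=re.compile(r'^result_\d+$'))  # ids like result_0, result_1, ...
for li in result_lis:
    print(li.get('id'))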
Upvotes: 1
Reputation: 1351
You can access the li elements directly through their class instead of their id. The following will print the text from each li element.
results_tags = parsed_html.find_all('li', attrs={'class': 's-result-item'})
for r in results_tags:
    print(r.text)
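If you only want the product titles rather than the full text of each result, something like the sketch below could work. It assumes each result keeps its title in an h2 element, which may not hold for every item (e.g. sponsored or placeholder entries), hence the None check:
for r in results_tags:
    h2 = r.find('h2')  # assumption: the title sits in an <h2> inside each result item
    if h2 is not None:
        print(h2.get_text(strip=True))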
Upvotes: 0