dmzavelsky

Reputation: 13

Why is Beautifulsoup find_all not returning complete results?

I am trying to parse an Amazon search results page. I want to access the data contained in the <li> tags with id="result_0", id="result_1", id="result_2", and so on. Calling find_all('li') returns only 4 results (up to result_3), which I thought was odd, since when I view the page in my browser I see 12 results.

When I print parsed_html, I see it contains all the way to result_23. Why isn't find_all returning all 24 objects? A snippet of my code is below.

import requests

try:
    from BeautifulSoup import BeautifulSoup as bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = ('https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-'
              'alias%3Dstripbooks&field-keywords=data+analytics')
response = requests.get(search_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
parsed_html = bsoup(response.text)
results_tags = parsed_html.find_all('div', attrs={'id': 'atfResults'})
results_html = bsoup(str(results_tags[0]))
results_html.find_all('li')

For what it's worth, the results_tags object also only contains the 4 results. Which is why I am thinking the issue is in the find_all step, rather than with the BeautifulSoup object.

If anyone can help me figure out what is happening here and how I can access all of the search results on this webpage, I will really appreciate it!!
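(As an aside on the code above: a BeautifulSoup Tag already supports find_all, so the str()/re-parse round-trip is unnecessary. A minimal sketch with a stand-in HTML snippet — the markup here is invented, not Amazon's actual page:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the real results page.
html = """
<div id="atfResults">
  <ul>
    <li id="result_0">First</li>
    <li id="result_1">Second</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', id='atfResults')
items = container.find_all('li')  # call find_all on the Tag directly
print([li['id'] for li in items])  # ['result_0', 'result_1']
```

This avoids one full parsing pass and keeps the original document's parse tree intact.)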

Upvotes: 1

Views: 1109

Answers (2)

宏杰李

Reputation: 12168

import requests, re

try:
    from BeautifulSoup import BeautifulSoup as bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = 'https://www.amazon.com/s/?url=search-%20alias%3Dstripbooks&field-keywords=data+analytics'  # delete the irrelevant part from the URL
response = requests.get(search_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"})  # add 'Accept' header
parsed_html = bsoup(response.text, 'lxml')
lis = parsed_html.find_all('li', class_='s-result-item')  # use the class to find li tags
len(lis)

out:

25
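The class_ filter used above matches any element whose class attribute contains that token, which is why it picks up every result item regardless of its id. A small self-contained illustration (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

# Invented markup: each result carries the s-result-item class
# alongside other classes, as on the real page.
html = """
<ul>
  <li id="result_0" class="s-result-item celwidget">A</li>
  <li id="result_1" class="s-result-item celwidget">B</li>
  <li class="s-pagination-item">Next</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
lis = soup.find_all('li', class_='s-result-item')
print(len(lis))  # 2: class_ matches a single class token, not the whole attribute
```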

Upvotes: 1

Matts

Reputation: 1351

You can access the li elements directly through their class instead of their id. This will print the text from each li element.

results_tags = parsed_html.find_all('li',attrs={'class':'s-result-item'})
for r in results_tags:
    print(r.text)

Upvotes: 0
