DownstairsPanda

Reputation: 125

Using BeautifulSoup to open product pages in different tabs for an inputted search result on Amazon

I'm quite new to Python and very new to web scraping. I'm currently following along with Al Sweigart's book Automate the Boring Stuff with Python, and one of the suggested practice assignments is basically to make a program that does this:

Here's my code:

    #! python3
    # Searches Amazon for the inputted product (either through the command line or input)
    # and opens 5 tabs with the top items for that search.

    import requests, sys, bs4, webbrowser
    if len(sys.argv) > 1: # if there are system arguments
        res = requests.get('https://www.amazon.com/s?k=' + ''.join(sys.argv))
        res.raise_for_status
    else: # take input
        print('what product would you like to search Amazon for?')
        product = str(input())
        res = requests.get('https://www.amazon.com/s?k=' + ''.join(product))
        res.raise_for_status
    
    # retrieve top search links:
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    
    print(res.text) # TO CHECK HTML OF SITE, GET RID OF DURING ACTUAL PROGRAM
    # open a new tab for the top 5 items, and get the css selector for links 
    # a list of all things on the downloaded page that are within the css selector 'a-link-normal a-text-normal'
    linkElems = soup.select('a-link-normal a-text-normal') 
    
    numOpen = min(5, len(linkElems))
    for i in range(numOpen):
        urlToOpen = 'https://www.amazon.com/' + linkElems[i].get('href')
        print('Opening', urlToOpen)
        webbrowser.open(urlToOpen)

I think I've selected the correct CSS selector ("a-link-normal a-text-normal"), so I think the problem is with res.text. When I print it, the HTML does not seem to be complete and does not match what I see for the same page using Inspect Element in Chrome. Additionally, none of that HTML contains "a-link-normal a-text-normal" anywhere.

As a sample, this is what res.text looks like for a search for 'big pencil':

what product would you like to search Amazon for?
big pencil
<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <title>Sorry! Something went wrong!</title>
  <style>
  html, body {
    padding: 0;
    margin: 0
  }

  img {
    border: 0
  }

  #a {
    background: #232f3e;
    padding: 11px 11px 11px 192px
  }

  #b {
    position: absolute;
    left: 22px;
    top: 12px
  }

  #c {
    position: relative;
    max-width: 800px;
    padding: 0 40px 0 0
  }

  #e, #f {
    height: 35px;
    border: 0;
    font-size: 1em
  }

  #e {
    width: 100%;
    margin: 0;
    padding: 0 10px;
    border-radius: 4px 0 0 4px
  }

  #f {
    cursor: pointer;
    background: #febd69;
    font-weight: bold;
    border-radius: 0 4px 4px 0;
    -webkit-appearance: none;
    position: absolute;
    top: 0;
    right: 0;
    padding: 0 12px
  }

  @media (max-width: 500px) {
    #a {
      padding: 55px 10px 10px
    }

    #b {
      left: 6px
    }
  }

  #g {
    text-align: center;
    margin: 30px 0
  }

  #g img {
    max-width: 90%
  }

  #d {
    display: none
  }

  #d[src] {
    display: inline
  }
  </style>
</head>
<body>
    <a href="/ref=cs_503_logo"><img id="b" src="https://images-na.ssl-images-amazon.com/images/G/01/error/logo._TTD_.png" alt="Amazon.com"></a>
    <form id="a" accept-charset="utf-8" action="/s" method="GET" role="search">
        <div id="c">
            <input id="e" name="field-keywords" placeholder="Search">
            <input name="ref" type="hidden" value="cs_503_search">
            <input id="f" type="submit" value="Go">
        </div>
    </form>
<div id="g">
  <div><a href="/ref=cs_503_link"><img src="https://images-na.ssl-images-amazon.com/images/G/01/error/500_503.png"
                                        alt="Sorry! Something went wrong on our end. Please go back and try again or go to Amazon's home page."></a>
  </div>
  <a href="/dogsofamazon/ref=cs_503_d" target="_blank" rel="noopener noreferrer"><img id="d" alt="Dogs of Amazon"></a>
  <script>document.getElementById("d").src = "https://images-na.ssl-images-amazon.com/images/G/01/error/" + (Math.floor(Math.random() * 43) + 1) + "._TTD_.jpg";</script>
</div>
</body>
</html>

Thank you so much for your patience.

Upvotes: 0

Views: 1563

Answers (1)

This is a classic case where you won't find anything if you fetch the page directly with requests and parse it with BeautifulSoup.

The site first sends your browser an initial chunk of HTML, like the one you posted for 'big pencil', and JavaScript then loads the rest of the elements on the page.

You'll need to use Selenium WebDriver to load the page first and then fetch the HTML from the browser. This is equivalent to opening your browser's developer tools, going to the Elements tab, and looking for the classes you mentioned.

To see the difference, view the page source and compare it with the HTML shown in the Elements tab.
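The same comparison can be done in code. Below is a minimal sketch, using only the standard library, that reports the page title and which of the expected class names are missing from a chunk of HTML. The names `ClassCollector` and `inspect_html` are made up for this illustration, and the class names are the ones from the question, which may not match Amazon's current markup.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name and the <title> text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.classes = set()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        for name, value in attrs:
            # A class attribute may hold several space-separated class names
            if name == "class" and value:
                self.classes.update(value.split())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def inspect_html(html, wanted_classes):
    """Return (title, missing), where missing lists the wanted classes not found."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    missing = [c for c in wanted_classes if c not in collector.classes]
    return collector.title.strip(), missing
```

Calling `inspect_html(res.text, ['a-link-normal', 'a-text-normal'])` on the output in the question would report the title "Sorry! Something went wrong!" and both classes as missing. Running it on the HTML taken from the browser instead shows whether the classes are present there.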

Here, you pass the HTML that the browser has loaded to BS4, like this:

    import bs4
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    url = 'https://www.amazon.com/s?k=big+pencil'  # example search URL

    # Chromedriver opens a new browser instance for you; see the Selenium docs for setup
    browser = webdriver.Chrome(service=Service('path_to_chromedriver'))
    browser.get(url)  # load the URL in the browser

    # Hand the HTML loaded in the browser to BS4 and go on extracting elements
    soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
    browser.quit()

This is very basic code for getting started with Selenium. In a production use case, you may want to run the browser headless, for example with headless Chrome; PhantomJS was once a common choice but is no longer maintained.
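Two smaller details are worth getting right once the page loads: building the search URL and turning the scraped `href` values into absolute links. The following is a minimal sketch using only `urllib.parse`; the helper names `build_search_url` and `absolute_links` and the sample paths in the comments are made up for illustration.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.amazon.com/"


def build_search_url(words):
    """Join the search words with spaces and URL-encode them as the k parameter."""
    query = " ".join(words)
    return urljoin(BASE_URL, "s") + "?" + urlencode({"k": query})


def absolute_links(hrefs, limit=5):
    """Turn the first `limit` hrefs (e.g. '/dp/EXAMPLE') into absolute URLs."""
    # urljoin avoids the double slash that plain string concatenation can produce
    return [urljoin(BASE_URL, href) for href in hrefs[:limit]]
```

From the command line you would pass `sys.argv[1:]` to `build_search_url`, because `sys.argv[0]` is the script name. The `hrefs` list could come from something like `[a.get('href') for a in soup.select('a.a-link-normal.a-text-normal')]`. A CSS class selector needs the leading dots; without them, `'a-link-normal a-text-normal'` looks for a tag named `a-text-normal` inside a tag named `a-link-normal`. Whether those classes still appear on Amazon's result pages is an assumption carried over from the question.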


Upvotes: 0
