user2330624

Reputation: 300

How to scrape the new format for Product information on Amazon.com using BeautifulSoup?

In this post, alecxe gives a solution for scraping the Product information/Product details table on Amazon.com. However, the format of that table differs from the one used for many of the newer items listed on Amazon.

The old format, which you can see here, differs from the new format here.

What I tried: the code given by alecxe uses

for li in soup.select('table#productDetailsTable div.content ul li'):

I tried changing this to (and removed everything after it):

for tr in soup.select('table#productDetails_detailBullets_sections1 tbody tr'):
    print text.tr
    print(repr(tr))

to see if I would be able to extract at least something from the product information table. However, nothing printed.

I also tried the find_all() and find() functions but I was unable to extract what I needed or even close to what I needed.

My difficulty comes from the structure of the HTML for the new tables. It looks something like:

<table ... >
<tbody>
.
.
.    
<tr>
    <th class="a-color-secondary a-size-base prodDetSectionEntry">
        Best Sellers Rank
    </th>
    <td>
         <span>

                <span>#8,740 in Toys &amp; Games (<a href="/gp/bestsellers/toys-and-games/ref=pd_dp_ts_toys-and-games_1">See Top 100 in Toys &amp; Games</a>)</span>
        <br>

                <span>#67 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_1_1">Toys &amp; Games</a> &gt; <a href="/gp/bestsellers/toys-and-games/166359011/ref=pd_zg_hrsr_toys-and-games_1_2">Puzzles</a> &gt; <a href="/gp/bestsellers/toys-and-games/166363011/ref=pd_zg_hrsr_toys-and-games_1_3_last">Jigsaw Puzzles</a></span>
        <br>

                <span>#87 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_2_1">Toys &amp; Games</a> &gt; <a href="/gp/bestsellers/toys-and-games/251909011/ref=pd_zg_hrsr_toys-and-games_2_2">Preschool</a> &gt; <a href="/gp/bestsellers/toys-and-games/251910011/ref=pd_zg_hrsr_toys-and-games_2_3">Pre-Kindergarten Toys</a> &gt; <a href="/gp/bestsellers/toys-and-games/251942011/ref=pd_zg_hrsr_toys-and-games_2_4_last">Puzzles</a></span>
        <br>

        </span>
    </td>
    </tr>
.
. 
.
</tbody>
</table>

If I want to extract just the seller rank for "Toys & Games > Puzzles > Jigsaw Puzzles", how am I supposed to do that? (In the HTML above, at least in this case, that is the text in the second inner <span>.)

Upvotes: 0

Views: 272

Answers (1)

t.m.adam

Reputation: 15376

I could make your code work with two small adjustments:

  1. Remove 'tbody' from the soup.select selector; it's a tag generated by the browser, so it is not in the HTML that BeautifulSoup receives.
  2. Print tr.text, not text.tr.

Code:

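# Select every row of the product details table (no 'tbody' in the selector).
for tr in soup.select('table#productDetails_detailBullets_sections1 tr'):
    # Keep only the row whose text mentions the category of interest.
    if 'Jigsaw Puzzles' in tr.text:
        print(tr.text.strip())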
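
To illustrate the first adjustment, here is a minimal sketch. It assumes the raw page source has no tbody tag and that the built-in html.parser backend is used, which does not add one. The HTML string is a hypothetical, trimmed-down stand-in for the real page, not actual Amazon markup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw page source: note there is no <tbody>,
# which is what the server is assumed to send before a browser adds one.
raw_html = """
<table id="productDetails_detailBullets_sections1">
  <tr>
    <th>Best Sellers Rank</th>
    <td><span>
      <span>#8,740 in Toys &amp; Games</span>
      <span>#67 in Toys &amp; Games &gt; Puzzles &gt; Jigsaw Puzzles</span>
    </span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The selector that includes 'tbody' finds nothing in the raw markup ...
with_tbody = soup.select("table#productDetails_detailBullets_sections1 tbody tr")
# ... while the selector without it finds the row.
without_tbody = soup.select("table#productDetails_detailBullets_sections1 tr")

print(len(with_tbody), len(without_tbody))
```

If you parse with a backend that normalizes tables the way a browser does (html5lib, for example), the 'tbody' selector may match again, so the result depends on the parser you pass to BeautifulSoup.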

Or, if you prefer find / find_all:

# Locate the table by id, then walk its rows.
for tr in soup.find('table', id='productDetails_detailBullets_sections1').find_all('tr'):
    if 'Jigsaw Puzzles' in tr.text:
        # The ranks sit in inner <span> tags nested inside the first <span>.
        for span in tr.find('span').find_all('span'):
            if 'Jigsaw Puzzles' in span.text:
                print(span.text.strip())
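
If you need the rank as a number rather than the whole line, the span text can be split further. This is a rough sketch using only the standard library; parse_rank is a hypothetical helper name, and it assumes the text always has the shape "#<rank> in <category> > <category> ..." as in the HTML from the question:

```python
import re


def parse_rank(text):
    """Split '#67 in A > B > C' into (67, ['A', 'B', 'C']), or return None."""
    # Collapse the whitespace runs left over from the HTML layout.
    text = " ".join(text.split())
    match = re.match(r"#([\d,]+) in (.+)", text)
    if match is None:
        return None
    # Remove thousands separators before converting, e.g. '8,740' -> 8740.
    rank = int(match.group(1).replace(",", ""))
    # Drop a trailing parenthetical such as '(See Top 100 in ...)'.
    path = re.sub(r"\s*\(.*\)$", "", match.group(2))
    categories = [part.strip() for part in path.split(">")]
    return rank, categories


print(parse_rank("#67 in Toys & Games > Puzzles > Jigsaw Puzzles"))
```

You could call it as parse_rank(span.text) inside the loop above and compare the last category with 'Jigsaw Puzzles' instead of doing a substring test.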

Upvotes: 1
