Reputation: 300
In this post, alecxe gives a solution for scraping the Product information/Product details table from Amazon.com. However, the format of that table differs from the one used by many of the newer items listed on Amazon.
The old format, which you can see here, is different from the new format here.
What I tried: the code given by alecxe uses
for li in soup.select('table#productDetailsTable div.content ul li'):
I tried changing this to (and removed everything after it):
for tr in soup.select('table#productDetails_detailBullets_sections1 tbody tr'):
    print text.tr
    print(repr(tr))
to see if I would be able to extract at least something from the product information table. However, nothing printed.
I also tried the find_all() and find() functions, but I was unable to extract what I needed or anything close to it.
My difficulty comes from the structure of the HTML for the new tables. It looks something like:
<table ... >
<tbody>
.
.
.
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Best Sellers Rank
</th>
<td>
<span>
<span>#8,740 in Toys & Games (<a href="/gp/bestsellers/toys-and-games/ref=pd_dp_ts_toys-and-games_1">See Top 100 in Toys & Games</a>)</span>
<br>
<span>#67 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_1_1">Toys & Games</a> > <a href="/gp/bestsellers/toys-and-games/166359011/ref=pd_zg_hrsr_toys-and-games_1_2">Puzzles</a> > <a href="/gp/bestsellers/toys-and-games/166363011/ref=pd_zg_hrsr_toys-and-games_1_3_last">Jigsaw Puzzles</a></span>
<br>
<span>#87 in <a href="/gp/bestsellers/toys-and-games/ref=pd_zg_hrsr_toys-and-games_2_1">Toys & Games</a> > <a href="/gp/bestsellers/toys-and-games/251909011/ref=pd_zg_hrsr_toys-and-games_2_2">Preschool</a> > <a href="/gp/bestsellers/toys-and-games/251910011/ref=pd_zg_hrsr_toys-and-games_2_3">Pre-Kindergarten Toys</a> > <a href="/gp/bestsellers/toys-and-games/251942011/ref=pd_zg_hrsr_toys-and-games_2_4_last">Puzzles</a></span>
<br>
</span>
</td>
</tr>
.
.
.
</tbody>
</table>
If I want to extract just the seller rank for "Toys & Games > Puzzles > Jigsaw Puzzles", how am I supposed to do that? (At least in this case, that is the text in the second inner <span> in the HTML above.)
Upvotes: 0
Views: 272
Reputation: 15376
I could make your code work with two small adjustments:
Remove tbody from the soup.select selector; it's a tag generated by the browser, so it may not be present in the HTML that BeautifulSoup parses.
Use tr.text, not text.tr.
Code:
for tr in soup.select('table#productDetails_detailBullets_sections1 tr'):
    if 'Jigsaw Puzzles' in tr.text:
        print(tr.text.strip())
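The effect of dropping tbody from the selector can be reproduced offline. This is only a sketch: RAW_HTML is a hypothetical, trimmed-down stand-in for the page source (not the real Amazon markup), and it assumes the raw source contains no tbody tag and that the html.parser backend is used, which does not insert one.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
RAW_HTML = """
<table id="productDetails_detailBullets_sections1">
  <tr>
    <th>Best Sellers Rank</th>
    <td><span>
      <span>#8,740 in Toys &amp; Games</span>
      <span>#67 in Toys &amp; Games &gt; Puzzles &gt; Jigsaw Puzzles</span>
    </span></td>
  </tr>
</table>
"""

# html.parser keeps the markup as written and does not add a <tbody>.
soup = BeautifulSoup(RAW_HTML, "html.parser")

# Selector from the question: requires a <tbody> that is not in the source.
with_tbody = soup.select("table#productDetails_detailBullets_sections1 tbody tr")

# Adjusted selector: matches the rows directly under the table.
without_tbody = soup.select("table#productDetails_detailBullets_sections1 tr")

print(len(with_tbody), len(without_tbody))
```

With this sample, the first selector matches no rows, which is consistent with the "nothing printed" symptom in the question, while the second one finds the row.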
Or if you prefer find / find_all:
for tr in soup.find('table', id='productDetails_detailBullets_sections1').find_all('tr'):
    if 'Jigsaw Puzzles' in tr.text:
        for span in tr.find('span').find_all('span'):
            if 'Jigsaw Puzzles' in span.text:
                print(span.text.strip())
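The loop above prints the rank as one string. If the number and the category path are needed separately, something like the following might work. It is a sketch only: parse_rank and RANK_LINE are hypothetical names, and the pattern assumes the text has the "#67 in A > B > C" shape shown in the question's HTML.

```python
import re

# Assumed shape of a rank line: "#<number> in <category> > <category> ..."
RANK_LINE = re.compile(r"#([\d,]+)\s+in\s+(.+)")

def parse_rank(text):
    """Return (rank, [category, ...]), or None if the text has no rank line."""
    match = RANK_LINE.search(text)
    if match is None:
        return None
    # Drop thousands separators such as the comma in "8,740".
    rank = int(match.group(1).replace(",", ""))
    # Split the category path on ">" and trim surrounding whitespace.
    categories = [part.strip() for part in match.group(2).split(">")]
    return rank, categories

sample = "#67 in Toys & Games > Puzzles > Jigsaw Puzzles"
print(parse_rank(sample))
```

In the second snippet, span.text.strip() could be passed to parse_rank instead of being printed. The top-level line ("#8,740 in Toys & Games (See Top 100 in Toys & Games)") would keep its parenthesised suffix in the category name, so it would need extra handling.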
Upvotes: 1