Andrea Ventura

Reputation: 51

Scraping ASIN numbers from an Amazon page using Python

I would like to scrape all the ASIN numbers from an Amazon page. I need that list so that I can then scrape each ASIN I obtain.

I tried the code below, but it returns only 3 ASINs.

I think my regular expression is wrong.

This is my code:

import requests

###Amazon URL
urls = ['https://www.amazon.it/gp/bestsellers/apparel/', 'https://www.amazon.it/gp/bestsellers/electronics/', 'https://www.amazon.it/gp/bestsellers/books/']

htmltexts = []
for url in urls:
    req = requests.get(url).content
    htmltexts.append(req)

import re
for htmltext in htmltexts:
    text = str(htmltext)
    pattern = re.compile(r"/.*/dp/(.*?)\"")
    s = re.findall(pattern, text)
    print (s)

I expect at least 20 results from each page. The program is built for 3 Amazon pages, so I need at least 60 results.

Upvotes: 0

Views: 858

Answers (1)

Dainis

Reputation: 102

The problem is the /.*/ part of /.*/dp/(.*?)\". The .* is greedy and matches any characters, including /, so it swallows everything from the first / up to the last /dp/ in the text. In your case it consumes most of the response, which is why you get only one match per page.
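To see how much the /.*/ part consumes, here is a minimal sketch run on a made-up HTML fragment. The links and ASIN values are invented for illustration and real Amazon markup will differ:

```python
import re

# Hypothetical fragment, invented for illustration; real Amazon markup differs.
sample = (
    '<a href="/Some-Product/dp/B000000001?ref=x">one</a>'
    '<a href="/Other-Product/dp/B000000002?ref=y">two</a>'
)

# The pattern from the question: the greedy ".*" also matches "/",
# so it runs from the first "/" up to the LAST "/dp/" in the text.
greedy_matches = re.findall(r"/.*/dp/(.*?)\"", sample)
print(greedy_matches)  # ['B000000002?ref=y']
```

Only the last link survives, even though the fragment contains two product links.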

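Try this regex instead: /[^/]+/dp/([^"?]+). Here [^/]+ cannot cross a slash, and [^"?]+ stops the capture at the closing quote or at the start of the query string. The code below gets 50 ASINs from each page: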

import requests
import re

urls = [
    'https://www.amazon.it/gp/bestsellers/apparel/',
    'https://www.amazon.it/gp/bestsellers/electronics/',
    'https://www.amazon.it/gp/bestsellers/books/'
]

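for url in urls:
    # Download the page and decode the raw bytes into text (UTF-8 by default)
    content = requests.get(url).content
    decoded_content = content.decode()

    # Capture whatever follows /dp/ up to a quote or "?"; the set removes duplicates
    asins = set(re.findall(r'/[^/]+/dp/([^"?]+)', decoded_content))
    print(asins)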
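If you need the ASINs in page order rather than as an unordered set, a stricter variant could look like the sketch below. It assumes an ASIN is exactly 10 uppercase letters or digits directly after /dp/. The helper name extract_asins and the sample fragment are hypothetical, and the sketch runs on a string rather than a live request:

```python
import re

# Assumption: an ASIN is exactly 10 uppercase letters/digits right after "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?![A-Z0-9])")


def extract_asins(html):
    """Return the unique ASINs found in html, in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(ASIN_PATTERN.findall(html)))


# Hypothetical fragment, invented for illustration; real Amazon markup differs.
sample = (
    '<a href="/Some-Product/dp/B000000001?ref=x">one</a>'
    '<a href="/Some-Product/dp/B000000001/ref=z">one again</a>'
    '<a href="/Other-Product/dp/B000000002">two</a>'
)

print(extract_asins(sample))  # ['B000000001', 'B000000002']
```

The returned list can then be looped over to scrape each product page in turn, e.g. for asin in extract_asins(decoded_content).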

Upvotes: 1
