Reputation: 51
I would like to scrape all the ASINs from an Amazon page. I need that list so that I can then scrape each product by its ASIN.
I tried the code below, but I only get 3 ASINs as a result.
I think my regular expression is wrong.
This is my code:
import re
import requests

# Amazon URLs
urls = ['https://www.amazon.it/gp/bestsellers/apparel/', 'https://www.amazon.it/gp/bestsellers/electronics/', 'https://www.amazon.it/gp/bestsellers/books/']
htmltexts = []
for url in urls:
    req = requests.get(url).content
    htmltexts.append(req)

for htmltext in htmltexts:
    text = str(htmltext)
    pattern = re.compile(r"/.*/dp/(.*?)\"")
    s = re.findall(pattern, text)
    print(s)
I expect at least 20 results from every page. The program is built for 3 Amazon pages, so I need at least 60 results.
Upvotes: 0
Views: 858
Reputation: 102
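The issue with the regex is that the /.*/ part of /.*/dp/(.*?)\" is greedy: .* can match any run of characters, including further slashes, between the first / and the last /dp/. In your case it swallows most of the response, which is likely why you get only one match per page.
Try the following regex instead. It limits the path segment before /dp/ to non-slash characters and stops the ASIN at a quote or a question mark:
/[^/]+/dp/([^"?]+)
The code below uses it and gets 50 ASINs from each page: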
import requests
import re

urls = [
    'https://www.amazon.it/gp/bestsellers/apparel/',
    'https://www.amazon.it/gp/bestsellers/electronics/',
    'https://www.amazon.it/gp/bestsellers/books/'
]
for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    # A set drops ASINs that appear in more than one link on the page
    asins = set(re.findall(r'/[^/]+/dp/([^"?]+)', decoded_content))
    print(asins)
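If you want to see the difference between the two patterns without hitting Amazon, here is a minimal offline sketch. The HTML fragment and the ASIN values are made up for illustration, and extract_asins is just a hypothetical helper name, not part of any library:

```python
import re

# Hypothetical one-line HTML fragment; the product names and ASINs are invented.
sample_html = (
    '<a href="/Some-Product/dp/B000000001?ref=x">A</a>'
    '<a href="/Other-Product/dp/B000000002">B</a>'
    '<a href="/Some-Product/dp/B000000001">A again</a>'
)

# Pattern from the question: the greedy .* runs to the last /dp/ in the text,
# so the whole fragment yields a single match.
greedy = re.findall(r'/.*/dp/(.*?)"', sample_html)


def extract_asins(html):
    """Return the unique ASIN-like strings that follow /<segment>/dp/ in html."""
    # [^/]+ keeps the match inside one path segment;
    # [^"?]+ stops at the closing quote or at a query string.
    return set(re.findall(r'/[^/]+/dp/([^"?]+)', html))


fixed = extract_asins(sample_html)

print(greedy)         # ['B000000001']
print(sorted(fixed))  # ['B000000001', 'B000000002']
```

Note that the second pattern would still capture a trailing path such as /ref=... if the link has one after the ASIN, so you may need to tighten the character class depending on what the real pages contain.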
Upvotes: 1