Reputation: 1573
I have the following product page and I'm trying to get the ASIN from it (in this case, ASIN=B014MHZ90M), but I have no idea how to extract it from the page.
I'm using Python 3.4, Scrapy and the following code:
hxs = Selector(response)
product_name = "".join(hxs.xpath('//span[contains(@class,"a-text-ellipsis")]/a/text()').extract())
product_model = hxs.xpath('//body//div[@id="buybox_feature_div"]//form[@method="post"]/input[@id="ASIN"/text()').extract()
This way I don't get the required field (the ASIN).
A second question: is there a way to debug such code? I'm using PyCharm, but I could not use the debugger; I can only run the code, without seeing what's going on in 'slow motion'.
Upvotes: 3
Views: 5352
Reputation: 5941
https://www.amazon.com/gp/seller/asin-upc-isbn-info.html
Amazon Standard Identification Numbers (ASINs) are unique blocks of 10 letters and/or numbers that identify items.
Your best option, and probably the easiest one, is to run a regex on the URL that looks for a 10-character string between two "/" characters:
'/\w{10}/'
You can then simply strip the "/" characters from the result.
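A minimal sketch of that idea in Python, assuming the product URL looks like the one in the question. The helper name `asin_from_url` and the sample URL are illustrative, not part of the original answer. The pattern is narrowed to uppercase letters and digits (an assumption about how ASINs look) so that an unrelated 10-character path segment is less likely to match, and a lookahead lets it find an ASIN that is the last segment of the URL.

```python
import re

# Assumption: an ASIN is 10 uppercase letters and/or digits.
# The lookahead accepts "/", "?", "#", or end of string after the ASIN,
# so a trailing slash is not required.
ASIN_RE = re.compile(r"/([A-Z0-9]{10})(?=[/?#]|$)")


def asin_from_url(url):
    """Return the first path segment that looks like an ASIN, or None."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None


print(asin_from_url("http://www.amazon.com/dp/B014MHZ90M/"))
```

Because the slashes sit outside the capturing group, there is nothing to strip afterwards.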
Upvotes: 0
Reputation: 672
I use this:
re.match(r"https?://www\.amazon\.(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*", url, flags=re.IGNORECASE)
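For illustration, here is one way that match could be wrapped so the ASIN comes back through the named group. The function name `extract_asin` and the sample URLs are hypothetical; treat this as a sketch that only covers the `/dp/` and `/gp/product/` URL shapes the pattern names.

```python
import re

# The pattern from the answer, compiled once, with the literal dots escaped.
AMAZON_URL_RE = re.compile(
    r"https?://www\.amazon\.(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*",
    flags=re.IGNORECASE,
)


def extract_asin(url):
    """Return the value of the 'asin' named group, or None if the URL does not match."""
    match = AMAZON_URL_RE.match(url)
    return match.group("asin") if match else None


print(extract_asin("http://www.amazon.com/dp/B014MHZ90M"))
print(extract_asin("https://www.amazon.com/gp/product/B014MHZ90M?ref=x"))
```

Checking for `None` before calling `.group()` matters here, since `re.match` returns `None` for any URL that does not fit the pattern.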
Upvotes: 2
Reputation:
You can get that from the URL.
import re
r = re.search(r'www\.amazon\.com/dp/([^/]+)', response.url)
print(r.group(1))
Upvotes: 0
Reputation: 976
Looking at the Amazon page you linked, the ASIN appears in the "Product Details" section. In the Scrapy shell, the following XPath
response.xpath('//li[contains(.,"ASIN: ")]//text()').extract()
returns
[u'ASIN: ', u'B014MHZ90M']
For debugging XPaths I always use the Scrapy shell and Firebug for Firefox.
Upvotes: 3
Reputation:
You can extract B014MHZ90M from response.url:
response.url.split("/dp/")[1]  # 'B014MHZ90M'
response.url.split("/dp/")[0]  # 'http://www.amazon.com'
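The plain split returns everything after "/dp/", so a URL with a trailing path or query string would give more than the ASIN. A slightly more defensive sketch, assuming the ASIN is the path segment right after "/dp/" (the helper name `asin_after_dp` and the sample URL are made up for illustration):

```python
from urllib.parse import urlsplit


def asin_after_dp(url):
    """Return the path segment that follows '/dp/', or None if there is none."""
    path = urlsplit(url).path  # the path only, without ?query or #fragment
    if "/dp/" not in path:
        return None
    # Take what follows "/dp/" and cut it at the next slash, if any.
    return path.split("/dp/", 1)[1].split("/", 1)[0] or None


print(asin_after_dp("http://www.amazon.com/dp/B014MHZ90M/ref=sr_1_1?ie=UTF8"))
```

In a spider this would be called as `asin_after_dp(response.url)`.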
Upvotes: 4