Reputation: 25
EDIT: Solved! For anyone who runs into this while learning: the answer is below, nicely explained by Paul.
This is my first question here, and I have searched for two days to no avail. I am trying to scrape a particular retail website to get each product's name and price.
I currently have one spider working on one retail website; on another retail website it only partly works. I can get the product name correctly, but I cannot get the price in the right format.
First, this is my current spider code:
import scrapy
from projectname.items import projectItem

class spider_whatever(scrapy.Spider):
    name = "whatever"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="container"]')
        product = requests.xpath('.//*[@class="productname"]/text()').extract()
        price = requests.xpath('.//*[@class="price"]').extract() #Issue lies here.
        itemlist = []
        for product, price in zip(product, price):
            item = projectItem()
            item['product'] = product.strip().upper()
            item['price'] = price.strip()
            itemlist.append(item)
        return itemlist
Now the target HTML for the price is:
<div id="listPrice1" class="price">
$622 <div class="cents">.00</div>
</div>
As you can see, not only is it messy, but the div I want to reference also contains another div. When I try this:
price = requests.xpath('.//*[@class="price"]/text()').extract()
It spits out this:
product,price
some_product1, $100
some_product2,
some_product3, $200
some_product4,
When it is supposed to spit out:
product,price
some_product1, $100
some_product2, $200
some_product3, $300
some_product4, $400
I think it is also extracting div class="cents" and assigning that to the next product, which pushes every following value down by one.
When I try and scrape the data via Google Docs Spreadsheet, it puts the product in one column, and the price is split into two columns; the first being the $amount and the second being the .00 cents as shown below:
product,price,cents
some_product1, $100, .00
some_product2, $200, .00
some_product3, $300, .00
some_product4, $400, .00
So my question is: how do I separate a div nested inside a div? Is there a way of excluding it in the XPath, or can I filter it out when I parse the data? If I can filter it out, how would I do this?
Any help is much appreciated. Please understand, I am relatively new to Python and am trying my best to learn.
Upvotes: 0
Views: 5558
Reputation: 20748
Let's explore a few different XPath patterns:
>>> import scrapy
>>> selector = scrapy.Selector(text="""<div id="listPrice1" class="price">
... $622 <div class="cents">.00</div>
... </div>""")
# /text() will select all text nodes that are direct children of the context node,
# here any element with class "price";
# there are 2 of them
>>> selector.xpath('.//*[@class="price"]/text()').extract()
[u'\n $622 ', u'\n ']
# if you wrap the context node inside the "string()" function,
# you'll get the string representation of the node,
# basically a concatenation of all its descendant text nodes
>>> selector.xpath('string(.//*[@class="price"])').extract()
[u'\n $622 .00\n ']
# using "normalize-space()" instead of "string()"
# also strips leading and trailing whitespace and collapses
# each run of whitespace into 1 space character
>>> selector.xpath('normalize-space(.//*[@class="price"])').extract()
[u'$622 .00']
# you could also ask for the 1st text node under the element with class "price"
>>> selector.xpath('.//*[@class="price"]/text()[1]').extract()
[u'\n $622 ']
# space-normalized version of that may do what you want
>>> selector.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
[u'$622']
>>>
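The two text nodes in the first result above also seem to explain the shifted prices in the question: each price element contributes a second, whitespace-only node, and zip() pairs that node with the next product. Here is a minimal plain-Python sketch of that effect. The product names and dollar amounts are the placeholder values from the question, and the list of text nodes is an assumption about what the flat /text() query returns for four products.

```python
# Hypothetical data: product names from the question, plus the text nodes a flat
# './/*[@class="price"]/text()' query is assumed to return (two per price element).
products = ["some_product1", "some_product2", "some_product3", "some_product4"]
price_text_nodes = [
    "\n $100 ", "\n ",
    "\n $200 ", "\n ",
    "\n $300 ", "\n ",
    "\n $400 ", "\n ",
]

# zip() pairs items by position and stops at the shorter list, so every
# second product ends up paired with a whitespace-only node.
misaligned = [(name, text.strip()) for name, text in zip(products, price_text_nodes)]

# Dropping the whitespace-only nodes leaves one price per product again.
prices = [text.strip() for text in price_text_nodes if text.strip()]
aligned = list(zip(products, prices))

print(misaligned)
print(aligned)
```

Filtering empty strings works here, but it still relies on both lists staying in step, so selecting the price relative to each container is the sturdier option.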
So, in the end, you may be after this pattern:
def parse(self, response):
    sel = scrapy.Selector(response)
    requests = sel.xpath('//div[@class="container"]')
    itemlist = []
    for r in requests:
        item = projectItem()
        # each XPath is evaluated relative to one container, so name and price stay paired
        item['product'] = r.xpath('normalize-space(.//*[@class="productname"])').extract()
        item['price'] = r.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
        itemlist.append(item)
    return itemlist
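If you want to try the per-container idea without running a spider, here is a rough sketch using only the standard library's xml.etree.ElementTree instead of Scrapy. It assumes the markup is well-formed XML (real pages often are not), the two-product document is made up for illustration, and parse_items is a hypothetical helper name. ElementTree's .text is only roughly comparable to text()[1], since it holds just the text before the first child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question's snippet; the class names come
# from the spider, the product data is invented for this example.
MARKUP = """
<body>
  <div class="container">
    <div class="productname"> some_product1 </div>
    <div id="listPrice1" class="price">
      $100 <div class="cents">.00</div>
    </div>
  </div>
  <div class="container">
    <div class="productname"> some_product2 </div>
    <div id="listPrice2" class="price">
      $200 <div class="cents">.00</div>
    </div>
  </div>
</body>
"""


def parse_items(markup):
    """Build one item per container so name and price cannot drift apart."""
    root = ET.fromstring(markup.strip())
    items = []
    for container in root.findall('.//div[@class="container"]'):
        name = container.find('.//*[@class="productname"]')
        price = container.find('.//*[@class="price"]')
        items.append({
            # itertext() yields all descendant text, similar to XPath string();
            # split/join collapses whitespace, similar to normalize-space().
            "product": " ".join("".join(name.itertext()).split()).upper(),
            # .text is the text before the first child, so ".00" is left out.
            "price": (price.text or "").strip(),
        })
    return items


items = parse_items(MARKUP)
print(items)
```

Because each lookup starts from a single container, a missing or extra text node in one product cannot shift the values of the next one.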
Upvotes: 4