Reputation: 809
I'm using scrapy to extract product data from a website. One webpage contains multiple products. The html of interest looks like this:
<div class="product grid">
<h2 class="productname" itemprop="name">Hammer </h2>
<div class="description"> Nice hammer! </div>
</div>
<div class="product grid">
<h2 class="productname" itemprop="name">Screwdriver </h2>
<div class="description"> Cool screwdriver!</div>
</div>
Some products don't have a description and will look like this:
<div class="product grid">
<h2 class="productname" itemprop="name">Nails </h2>
</div>
Q: What would my parse method need to look like in order to extract the products and their descriptions and store them in an array or file? The array would look like this:
array = [["product1","description1"],["product2","description2"], ..., ["productN","descriptionN"]]
I know how to extract an array A that contains just the products, and I know how to extract an array B with just the descriptions. However, since some products have no description, B is shorter than A, and combining them element by element (C = A + B) would result in mismatches. So I need a way to pair each product with its description, only if it has one.
Upvotes: 1
Views: 807
Reputation: 474151
Iterate over products and locate the product names and descriptions:
$ scrapy shell file://$PWD/index.html
In [1]: [
...: (item.css(".productname::text").extract_first(),
...: item.css(".description::text").extract_first())
...: for item in response.css(".product")
...: ]
Out[1]:
[(u'Hammer', u' Nice hammer! '),
(u'Screwdriver', u'Cool screwdriver!'),
(u'Nails', None)]
Note that the description value is None when a product has no description.
Working with this HTML sample based on your examples:
<div>
<div class="product grid">
<h2 class="productname" itemprop="name">Hammer</h2>
<div class="description"> Nice hammer! </div>
</div>
<div class="product grid">
<h2 class="productname" itemprop="name">Screwdriver</h2>
<div class="description">Cool screwdriver!</div>
</div>
<div class="product grid">
<h2 class="productname" itemprop="name">Nails</h2>
</div>
</div>
Upvotes: 3