titusAdam

Reputation: 809

Scrapy: how to extract multiple matching xpaths from one page?

I'm using Scrapy to extract product data from a website. A single page contains multiple products. The HTML of interest looks like this:

<div class="product  grid">
      <h2 class="productname" itemprop="name">Hammer </h2>
      <div class="description"> Nice hammer! </div>
</div>

<div class="product  grid">
      <h2 class="productname" itemprop="name">Screwdriver </h2>
      <div class="description"> Cool screwdriver!</div>
</div>

Some products don't have a description and will look like this:

<div class="product  grid">
      <h2 class="productname" itemprop="name">Nails </h2>
</div>

Q: What would my parse method look like if I want to extract the products with their descriptions and store them in an array or a file? The array should look like this:

array = [["product1","description1"],["product2","description2"], ..., ["productN","descriptionN"]]

I know how to extract an array A that contains just the products, and I know how to extract an array B that contains just the descriptions. However, since some products have no description, B is shorter than A, and combining the two position by position would pair products with the wrong descriptions. So I need a way to match each product with its own description, and only if it has one.
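To make the mismatch concrete, here is a minimal sketch with hard-coded lists standing in for the two extraction results. The list contents, including the fourth product "Wrench" and its description, are made up for illustration and do not come from the page:

```python
# Hypothetical results of two independent extractions over the whole page.
# "Nails" has no description, so the second list is one element shorter.
products = ["Hammer", "Screwdriver", "Nails", "Wrench"]
descriptions = ["Nice hammer!", "Cool screwdriver!", "Sturdy wrench!"]

# zip() pairs purely by position and stops at the end of the shorter list.
pairs = [list(pair) for pair in zip(products, descriptions)]

for name, description in pairs:
    print(name, "->", description)
```

Here "Nails" ends up with the wrench description and "Wrench" is dropped entirely, which is exactly the kind of mismatch I want to avoid.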

Upvotes: 1

Views: 807

Answers (1)

alecxe

Reputation: 474151

Iterate over the product containers and look up the name and description relative to each one:

$ scrapy shell file://$PWD/index.html
In [1]: [
   ...:     (item.css(".productname::text").extract_first(), 
   ...:      item.css(".description::text").extract_first()) 
   ...:     for item in response.css(".product")
   ...: ]
Out[1]: 
[(u'Hammer', u' Nice hammer! '),
 (u'Screwdriver', u'Cool screwdriver!'),
 (u'Nails', None)]

Note that the description comes back as None when a product does not have one, so each name stays paired with its own description.
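If you want the nested-list shape from the question with the surrounding whitespace removed, a small post-processing step could look like the sketch below. The helper name `to_rows` and its `missing` parameter are hypothetical, not part of Scrapy, and the input is hard-coded to mirror the tuples shown above:

```python
def to_rows(pairs, missing=None):
    """Turn (name, description) tuples into [name, description] lists.

    Text values are stripped of surrounding whitespace; a value of None
    (element not found) is replaced by `missing`.
    """
    rows = []
    for name, description in pairs:
        clean_name = name.strip() if name is not None else missing
        clean_description = (
            description.strip() if description is not None else missing
        )
        rows.append([clean_name, clean_description])
    return rows


# Hard-coded stand-in for the list produced in the shell session.
pairs = [
    ("Hammer", " Nice hammer! "),
    ("Screwdriver", "Cool screwdriver!"),
    ("Nails", None),
]
rows = to_rows(pairs)
print(rows)
```

Passing `missing=""` instead would give an empty string for absent descriptions, which may be more convenient when writing the rows to a CSV file.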

The shell session above was run against this HTML sample, which is based on your examples:

<div>
    <div class="product  grid">
      <h2 class="productname" itemprop="name">Hammer</h2>
      <div class="description"> Nice hammer! </div>
    </div>

    <div class="product  grid">
          <h2 class="productname" itemprop="name">Screwdriver</h2>
          <div class="description">Cool screwdriver!</div>
    </div>

    <div class="product  grid">
      <h2 class="productname" itemprop="name">Nails</h2>
    </div>
</div>

Upvotes: 3
