Reputation: 178
What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response? I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product
that extracts the information on a page, but there is more data I need to extract for the same product on another page, so at the end of this method I yield
a new request and let its callback extract those fields and return the item.
The problem is that if I write a contract for the second method, it fails because the contract's request doesn't carry the meta attribute (containing the item with most of the fields already populated). If I write a contract for the first method, I can't check that it returns the fields, because it returns a new request instead of the item.
import scrapy
from scrapy.loader import ItemLoader
# ProductItem is the project's scrapy.Item subclass

def parse_product(self, response):
    il = ItemLoader(item=ProductItem(), response=response)
    # populate the item in here

    # yield the new request sending the ItemLoader to another callback
    # (new_url points at the page with the remaining product data)
    yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

def parse_images(self, response):
    """
    @url http://foo.bar
    @returns items 1 1
    @scrapes field1 field2 field3
    """
    il = response.request.meta['item']
    # extract the new fields and add them to the item in here
    yield il.load_item()
In the example, I put the contract in the second method, but it gave me a KeyError
exception on response.request.meta['item']
. Besides, the fields field1
and field2
are populated in the first method, not in this one.
Hope it's clear enough.
Upvotes: 4
Views: 1559
Reputation: 23846
Frankly, I don't use Scrapy contracts and I don't really recommend them to anyone either. They have many issues and may someday be removed from Scrapy.
In practice, I haven't had much luck using unit tests for spiders.
For testing spiders during development, I'd enable the HTTP cache and then re-run the spider as many times as needed to get the scraping right.
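Concretely (a minimal sketch; the expiration and directory values below are just the defaults spelled out):

# settings.py -- serve repeated requests from a local disk cache
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'    # kept under the project's .scrapy directory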
For regression bugs, I had better luck using item pipelines (or spider middlewares) that do validation on-the-fly (there is only so much you can catch in early testing anyway). It's also a good idea to have some strategies for recovering from failures.
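For example, a minimal validation pipeline might look like this (the pipeline name and the required-field list are made up for illustration):

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    """Drop items that are missing fields every product should have."""
    required_fields = ('field1', 'field2', 'field3')  # hypothetical names

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                # dropped items are logged, so regressions surface in crawl stats
                raise DropItem("missing %s in %r" % (field, item))
        return item

Enable it through ITEM_PIPELINES in settings.py like any other pipeline.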
And for maintaining a healthy codebase, I'd be constantly moving library-like code out from the spider itself to make it more testable.
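For instance (the extract_product helper and its selectors are assumptions, not from the question), extraction logic pulled out of the spider can be exercised with a fabricated response:

from scrapy.http import HtmlResponse

def extract_product(response):
    # plain function: the spider calls it, and so can a unit test
    return {
        'name': response.css('h1::text').extract_first(),
        'price': response.css('.price::text').extract_first(),
    }

def test_extract_product():
    body = b'<html><h1>Widget</h1><p class="price">9.99</p></html>'
    response = HtmlResponse(url='http://example.com/p/1', body=body, encoding='utf-8')
    assert extract_product(response) == {'name': 'Widget', 'price': '9.99'}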
Sorry if this isn't the answer you're looking for.
Upvotes: 5