Reputation: 43
I've created a custom SpiderMiddleware based on OffsiteMiddleware. It is a simple copy and paste of the original class; maybe a better method exists.
I would like to collect the offsite domains that get filtered. My pipeline works.
But I don't know how to return the items to my pipeline.
Thanks for your help.
    def process_spider_output(self, response, result, spider):
        items = []
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen[spider]:
                        self.domains_seen[spider].add(domain)
                        # ***My items ===> items.append(OutboundsLinks(url = domain))***
            else:
                yield x
Upvotes: 0
Views: 1313
Reputation: 4085
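process_spider_output() must return an iterable of Request or Item objects. Since your method is already a generator, yield the item at the point where the offsite domain is detected instead of appending it to a list.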
    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen[spider]:
                        self.domains_seen[spider].add(domain)
                        # create an item here and yield it
                        yield OutboundsLinks(url=domain)
            else:
                yield x
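
To see the effect without a running crawl, here is a minimal sketch that imitates the same control flow outside Scrapy. The `Request` and `OutboundsLinks` classes below are simplified stand-ins (assumptions, not the real Scrapy classes), `should_follow` is a hypothetical hostname check, and `domains_seen` is a plain set rather than the per-spider mapping used in the question.

```python
from urllib.parse import urlparse


class Request:
    """Simplified stand-in for scrapy.Request (assumption for this sketch)."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class OutboundsLinks(dict):
    """Stand-in for the asker's item class; a real one would be a Scrapy item."""


class CollectingOffsiteMiddleware:
    def __init__(self, allowed_domains):
        self.allowed_domains = set(allowed_domains)
        self.domains_seen = set()

    def should_follow(self, request):
        # Hypothetical check: the host is an allowed domain or a subdomain of one.
        host = urlparse(request.url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in self.allowed_domains)

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x):
                    yield x
                else:
                    domain = urlparse(x.url).hostname
                    if domain and domain not in self.domains_seen:
                        self.domains_seen.add(domain)
                        # Emit one item per newly seen offsite domain.
                        yield OutboundsLinks(url=domain)
            else:
                # Items produced by the spider pass through unchanged.
                yield x


mw = CollectingOffsiteMiddleware(["example.com"])
spider_output = [
    Request("http://example.com/page"),
    Request("http://other.org/a"),
    Request("http://other.org/b"),
    {"title": "scraped item"},
]
output = list(mw.process_spider_output(None, spider_output, None))
```

In this sketch the onsite request is passed on, the two offsite requests are dropped, and a single `OutboundsLinks` item is emitted for `other.org`. In a real project, anything yielded from process_spider_output() that is not a Request continues to the item pipelines like any other item.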
Upvotes: 1