Reputation: 45
I am currently working my way through the basics of web scraping with Scrapy, and have run into an issue where items are duplicated rather than expanded.
The first page I scrape contains a set of links, each of which I need to follow in order to scrape one additional link. These links are stored as item['link'].
My issue is that when I iterate over these links, issuing a request for each one inside a loop, the results are not appended to the original item instance but are instead returned as separate items.
The results therefore look a bit like the following:
{'date': [u'29 June 2015', u'15 September 2015'],
'desc': [u'Audit Committee - 29 June 2015',
u'Audit Committee - 15 September 2015'],
'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
'pdf_url': 'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015',
'title': [u'2015']}
{'date': [u'29 June 2015', u'15 September 2015'],
'desc': [u'Audit Committee - 29 June 2015',
u'Audit Committee - 15 September 2015'],
'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
'pdf_url': 'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
'title': [u'2015']}
whereas I want them to be contained in a single object, like the following:
{'date': [u'29 June 2015', u'15 September 2015'],
'desc': [u'Audit Committee - 29 June 2015',
u'Audit Committee - 15 September 2015'],
'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
'pdf_url': [u'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
u'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
'title': [u'2015']}
Here's my current implementation (based mainly on the Scrapy tutorials):
def parse(self, response):
    for sel in response.xpath('//div[@class="lower-col-right"]'):
        item = CouncilExtractorItem()
        item['title'] = sel.xpath('header[@class="intro user-content font-set clearfix"]/h1/text()').extract()
        item['link'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/a/@href').extract()
        item['desc'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/a/h2/text()').extract()
        item['date'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/span/text()').extract()
        for url in item['link']:
            full_url = response.urljoin(url)
            request = scrapy.Request(full_url, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['pdf'] = response.url
    return item
Upvotes: 3
Views: 431
Reputation: 4501
The issue is a combination of the following two chunks of code working together:
for url in item['link']:
    full_url = response.urljoin(url)
    request = scrapy.Request(full_url, callback=self.parse_page2)
    request.meta['item'] = item
    yield request
and
def parse_page2(self, response):
    item = response.meta['item']
    item['pdf'] = response.url
    return item
You're creating a new Request for each URL, and every one of them carries the same item in its meta. Each time parse_page2 runs, it overwrites that item's 'pdf' field and then returns the item. The end result: for each URL, the item is emitted again, identical except for a different 'pdf' value.
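The effect can be reproduced without Scrapy at all. The following is a minimal sketch in which a plain dict stands in for the item and the callback signature is simplified for illustration (the real callback receives a response object); the short paths "/a" and "/b" are placeholders.

```python
def parse_page2(item, response_url):
    # Simplified stand-in for the spider callback: it overwrites the field.
    item["pdf"] = response_url
    return item


# One item, shared by every "request", as in the loop above.
item = {"link": ["/a", "/b"]}

# Each call hands back the very same dict, so earlier values are lost.
returned = [parse_page2(item, url) for url in item["link"]]

print(returned[0] is returned[1])  # both entries are one and the same object
print(returned[0]["pdf"])          # only the last URL survives
```

In a real crawl each returned item is exported as soon as its callback finishes, which is why the output shows one copy per URL rather than a single item.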
As is, Scrapy has no way of knowing what you intend to do with the item. You'll need to change your code to: A) keep track of all URLs, and only yield the item once they've all been processed, and B) append to item['pdf'] rather than overwrite it.
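One way to do both is to chain the requests: follow one link at a time, pass the item and the list of remaining links along, and emit the item only when nothing is left. Below is a framework-free sketch of that control flow, not a drop-in spider. The tuple ("request", url, meta) is a made-up stand-in for scrapy.Request(url, callback=self.parse_page2, meta=meta), run() is a hypothetical stand-in for the Scrapy engine, and example.com is a placeholder host.

```python
from collections import deque


def parse(listing_links):
    """Mimics parse(): build one item, then request only the first link."""
    item = {"link": list(listing_links), "pdf": []}
    pending = deque(item["link"])
    if not pending:
        yield item  # nothing to follow, emit the item as-is
        return
    yield ("request", pending.popleft(), {"item": item, "pending": pending})


def parse_page2(response_url, meta):
    """Mimics parse_page2(): append, then either chain on or finish."""
    item, pending = meta["item"], meta["pending"]
    item["pdf"].append(response_url)  # B) append instead of overwrite
    if pending:
        # A) links remain: hand the same item to the next request
        yield ("request", pending.popleft(), meta)
    else:
        yield item  # every link has been processed: emit exactly once


def run(listing_links):
    """Tiny stand-in for the engine: follows requests, collects items."""
    items, queue = [], deque(parse(listing_links))
    while queue:
        out = queue.popleft()
        if isinstance(out, tuple):  # a pretend request
            _, url, meta = out
            queue.extend(parse_page2("http://example.com" + url, meta))
        else:
            items.append(out)
    return items


print(run(["/a", "/b"]))
```

In an actual spider, the two callbacks would be methods, the pending list would travel in request.meta next to the item, and response.url would supply the value to append. Because the links are fetched one after another, this trades some concurrency for a single, complete item.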
Upvotes: 1
Reputation: 474003
You need to make your inner XPath expressions context-specific by prepending a dot:
for sel in response.xpath('//div[@class="lower-col-right"]'):
    item = CouncilExtractorItem()
    item['title'] = sel.xpath('.//header[@class="intro user-content font-set clearfix"]/h1/text()').extract()
    item['link'] = sel.xpath('.//div[@class="user-content"]/section[@class="listing-item"]/a/@href').extract()
    item['desc'] = sel.xpath('.//div[@class="user-content"]/section[@class="listing-item"]/a/h2/text()').extract()
    item['date'] = sel.xpath('.//div[@class="user-content"]/section[@class="listing-item"]/span/text()').extract()
    # ...
Upvotes: 1