Reputation: 257
I'm relatively new to Python, and this is my first time learning Scrapy. I've done data mining with Perl quite successfully before, but this is a whole different ballgame!
I'm trying to scrape a table and grab the columns of each row. My code is below.
items.py
from scrapy.item import Item, Field

class Cio100Item(Item):
    company = Field()
    person = Field()
    industry = Field()
    url = Field()
scrape.py (the spider)
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from cio100.items import Cio100Item

items = []

class MySpider(BaseSpider):
    name = "scrape"
    allowed_domains = ["cio.co.uk"]
    start_urls = ["http://www.cio.co.uk/cio100/2013/cio/"]

    def parse(self, response):
        sel = Selector(response)
        tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
        for table in tables:
            # print table
            item = Cio100Item()
            item['company'] = table.xpath('a/text()').extract()
            item['person'] = table.xpath('a/text()').extract()
            item['industry'] = table.xpath('a/text()').extract()
            item['url'] = table.xpath('a/@href').extract()
            items.append(item)
        return items
I'm having some trouble working out how to write the XPath selection correctly.
I think this line is the problem:
tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
When I run the scraper as shown above, I get output like this in the terminal:
2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u"\nDomino's Pizza\n"],
 'industry': [u"\nDomino's Pizza\n"],
 'person': [u"\nDomino's Pizza\n"],
 'url': [u'/cio100/2013/dominos-pizza/']}
2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nColin Rees\n'],
 'industry': [u'\nColin Rees\n'],
 'person': [u'\nColin Rees\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}
Ideally I want only one block per row, not two, with Domino's in the company slot, Colin in the person slot, and the industry captured as well, which isn't happening at the moment.
When I inspect the table with Firebug, I see h2 for columns 1 and 2 (company and person), but column 3 is an h3.
When I change the end of the tables line to h3, as follows
tables = sel.xpath('//table[@class="bgWhite listTable"]//h3')
I get this
2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nRetail\n'],
 'industry': [u'\nRetail\n'],
 'person': [u'\nRetail\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}
Here it only produces 1 block, and it's capturing Industry and the URL correctly. But it's not getting the company name or person.
Any help will be greatly appreciated!
Thanks!
Upvotes: 1
Views: 5856
Reputation: 11396
As far as the XPath goes, consider selecting the rows first and then pulling each column out relative to its row, something like:
$ scrapy shell http://www.cio.co.uk/cio100/2013/cio/
...
>>> for tr in sel.xpath('//table[@class="bgWhite listTable"]/tr'):
...     item = Cio100Item()
...     item['company'] = tr.xpath('td[2]//a/text()').extract()[0].strip()
...     item['person'] = tr.xpath('td[3]//a/text()').extract()[0].strip()
...     item['industry'] = tr.xpath('td[4]//a/text()').extract()[0].strip()
...     item['url'] = tr.xpath('td[4]//a/@href').extract()[0].strip()
...     print item
...
{'company': u'LOCOG',
 'industry': u'Leisure and entertainment',
 'person': u'Gerry Pennell',
 'url': u'/cio100/2013/locog/'}
{'company': u'Laterooms.com',
 'industry': u'Leisure and entertainment',
 'person': u'Adam Gerrard',
 'url': u'/cio100/2013/lateroomscom/'}
{'company': u'Vodafone',
 'industry': u'Communications and IT services',
 'person': u'Albert Hitchcock',
 'url': u'/cio100/2013/vodafone/'}
...
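To see why looping over the h2 elements gives two items per row while looping over tr gives one, here is a minimal sketch that runs without Scrapy (Python 3). It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, and the SAMPLE markup is an assumption loosely modelled on the table described in the question, not the real page source. The cell_text helper is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Assumed markup: one row with a rank cell, two h2 cells and one h3 cell.
SAMPLE = """
<html><body>
<table class="bgWhite listTable">
  <tr>
    <td>1</td>
    <td><h2><a href="/cio100/2013/dominos-pizza/">
Domino's Pizza
</a></h2></td>
    <td><h2><a href="/cio100/2013/dominos-pizza/">
Colin Rees
</a></h2></td>
    <td><h3><a href="/cio100/2013/dominos-pizza/">
Retail
</a></h3></td>
  </tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
table = root.find(".//table[@class='bgWhite listTable']")

# Selecting headings matches every h2 separately, so a loop over them
# runs twice for this single row and never sees the h3 cell.
headings = table.findall(".//h2")
# Selecting rows gives exactly one loop iteration per record.
rows = table.findall("tr")
print(len(headings), len(rows))


def cell_text(row, index):
    """Return the stripped link text of the index-th td (1-based) in a row."""
    return row.find(f"td[{index}]//a").text.strip()


row = rows[0]
record = {
    "company": cell_text(row, 2),
    "person": cell_text(row, 3),
    "industry": cell_text(row, 4),
    "url": row.find("td[4]//a").get("href"),
}
print(record)
```

Because every path is evaluated relative to the row, all four fields end up in the same record, regardless of whether a cell wraps its link in h2 or h3.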
Other than that, you'd be better off yielding items one by one rather than accumulating them in a list.
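A rough sketch of the yield pattern, again using xml.etree.ElementTree instead of Scrapy so it can be run on its own (Python 3). The TABLE markup and the parse_rows name are assumptions made for illustration; only the company names and URLs come from the output above. In a spider, the equivalent change would be to write `yield item` inside the for loop of parse and drop the module-level items list, which is otherwise shared by every call to parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table; the markup itself is assumed.
TABLE = ET.fromstring("""
<table>
  <tr><td><a href="/cio100/2013/locog/">LOCOG</a></td></tr>
  <tr><td><a href="/cio100/2013/lateroomscom/">Laterooms.com</a></td></tr>
</table>
""".strip())


def parse_rows(table):
    """Yield one record per row instead of appending to a shared list."""
    for tr in table.findall("tr"):
        link = tr.find("td[1]//a")
        yield {"company": link.text.strip(), "url": link.get("href")}


records = parse_rows(TABLE)   # nothing is parsed yet; this is a generator
first = next(records)         # the first row is processed on demand
rest = list(records)          # the remaining rows
print(first["company"], len(rest))
```

Each record is handed to the caller as soon as it is built, and no state is left behind between calls.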
Upvotes: 3