Reputation: 13515
This is the code for Spider1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem

class Spider1(CrawlSpider):
    domain_name = 'wc2'
    start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']
    rules = (
        Rule(SgmlLinkExtractor(allow=["hxs.select(
            '//td[@class='altRow'][1]/a/@href').re('/.a\w+')"]),
             callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        JD = FirmItem()
        JD['school'] = hxs.select(
            '//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return JD

SPIDER = Spider1()
The regex in the rules successfully pulls all the bio URLs that I want from the start URL:
>>> hxs.select(
...     '//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto',
 u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic',
 u'/kallchurch', u'/jalleyne', u'/lalonzo', u'/malthoff', u'/valvarez', u'/camon',
 u'/randerson', u'/eandreeva', u'/pangeli', u'/jangland', u'/mantczak', u'/daranyi',
 u'/carhold', u'/marora', u'/garrington', u'/jartzinger', u'/sasayama',
 u'/masschenfeldt', u'/dattanasio', u'/watterbury', u'/jaudrlicka', u'/caverch',
 u'/fayanruoh', u'/razar']
>>>
But when I run the code, I get:
[wc2] ERROR: Error processing FirmItem(school=[]) -
[Failure instance: Traceback: <type 'exceptions.IndexError'>: list index out of range
This is the FirmItem in items.py:
from scrapy.item import Item, Field

class FirmItem(Item):
    school = Field()
Can you help me understand where the index error occurs? It seems to me that it has something to do with SgmlLinkExtractor.
I've been trying to make this spider work in Scrapy for weeks. The tutorial is excellent, but I am new to Python and web programming, so I don't understand how, for instance, SgmlLinkExtractor works behind the scenes.
Would it be easier for me to write a spider with the same simple functionality using plain Python libraries? I would appreciate any comments and help.
Thanks
Upvotes: 2
Views: 2804
Reputation: 13515
I also tried putting the names scraped from the initial URL into a list and then passing each name to parse as an absolute URL, such as http://www.whitecase.com/aabbas (for /aabbas).
The following code loops over the list, but I don't know how to pass the result to parse. Do you think this is a better idea?
baseurl = 'http://www.whitecase.com'
names = ['aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

def makeurl(baseurl, names):
    for x in names:
        url = baseurl + x
        baseurl = 'http://www.whitecase.com'
        x = ''
    return url
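Would something like this be the way to hand each URL back to Scrapy? This is untested, and parse_bio is just a placeholder name I made up for a callback that would do the school extraction:

from scrapy.http import Request

# inside the spider class
def parse(self, response):
    baseurl = 'http://www.whitecase.com'
    names = ['/aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']
    for x in names:
        # schedule a request for each bio page; Scrapy calls
        # self.parse_bio with the downloaded response
        yield Request(baseurl + x, callback=self.parse_bio)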
Upvotes: 0
Reputation: 1906
The parse function is called for each match of your SgmlLinkExtractor. As Pablo mentioned, you want to simplify your SgmlLinkExtractor.
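For example, something like this sketch, which just reuses the /.a\w+ regex from your question as the allow pattern (allow takes plain regular expressions matched against the URL, not selector code):

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

rules = (
    # the pattern is matched against each extracted link's URL
    Rule(SgmlLinkExtractor(allow=[r'/.a\w+']), callback='parse'),
)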
Upvotes: 0
Reputation: 1540
SgmlLinkExtractor doesn't support selectors in its "allow" argument.
So this is wrong:
SgmlLinkExtractor(allow=["hxs.select('//td[@class='altRow'] ...')"])
This is right:
SgmlLinkExtractor(allow=[r"product\.php"])
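If you also want to limit extraction to the links inside those table cells, a restrict_xpaths argument (assuming your Scrapy version supports it) does that while allow stays a plain regex:

# sketch: restrict link extraction to the altRow cells,
# then filter the extracted URLs with a plain regex
SgmlLinkExtractor(allow=[r'/.a\w+'],
                  restrict_xpaths=['//td[@class="altRow"]'])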
Upvotes: 1