Reputation: 1
This is my first time creating a spider and, despite my efforts, it keeps returning nothing to my CSV export. My code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow=True))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href').extract()
        for site in sites:
            site = str(site)
            for clean_site in site:
                name = clean_site.xpath('//[@id=""]/span').extract()
                return name
The thing is that if I print the sites, it brings me a list of the URLs, which is OK. If I search for the name inside one of the URLs in the Scrapy shell, it finds it. The problem is when I want all the names from all the links crawled. I run it with "scrapy crawl emag > emag.csv".
Can you please give me a hint what's wrong?
Upvotes: 0
Views: 1843
Reputation: 3374
One problem might be that you have been forbidden by the site's robots.txt. You can check that in the log trace. If so, go to your settings.py and set ROBOTSTXT_OBEY = False. That solved my issue.
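A minimal sketch of the change, assuming a default project layout where settings.py sits next to the spider module:

```python
# settings.py - tell Scrapy not to fetch and honour the site's robots.txt
# before crawling (by default, newly generated projects set this to True)
ROBOTSTXT_OBEY = False
```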
Upvotes: 0
Reputation: 474171
Multiple problems in the spider:

- rules should be an iterable; there is a missing comma before the last parenthesis
- no Items are specified - you need to define an Item class and return/yield instances of it from the spider's parse() callback

Here's a fixed version of the spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item

class MyItem(Item):
    name = Field()

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    # note the trailing comma - rules must be an iterable
    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow=True), )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href')
        for site in sites:
            item = MyItem()
            # note the "*" - "//[@id=...]" without a node test is not valid XPath;
            # fill in the id you are actually targeting
            item['name'] = site.xpath('//*[@id=""]/span').extract()
            yield item
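The missing comma matters because in Python parentheses alone do not create a tuple; a minimal illustration of the difference:

```python
# Parentheses without a comma are just grouping - this is a plain string,
# not a tuple, so Scrapy cannot iterate over it as a collection of rules.
rules_wrong = ("only-rule")
# The trailing comma is what makes a one-element tuple.
rules_right = ("only-rule",)

print(type(rules_wrong).__name__)  # str
print(type(rules_right).__name__)  # tuple
```

Also, once items are yielded, you don't need to redirect stdout: Scrapy's built-in feed exports can write them directly with "scrapy crawl emag -o emag.csv".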
Upvotes: 1