user3753592

Reputation: 1

Scrapy: spider returns nothing

This is my first time creating a spider, and in spite of my efforts it keeps returning nothing to my CSV export. My code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow= True))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href').extract()
        for site in sites:
            site = str(site)

        for clean_site in site:
            name = clean_site.xpath('//[@id=""]/span').extract()
            return name

The thing is, if I print the sites, it brings me a list of the URLs, which is OK. If I search for the name inside one of the URLs in the Scrapy shell, it finds it. The problem is when I want all the names from all the crawled links. I run it with "scrapy crawl emag > emag.csv".

Can you please give me a hint about what's wrong?

Upvotes: 0

Views: 1843

Answers (2)

Gihan Gamage

Reputation: 3374

One problem might be that you have been forbidden by the site's robots.txt. You can check that in the log trace. If so, go to your settings.py and set ROBOTSTXT_OBEY = False. That solved my issue.
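For reference, this is a minimal sketch of the only change needed, assuming the default settings.py generated by scrapy startproject:

# settings.py
# Stop Scrapy from downloading and honoring robots.txt before crawling
ROBOTSTXT_OBEY = False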

Upvotes: 0

alecxe

Reputation: 474171

Multiple problems in the spider:

  • rules should be an iterable; the missing comma before the closing parenthesis makes it a single Rule instead of a tuple
  • no Items defined - you need to define an Item class and return/yield instances of it from the spider callback
  • the callback should not be named parse: CrawlSpider uses the parse method internally to implement its rule logic, so overriding it breaks the crawl

Here's a fixed version of the spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item


class MyItem(Item):
    name = Field()


class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse_item", follow=True), )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href')
        for site in sites:
            item = MyItem()
            # fill in the real element id here; the empty id is a
            # placeholder carried over from the question
            item['name'] = site.xpath('//*[@id=""]/span').extract()
            yield item
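As a side note, instead of redirecting stdout, let Scrapy's feed exporter write the file so the item fields end up as proper CSV columns:

scrapy crawl emag -o emag.csv -t csv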

Upvotes: 1
