hh54188
hh54188

Reputation: 15626

Confused about running Scrapy from within a Python script

Following document, I can run scrapy from a Python script, but I can't get the scrapy result.

This is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban" 
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        # print sites
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)

        return items

Notice the last line, I try to use the returned parse result, if I run:

 scrapy crawl douban

the terminal could print the return result

But I can't get the return result from the Python script. Here is my Python script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stoped")

I try to get the result at the reactor.run(), but it return nothing,

How can I get the result?

Upvotes: 6

Views: 6352

Answers (3)

Ixio
Ixio

Reputation: 527

I found your question while asking myself the same thing, namely: "How can I get the result?". Since this wasn't answered here I endeavoured to find the answer myself and now that I have I can share it:

items = []
def add_item(item):
    items.append(item)
dispatcher.connect(add_item, signal=signals.item_passed)

Or for scrapy 0.22 (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script) replace the last line of my solution by:

crawler.signals.connect(add_item, signals.item_passed)

My solution is freely adapted from http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/.

Upvotes: 5

Zubair Alam
Zubair Alam

Reputation: 8937

in my case, i placed the script file at scrapy project level e.g. if scrapyproject/scrapyproject/spiders then i placed it at scrapyproject/myscript.py

Upvotes: 0

alecxe
alecxe

Reputation: 473763

Terminal prints the result because the default log level is set to DEBUG.

When you are running your spider from the script and call log.start(), the default log level is set to INFO.

Just replace:

log.start()

with

log.start(loglevel=log.DEBUG)

UPD:

To get the result as string, you can log everything to a file and then read from it, e.g.:

log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)

reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print result

Hope that helps.

Upvotes: 8

Related Questions