Ahsan aslam

Reputation: 1199

Get Scrapy crawler output/results in script file function

I am using a script file to run a spider within a Scrapy project, and the spider logs the crawler output/results. But I want to use the spider output/results in that script file, in some function. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output in the spider_output function? Is it possible to get the output/results?

Upvotes: 16

Views: 11471

Answers (5)

Ahsan aslam

Reputation: 1199

Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.signalmanager import dispatcher
from scrapy.utils.project import get_project_settings

# Import your own spider class here; the module path below is just a placeholder.
# from my_project.spiders import MySpider


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

Upvotes: 31

Kenny Aires

Reputation: 1438

This will return all the results of a spider as a list, using the scrapyscript package:

from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings


def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])
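
A minimal usage sketch, assuming MySpider is a spider class in your own project that accepts a url keyword argument (scrapyscript passes Job keyword arguments to the spider constructor); the import path and argument name are placeholders:

# Hypothetical example: 'MySpider' and its 'url' argument are placeholders
# for a spider defined in your own project.
from my_project.spiders import MySpider

# Run the spider and collect its yielded items as a list
items = get_spider_output(MySpider, url='https://example.com')
print(items)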

Upvotes: 0

wiltonsr

Reputation: 1027

This is an old question, but for future reference: if you are working with Python 3.6+, I recommend using scrapyscript, which allows you to run your spiders and get the results in a very simple way:

from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))

Output:

[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]

Upvotes: 8

d3p4n5hu

Reputation: 421

My advice is to use the Python subprocess module to run the spider from the script, rather than the method provided in the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs and even statements that you print from inside the spider.

In Python 3, execute the spider with subprocess.run. For example:

import subprocess

# 'command' is defined below
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    # check the error using 'process.stderr'
    error = process.stderr.decode('utf-8')

Setting stdout/stderr to subprocess.PIPE allows the output to be captured, so it is important to set these arguments. Here, command should be a sequence or a string (if it's a string, call the run method with one more parameter: shell=True). For example:

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah'  # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah')  # without shell=True

Also, process.stdout will contain the output from the script, but it will be of type bytes; you need to convert it to str using decode('utf-8').
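
To get structured items back rather than raw log text, one option (a sketch, not part of the original answer) is to combine the subprocess call with Scrapy's -o feed-export flag and load the exported file afterwards; 'website' is the spider name from the examples above and 'items.json' is a placeholder path:

import json
import subprocess

# Export the scraped items to a JSON file via Scrapy's feed exports.
# Note: '-o' appends to an existing file, so remove 'items.json' between runs.
command = ['scrapy', 'crawl', 'website', '-o', 'items.json']
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

if process.returncode == 0:
    with open('items.json') as f:
        items = json.load(f)  # a list of dicts, one per scraped item
else:
    print(process.stderr.decode('utf-8'))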

Upvotes: 0

Granitosaurus

Reputation: 21406

AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

And the crawler doesn't store results anywhere other than outputting them to the logger.

However, returning output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can simply devise a pipeline that saves your items to a file and read that file in your spider_output. You will receive your results, since reactor.run() blocks your script until the output file is complete anyway.
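
A minimal sketch of that idea, assuming a JSON Lines pipeline; the class name, file name, and the spider_output signature below are placeholders rather than part of the original answer:

import json

# pipelines.py -- write every scraped item as one JSON line
# (enable it through the ITEM_PIPELINES setting of your project).
class JsonLinesWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item


# In the script, after reactor.run() has returned:
def spider_output():
    with open('items.jl') as f:
        return [json.loads(line) for line in f]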

Upvotes: 1
