user5436441

Reputation:

How to run a Scrapy spider from AWS Lambda?

I'm trying to run a Scrapy spider from within AWS Lambda. Here is my current script, which scrapes test data.

import boto3
import scrapy
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = [
        'http://books.toscrape.com/'
    ]

    def parse(self, response):
        for link in response.xpath('//article[@class="product_pod"]/div/a/@href').extract():
            yield response.follow(link, callback=self.parse_detail)
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.xpath('//div[contains(@class, "product_main")]/h1/text()').extract_first()
        price = response.xpath('//div[contains(@class, "product_main")]/'
                               'p[@class="price_color"]/text()').extract_first()
        availability = response.xpath('//div[contains(@class, "product_main")]/'
                                      'p[contains(@class, "availability")]/text()').extract()
        availability = ''.join(availability).strip()
        upc = response.xpath('//th[contains(text(), "UPC")]/'
                             'following-sibling::td/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'availability': availability,
            'upc': upc
        }

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(BookSpider)
    process.start() # the script will block here until the crawling is finished

    data = open('result.json', 'rb')
    s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)
    print('All done.')

if __name__ == "__main__":
    main('', '')

I first tested this script locally and it ran as expected, scraping the data, saving it to 'result.json', and then uploading that file to my S3 bucket.

Then I configured my AWS Lambda function by following the guide here: https://serverless.com/blog/serverless-python-packaging/, and the Scrapy library now imports successfully within AWS Lambda.

However, when the script runs on AWS Lambda, it does not scrape any data and simply throws an error saying that result.json does not exist.

Advice from anyone who has set up Scrapy on Lambda, has a workaround, or can point me in the right direction would be highly appreciated.

Thanks.

Upvotes: 14

Views: 11351

Answers (1)

Joe

Reputation: 326

Just came across this whilst looking for something else, but off the top of my head...

Lambdas provide temp storage in /tmp, so I would suggest setting

'FEED_URI': '/tmp/result.json'

And then

data = open('/tmp/result.json', 'rb')
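
Putting both changes into the handler from the question, a rough sketch might look like this (reusing BookSpider, the bucket name, and the settings from the question; untested, so treat it as a starting point rather than a drop-in fix):

import boto3
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        # /tmp is the only writable filesystem path inside a Lambda container
        'FEED_URI': '/tmp/result.json'
    })

    process.crawl(BookSpider)  # BookSpider as defined in the question
    process.start()  # blocks until the crawl is finished

    # Read the feed back from /tmp rather than the read-only working directory
    with open('/tmp/result.json', 'rb') as data:
        s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)
    print('All done.')

One caveat worth knowing about: Twisted's reactor can't be restarted within a process, so if a warm Lambda container handles a second invocation, process.start() will raise ReactorNotRestartable.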

There are likely all sorts of best practices around using temp storage in lambdas, so I'd suggest spending a bit of time reading up on those.
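
To give one concrete example of the kind of thing to watch for: /tmp persists between invocations on a warm container and is capped at 512 MB by default, so using a unique filename and cleaning up afterwards avoids collisions and a slowly filling disk. A rough sketch (run_crawl and upload_to_s3 are hypothetical helpers standing in for the crawl and upload code above):

import os
import uuid

def scrape_and_upload():
    # Unique path per invocation: /tmp survives between warm invocations,
    # so a fixed filename could clash with a previous run's leftover file.
    feed_path = '/tmp/result-{}.json'.format(uuid.uuid4())
    try:
        run_crawl(feed_path)     # hypothetical helper: run the spider with FEED_URI=feed_path
        upload_to_s3(feed_path)  # hypothetical helper: the s3.put_object call above
    finally:
        # /tmp space is limited (512 MB by default), so remove the file when done
        if os.path.exists(feed_path):
            os.remove(feed_path)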

Upvotes: 7
