Reputation:
I'm trying to run a Scrapy spider from within AWS Lambda. Here is what my current script, which scrapes test data, looks like:
import boto3
import scrapy
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = [
        'http://books.toscrape.com/'
    ]

    def parse(self, response):
        for link in response.xpath('//article[@class="product_pod"]/div/a/@href').extract():
            yield response.follow(link, callback=self.parse_detail)
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.xpath('//div[contains(@class, "product_main")]/h1/text()').extract_first()
        price = response.xpath('//div[contains(@class, "product_main")]/'
                               'p[@class="price_color"]/text()').extract_first()
        availability = response.xpath('//div[contains(@class, "product_main")]/'
                                      'p[contains(@class, "availability")]/text()').extract()
        availability = ''.join(availability).strip()
        upc = response.xpath('//th[contains(text(), "UPC")]/'
                             'following-sibling::td/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'availability': availability,
            'upc': upc
        }

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })
    process.crawl(BookSpider)
    process.start()  # the script will block here until the crawling is finished
    data = open('result.json', 'rb')
    s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)
    print('All done.')

if __name__ == "__main__":
    main('', '')
I first tested this script locally, and it ran as normal, scraping the data, saving it to 'result.json', and then uploading that file to my S3 bucket.
Then I configured my AWS Lambda function by following the guide here: https://serverless.com/blog/serverless-python-packaging/, and it successfully imports the Scrapy library within AWS Lambda for execution.
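For context, the packaging setup from that guide boils down to a serverless.yml roughly like the following (a minimal sketch; the service name, runtime, and handler path are placeholders for my actual values):

service: scrapy-crawler

provider:
  name: aws
  runtime: python3.6

functions:
  main:
    handler: handler.main

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: non-linux

with Scrapy and boto3 pinned in a requirements.txt next to it.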
However, when the script is run on AWS Lambda, it does not scrape any data and simply throws an error saying result.json does not exist.
Any advice from anyone who has configured Scrapy to run on Lambda, or who has a workaround or can point me in the right direction, would be highly appreciated.
Thanks.
Upvotes: 14
Views: 11351
Reputation: 326
Just came across this whilst looking for something else, but off the top of my head...
Lambdas provide temp storage in /tmp (the rest of the Lambda filesystem is read-only, which is why writing result.json to the working directory fails), so I would suggest setting
'FEED_URI': '/tmp/result.json'
And then
data = open('/tmp/result.json', 'rb')
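Putting the two changes together, the handler portion of the question's script would look something like this (an untested sketch; it reuses BookSpider, the s3 client, and BUCKET from the question, with everything else unchanged):

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        # /tmp is the only writable location inside a Lambda execution environment
        'FEED_URI': '/tmp/result.json'
    })
    process.crawl(BookSpider)
    process.start()  # blocks until the crawl is finished
    # read the feed back out of /tmp and push it to S3
    data = open('/tmp/result.json', 'rb')
    s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)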
There are likely all sorts of best practices around using temp storage in Lambdas, so I'd suggest spending a bit of time reading up on those.
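For example, /tmp can persist between invocations when Lambda reuses a warm container, so one precaution (my own habit, not from any official guide; upload_and_cleanup is just an illustrative helper name, and it reuses the s3 client from the question) is to close the file properly and delete it once the upload is done:

import os

def upload_and_cleanup(path, bucket, key):
    # upload the feed file, then remove it so a reused (warm)
    # container doesn't accumulate stale output in /tmp
    with open(path, 'rb') as data:
        s3.put_object(Bucket=bucket, Key=key, Body=data)
    os.remove(path)

# e.g. upload_and_cleanup('/tmp/result.json', BUCKET, 'result.json')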
Upvotes: 7