Chitresh Sinha

Reputation: 67

AWS Lambda - scrapy library is not working (cannot import name certificate_transparency)

I want to use an AWS Lambda function to scrape a website. The crawler code is in Python and uses the Scrapy library, installed via pip.

To run it on Lambda I had to build a zip of the dependencies (here only Scrapy) on the public Amazon Linux AMI version amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2, as per their documentation here, add my handler code to the archive, and upload it to create the Lambda function.
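
Roughly, the packaging steps look like this (a sketch of the idea, not the exact commands from the AWS docs; my_lambda_function.py is my handler file):

pip install scrapy -t ./build          # install the dependencies into a local folder
cp my_lambda_function.py ./build       # add the handler alongside them
cd ./build && zip -r ../lambda.zip .   # zip everything at the root of the archive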

Now, when I invoke the lambda function it gives me the following error:

cannot import name certificate_transparency: ImportError
Traceback (most recent call last):
  File "/var/task/my_lambda_function.py", line 120, in my_lambda_handler
    return get_data_from_scrapy(username, password)
  File "/var/task/my_lambda_function.py", line 104, in get_data_from_scrapy
    process.crawl(MyScrapyFunction)
  File "/var/task/scrapy/crawler.py", line 167, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/var/task/scrapy/crawler.py", line 195, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/var/task/scrapy/crawler.py", line 200, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/var/task/scrapy/crawler.py", line 52, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/var/task/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/var/task/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/var/task/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/var/task/scrapy/extensions/memusage.py", line 16, in <module>
    from scrapy.mail import MailSender
  File "/var/task/scrapy/mail.py", line 22, in <module>
    from twisted.internet import defer, reactor, ssl
  File "/var/task/twisted/internet/ssl.py", line 59, in <module>
    from OpenSSL import SSL
  File "/var/task/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import crypto, SSL
  File "/var/task/OpenSSL/crypto.py", line 12, in <module>
    from cryptography import x509
  File "/var/task/cryptography/x509/__init__.py", line 7, in <module>
    from cryptography.x509 import certificate_transparency
ImportError: cannot import name certificate_transparency

Following are the dependency/library versions (all latest) that I'm using:

Any help would be appreciated. Thanks in advance.

Upvotes: 2

Views: 2031

Answers (3)

Mo Hajr

Reputation: 1332

As Ivan mentioned, the issue here arises from the required C dependencies for the Python packages.

Fortunately, AWS publishes an amazonlinux Docker image that is nearly identical to the AMI that Lambda functions use; here is an article that I used myself and that explains this in more detail.

Here is the Docker configuration that I used to build my Scrapy project and package it for Lambda:

FROM amazonlinux:latest

# Build tools and headers needed to compile the C extensions
# (e.g. cryptography and lxml) against the same libraries Lambda uses
RUN yum -y install git \
    gcc \
    openssl-devel \
    bzip2-devel \
    libffi \
    libffi-devel \
    python3-devel \
    python37 \
    zip \
    unzip \
    && yum clean all

RUN python3 -m pip install --upgrade pip

# Project sources (including package.sh) live under ./src on the host
COPY src /io

CMD sh /io/package.sh

And here is the package.sh file:

#!/bin/bash

# Install the Python dependencies into a staging folder
mkdir holder
python3 -m pip install scrapy OTHER-REPOS -t holder

# Copy in the project sources and zip everything into the mounted volume
rm -f /packages/lambda.zip
cp -r /io/* holder
cd holder
zip -r /packages/lambda.zip *

And this is how I build the image and run it with a volume, to get the deployment package zip file after it finishes:

docker build -t TAG_NAME_HERE .
docker run --rm -v ${PWD}/deployment_package:/packages -t TAG_NAME_HERE
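
Once the container exits, deployment_package/lambda.zip is what gets uploaded to Lambda; for example with the AWS CLI (assuming it is configured and the function already exists, the function name below is just a placeholder):

aws lambda update-function-code \
    --function-name YOUR_FUNCTION_NAME \
    --zip-file fileb://deployment_package/lambda.zip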

Hope this helps.

Upvotes: 0

Ivan Peng

Reputation: 609

I don't know if you ever ended up solving this, but the issue arises from the lxml library. It requires C dependencies to build properly, which gives Lambda a plethora of problems since they are dependent on the OS. I'm deploying Scrapy through Serverless on AWS, and I used two things to solve it: the serverless-python-requirements plugin and the dockerizePip: non-linux setting. This forces Serverless to build the package in a Docker container, which provides the correct binaries. Note that this is also the solution for getting NumPy, SciPy, Pandas, etc., in addition to lxml, to work on AWS Lambda. Here's a blog post that I followed to get it working: https://serverless.com/blog/serverless-python-packaging/
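
As a rough sketch of that workflow (assuming you already have a Serverless service set up and Docker running locally, which the plugin's docker mode requires):

npm install --save-dev serverless-python-requirements   # add the plugin to the project
# enable it under plugins in serverless.yml and set dockerizePip: non-linux
# under custom -> pythonRequirements (the blog post above shows the full file)
serverless deploy                                        # deps are built inside a Lambda-like container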

Serverless is nice if you don't want to deal with making the zip file yourself. If you do, here's a Stack Overflow link that shows how you can solve the problem with lxml: AWS Lambda not importing LXML

Upvotes: 5

GoTrained

Reputation: 303

I would not use AWS Lambda for such complicated tasks. Why did you choose it? If it's because it is free, you have several better options:

  • AWS gives one year of free access to all its services for new accounts.
  • AWS Lightsail gives you a free month for the minimum plan.
  • PythonAnywhere.com offers you a free account. I tried Scrapy on PythonAnywhere and it works perfectly. Just please note that the "continuous" running time is up to 2 hours for free accounts and 6 hours for paid accounts (according to their Support).
  • ScrapingHub.com gives you one free crawler. Check the video called "Deploying Scrapy Spider to ScrapingHub" - the video is available for free preview under this course "Scrapy: Powerful Web Scraping & Crawling with Python".

I hope this helps. If you have questions, please let me know.

Upvotes: 5
