Reputation: 10729
I'm trying to get Selenium-Wire to work in an AWS Lambda. I've seen very few StackOverflow entries about it, but it seems some people have been successful. My Lambda is stateless and doesn't need any other AWS feature (such as S3). It would scrape a certain page and capture the JSON response of one specific AJAX call on that page.
Here is my Dockerfile:
FROM public.ecr.aws/lambda/python:3.9
# Should I go with python:3.8 instead?
# Install the function's dependencies using file requirements.txt
# from your project folder.
RUN yum makecache
# https://stackoverflow.com/questions/73056540/no-module-named-amazon-linux-extras-when-running-amazon-linux-extras-install-epe
RUN yum install -y amazon-linux-extras
# https://stackoverflow.com/questions/72077341/how-do-you-install-chrome-on-amazon-linux-2
RUN PYTHON=python2 amazon-linux-extras install epel -y
# https://stackoverflow.com/questions/72850004/no-package-zbar-available-in-lambda-layer
RUN yum makecache
RUN yum install -y chromium
ENV CHROMIUM_PATH=/usr/bin/chromium-browser
# or RUN yum install -y google-chrome-stable
# or https://intoli.com/blog/installing-google-chrome-on-centos/
# curl https://intoli.com/install-google-chrome.sh | bash
# https://devopsqa.wordpress.com/2018/03/08/install-google-chrome-and-chromedriver-in-amazon-linux-machine/
# https://www.usessionbuddy.com/post/How-To-Install-Selenium-Chrome-On-Centos-7/
RUN yum install -y chromedriver
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}
# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]
My requirements.txt, pretty minimal:
selenium-wire==5.1.0
And my Lambda function:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
def handler(event, context):
    # https://gist.github.com/rengler33/f8b9d3f26a518c08a414f6f86109863c
    # https://github.com/wkeeling/selenium-wire/issues/131
    chrome_options = webdriver.ChromeOptions()
    chrome_option_list = {
        "disable-extensions",
        "disable-gpu",
        "no-sandbox",
        "headless",  # for Jenkins
        "disable-dev-shm-usage",  # Jenkins
        "window-size=800x600",  # Jenkins
        "window-size=800,600",
        "disable-setuid-sandbox",
        "allow-insecure-localhost",
        "no-cache",
        "user-data-dir=/tmp/user-data",
        "hide-scrollbars",
        "enable-logging",
        "log-level=0",
        "single-process",
        "data-path=/tmp/data-path",
        "ignore-certificate-errors",
        "homedir=/tmp",
        "disk-cache-dir=/tmp/cache-dir",
        "start-maximized",
        "disable-software-rasterizer",
        "ignore-certificate-errors-spki-list",
        "ignore-ssl-errors",
    }
    for chrome_option in chrome_option_list:
        chrome_options.add_argument(f"--{chrome_option}")
    selenium_options = {
        "request_storage_base_dir": "/tmp",  # Use /tmp to store captured data
        "exclude_hosts": ""
    }
    ser = Service("/usr/bin/chromedriver")
    ser.service_args = ["--verbose", "--log-path=test.log"]
    driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
    # The meat
    # ...
    return result
I built an image from the Dockerfile and uploaded it to AWS ECR. The Docker image passes the classic "it works on my machine (TM)" test: it scrapes fine in a Docker container on my laptop. However, it returns an error when I try to run it as a Lambda (based on my own image):
START RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b Version: $LATEST
[ERROR] WebDriverException: Message: Service /usr/bin/chromedriver unexpectedly exited. Status code was: 1
Traceback (most recent call last):
File "/var/task/app.py", line 43, in handler
driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
File "/var/task/seleniumwire/webdriver.py", line 218, in __init__
super().__init__(*args, **kwargs)
File "/var/task/selenium/webdriver/chrome/webdriver.py", line 80, in __init__
super().__init__(
File "/var/task/selenium/webdriver/chromium/webdriver.py", line 101, in __init__
self.service.start()
File "/var/task/selenium/webdriver/common/service.py", line 104, in start
self.assert_process_still_running()
File "/var/task/selenium/webdriver/common/service.py", line 117, in assert_process_still_running
raise WebDriverException(f"Service {self.path} unexpectedly exited. Status code was: {return_code}")
END RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b
REPORT RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b Duration: 758.10 ms Billed Duration: 1361 ms Memory Size: 128 MB Max Memory Used: 91 MB Init Duration: 602.74 ms
I also experimented with other Chrome switches, such as those mentioned in "selenium.common.exceptions.webdriverexception: message: 'chromedriver.exe' unexpectedly exited.status code was: 1", with no luck. I always get status code 1, but I couldn't find any documentation on what exactly that means. I assume it's some very blatant error.
Does anyone have a working image / Dockerfile + skeleton function I can try?
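Status code 1 here is only the exit status of the chromedriver process, so the actual reason is usually whatever the binary printed before it died. A minimal diagnostic sketch, assuming you can temporarily call a helper from the handler: `probe_binary` is a hypothetical name, not part of Selenium, and it only runs the binary directly and collects its output.

```python
import subprocess


def probe_binary(path, args=("--version",), timeout=15):
    """Run a binary directly and report its exit code and combined output.

    Hypothetical helper for debugging: it surfaces messages (missing shared
    libraries, unwritable paths, ...) that Selenium's Service wrapper hides.
    """
    try:
        completed = subprocess.run(
            [path, *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing file, a non-executable file, or a hung process.
        return {"returncode": None, "output": f"{type(exc).__name__}: {exc}"}
    return {"returncode": completed.returncode, "output": completed.stdout.strip()}
```

Calling `print(probe_binary("/usr/bin/chromedriver"))` and `print(probe_binary("/usr/bin/chromium-browser", ["--headless", "--no-sandbox", "--version"]))` at the top of the handler should put the real error into the CloudWatch log. One thing worth checking this way, as a guess rather than a confirmed cause: on Lambda only /tmp is writable, so a relative `--log-path=test.log` would point into the read-only task directory.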
Upvotes: 1
Views: 339
Reputation: 51
Well, I can't give a solid rationale for every line of this Dockerfile, since I got there through a lot of trial and error, but it works as intended:
FROM public.ecr.aws/lambda/python:3.11 AS build
RUN yum install -y \
    wget \
    unzip \
    ca-certificates && \
    update-ca-trust && \
    curl -Lo "/tmp/chromedriver-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/122.0.6261.111/linux64/chromedriver-linux64.zip" && \
    curl -Lo "/tmp/chrome-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/122.0.6261.111/linux64/chrome-linux64.zip" && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/
FROM public.ecr.aws/lambda/python:3.11
RUN yum install -y atk cups-libs gtk3 libXcomposite alsa-lib \
    libXcursor libXdamage libXext libXi libXrandr libXScrnSaver \
    libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb \
    xorg-x11-xauth dbus-glib dbus-glib-devel nss mesa-libgbm
# Copy the custom CA certificate to the container
COPY ca.crt /etc/ssl/certs/ca.crt
# Update the system's CA certificates to include the custom certificate
RUN update-ca-trust
# Install Python dependencies
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/
# Copy source code
COPY ./src ./
# Set the command to be executed when launching the container
CMD ["lambda_trigger.handler"]
As for the driver setup, you have to be very careful with the arguments you add to the options as well as with seleniumwire_options. Please try this working script:
import os
from tempfile import mkdtemp

from fake_useragent import UserAgent
from selenium.webdriver.chrome.service import Service
from seleniumwire import webdriver

# Proxy credentials. The original snippet left these undefined; reading them
# from environment variables with these names is an assumption.
PROXY_USERNAME = os.environ["PROXY_USERNAME"]
PROXY_PASSWORD = os.environ["PROXY_PASSWORD"]
PROXY_HOST = os.environ["PROXY_HOST"]
PROXY_PORT = os.environ["PROXY_PORT"]

# Random user agent string (also undefined in the original snippet)
user_agent = UserAgent().random

# add the proxy address to proxy options
proxy_options = {
    'proxy': {
        'https': f'https://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_HOST}:{PROXY_PORT}',
        'http': f'http://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_HOST}:{PROXY_PORT}',
        'no_proxy': 'localhost,127.0.0.1'  # Bypass localhost
    },
    'verify_ssl': False,
    'disable_encoding': True,  # Optional: ask servers for uncompressed responses
    'request_storage_base_dir': mkdtemp(),  # Fresh directory under /tmp for captured data
    'exclude_hosts': ''
}
# Path to ChromeDriver in Lambda
chrome_driver_path = "/opt/chromedriver"
print(f'<< {chrome_driver_path} >>')

def driver_setup():
    options = webdriver.ChromeOptions()  # for selenium/selenium-wire
    service = Service(executable_path=chrome_driver_path)
    options.binary_location = '/opt/chrome/chrome'
    options.add_argument('--headless')
    options.add_argument(f"--user-agent={user_agent}")
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-extensions')
    options.add_argument('--no-sandbox')
    options.add_argument('--no-cache')
    options.add_argument('--disable-gpu')
    options.add_argument('--window-size=1024,768')  # Chrome expects width,height
    options.add_argument(f'--user-data-dir={mkdtemp()}')
    options.add_argument('--hide-scrollbars')
    options.add_argument('--enable-logging')
    options.add_argument("--single-process")
    options.add_argument('--log-level=0')
    options.add_argument(f'--data-path={mkdtemp()}')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument(f'--homedir={mkdtemp()}')
    options.add_argument(f'--disk-cache-dir={mkdtemp()}')
    driver = webdriver.Chrome(service=service, options=options, seleniumwire_options=proxy_options)
    return driver
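The script stops at creating the driver, while the question is about capturing the JSON response of one AJAX call. A rough sketch of that last step, assuming the driver comes from `driver_setup()` and using Selenium Wire's `wait_for_request`; `capture_json` and `decode_json_body` are hypothetical helper names, and the URL and pattern in the usage note are placeholders.

```python
import gzip
import json
import zlib


def decode_json_body(body: bytes, content_encoding: str = "identity"):
    """Decompress a captured response body if needed and parse it as JSON."""
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        body = zlib.decompress(body)
    elif encoding != "identity":
        # e.g. "br"; not handled in this sketch
        raise ValueError(f"unsupported Content-Encoding: {encoding}")
    return json.loads(body.decode("utf-8"))


def capture_json(driver, page_url: str, request_pattern: str, timeout: int = 30):
    """Load a page, wait for a matching request, and return its JSON response."""
    driver.get(page_url)
    # Selenium Wire blocks here until a captured request matches the pattern
    # (a regex or substring of the URL) or the timeout expires.
    request = driver.wait_for_request(request_pattern, timeout=timeout)
    response = request.response
    if response is None:
        raise RuntimeError(f"no response captured for {request.url}")
    return decode_json_body(
        response.body,
        response.headers.get("Content-Encoding", "identity"),
    )
```

Inside the handler this would look something like `data = capture_json(driver_setup(), "https://example.com/page", "/api/items")`, followed by `driver.quit()`. With `disable_encoding` set to True as above, bodies should normally arrive uncompressed, so the gzip/deflate branches are only a safety net.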
Upvotes: 0