Ostap Didenko

Reputation: 484

Pass arguments to scrapy spider through docker run

I have a Scrapy+Selenium spider packaged in a Docker container. I want to run that container while passing some arguments to the spider. However, for some reason I receive a strange error message. I did an extensive search and tried many different options before submitting this question.

Dockerfile

FROM python:2.7

# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable

# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# install xvfb
RUN apt-get install -yqq xvfb

# install pyvirtualdisplay
RUN pip install pyvirtualdisplay

# set display port and dbus env to avoid hanging
ENV DISPLAY=:99
ENV DBUS_SESSION_BUS_ADDRESS=/dev/null

#install scrapy
RUN pip install --upgrade pip && \
    pip install --upgrade \
        setuptools \
        wheel && \
    pip install --upgrade scrapy

# install selenium
RUN pip install selenium==3.8.0

# install xlrd
RUN pip install xlrd

# install bs4
RUN pip install beautifulsoup4

ADD . /tralala/

WORKDIR tralala/
CMD scrapy crawl personel_spider_mpc -a chunksNo=$chunksNo -a chunkI=$chunkI

I guess that the problem may be in the CMD part.
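For what it's worth, the shell form of CMD (no JSON brackets) runs the command through /bin/sh -c, so the $chunksNo and $chunkI environment variables should be expanded correctly on the Docker side. A quick stand-alone check of that expansion behavior (using the same variable names as the question):

```shell
# Shell-form CMD is equivalent to: /bin/sh -c 'scrapy crawl ... -a chunksNo=$chunksNo ...'
# Simulate the expansion with plain sh:
chunksNo=10 chunkI=1 sh -c 'echo "scrapy crawl personel_spider_mpc -a chunksNo=$chunksNo -a chunkI=$chunkI"'
```

If the variables expand here, the CMD line itself is not the culprit.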

Spider init part:

class Crawler(scrapy.Spider):

    name = "personel_spider_mpc"

    allowed_domains = ['tralala.de',]

    def __init__(self, vdisplay = True, **kwargs):
        super(Crawler, self).__init__(**kwargs)
        self.chunkI = chunkI
        self.chunksNo = chunksNo

How I run the container:

docker run --env chunksNo='10' --env chunkI='1' ostapp/tralala

I tried both with quotation marks and without them.

The error message:

2018-04-04 16:42:32 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 102, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 51, in from_crawler
    spider = cls(*args, **kwargs)
  File "/tralala/tralala/spiders/tralala_spider_mpc.py", line 673, in __init__
    self.chunkI = chunkI
NameError: global name 'chunkI' is not defined

Upvotes: 1

Views: 837

Answers (1)

Seer.The

Reputation: 487

Your arguments are stored in kwargs, which is just a dictionary whose keys are the argument names and whose values are the argument values. It does not define those names as variables for you, which is why you get the NameError.

For more details, see this answer

In your specific case, try self.chunkI = kwargs['chunkI'] and self.chunksNo = kwargs['chunksNo']
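A minimal sketch of the fix, using a plain Python class as a stand-in for scrapy.Spider (Scrapy passes each -a name=value pair to the spider's constructor as a keyword argument, so the values arrive in kwargs as strings):

```python
class Crawler(object):
    name = "personel_spider_mpc"

    def __init__(self, vdisplay=True, **kwargs):
        super(Crawler, self).__init__()
        # -a chunksNo=... and -a chunkI=... land in kwargs as strings;
        # pull them out of the dict instead of referencing bare names
        self.chunkI = kwargs['chunkI']
        self.chunksNo = kwargs['chunksNo']

spider = Crawler(chunksNo='10', chunkI='1')
# spider.chunksNo == '10', spider.chunkI == '1'
```

Note the values are strings; convert with int() if the spider does arithmetic with them.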

Upvotes: 1
