Reputation: 93
I have a script that scrapes data from a list of URLs.
The script runs in a Docker container.
I would like to run multiple instances of it, for example 20.
To do that, I wanted to use docker-compose scale worker=20
and pass an INDEX to each instance so that the script knows which URLs it should scrape.
Example.
ID, URL
0 https://example.org/sdga2
1 https://example.org/fsdh34
2 https://example.org/fs4h35
3 https://example.org/f1h36
4 https://example.org/fs4h37
...
If there are 3 instances, the first instance of the script (INDEX = 0) should process the URLs whose IDs are 0, 3, 6, 9, ..., i.e. ID = INDEX + INSTANCES_NUM * k for k = 0, 1, 2, ...
I don't know how to pass INDEX to the script running in the Docker container. Of course, I could duplicate the service in docker-compose.yml with a different INDEX in each copy's environment variables, but with more than 10 or even 50 instances that would be a very bad solution.
Does anyone know how to do this?
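To make the intended split concrete, here is a minimal sketch in Python (the function name urls_for_instance is made up for illustration, and the list holds the five sample URLs from above). It assumes the ID of a URL is simply its position in the list:

```python
def urls_for_instance(urls, index, instances_num):
    """Return the URLs whose ID satisfies ID = index + instances_num * k."""
    return [url for url_id, url in enumerate(urls) if url_id % instances_num == index]


# The sample list from the question; the ID is the position in the list.
urls = [
    "https://example.org/sdga2",
    "https://example.org/fsdh34",
    "https://example.org/fs4h35",
    "https://example.org/f1h36",
    "https://example.org/fs4h37",
]

# Show which URLs each of 3 instances would take.
for index in range(3):
    print(index, urls_for_instance(urls, index, 3))
```

With 3 instances, every URL lands in exactly one instance's share, so the only missing piece is getting INDEX into each container.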
Upvotes: 5
Views: 5523
Reputation: 263577
With docker-compose, I don't believe there's any support for this. However, with swarm mode, which can use a similar compose file, you can pass {{.Task.Slot}} as an environment variable using service templates. E.g.:
version: '3'
services:
  test:
    image: busybox
    command: /bin/sh -c "echo My task number is $$task_id && tail -f /dev/null"
    environment:
      task_id: "{{.Task.Slot}}"
    deploy:
      replicas: 5
Instead of docker-compose up, I deploy with docker stack deploy -c docker-compose.yml test. My local swarm cluster is just a single node created with docker swarm init.
Then, reviewing each of these running containers:
$ docker ps --filter label=com.docker.swarm.service.name=test_test
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ccd0dbebbcbe busybox:latest "/bin/sh -c 'echo My…" About a minute ago Up About a minute test_test.3.i3jg6qrg09wjmntq1q17690q4
bfaa22fa3342 busybox:latest "/bin/sh -c 'echo My…" About a minute ago Up About a minute test_test.5.iur5kg6o3hn5wpmudmbx3gvy1
a372c0ce39a2 busybox:latest "/bin/sh -c 'echo My…" About a minute ago Up About a minute test_test.4.rzmhyjnjk00qfs0ljpfyyjz73
0b47d19224f6 busybox:latest "/bin/sh -c 'echo My…" About a minute ago Up About a minute test_test.1.tm97lz6dqmhl80dam6bsuvc8j
c968cb5dbb5f busybox:latest "/bin/sh -c 'echo My…" About a minute ago Up About a minute test_test.2.757e8evknx745120ih5lmhk34
$ docker ps --filter label=com.docker.swarm.service.name=test_test -q | xargs -n 1 docker logs
My task number is 3
My task number is 5
My task number is 4
My task number is 1
My task number is 2
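On the script side, a rough sketch of how the slot could be turned into the INDEX from the question might look like this. It assumes the slot arrives in the task_id variable as in the compose file above, and that it is 1-based (as the output above suggests), so it subtracts 1. INSTANCES_NUM is a hypothetical extra variable that you would have to set yourself and keep in sync with replicas; the helper names are also made up.

```python
import os


def instance_index(environ=os.environ):
    """Turn the 1-based swarm slot in `task_id` into a 0-based INDEX."""
    return int(environ["task_id"]) - 1


def my_url_ids(total_urls, index, instances_num):
    """IDs of the form index + instances_num * k that are below total_urls."""
    return list(range(index, total_urls, instances_num))


if __name__ == "__main__":
    # Defaults let the sketch run outside swarm. INSTANCES_NUM is a
    # hypothetical variable that must match `replicas` in the compose file.
    env = {
        "task_id": os.environ.get("task_id", "1"),
        "INSTANCES_NUM": os.environ.get("INSTANCES_NUM", "5"),
    }
    index = instance_index(env)
    print(my_url_ids(12, index, int(env["INSTANCES_NUM"])))
```

Changing the number of replicas changes which IDs belong to which slot, so this kind of static split works best when the replica count stays fixed for the whole run.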
Upvotes: 5
Reputation: 164
Why don't you use a database, such as MySQL or Redis?
Each container can fetch URLs from the database and mark the ones it has fetched as complete, always taking only URLs that are not yet complete. This can scale.
Upvotes: -1