Reputation: 737
I currently have an application that (for the sake of simplicity) just requires a .csv file. However, this file needs to be constructed with a script, let's call it create_db.py.
I have two images (let's call them API1 and API2) that require the .csv file, so I declared a 2-stage build and both images copy the .csv into their filesystem. This makes the Dockerfiles somewhat ugly, as API1 and API2 have the same first lines of the Dockerfile, plus there is no guarantee that both images have the same .csv because it is constructed "on the fly".
I have two possible solutions to this problem:
First option: build an image that runs create_db.py and then tag it as data:latest. Copy the .csv into API1 and API2 doing:
FROM data:latest as datapipeline
FROM continuumio/miniconda3:4.7.12
...
...
COPY --from=datapipeline file.csv .
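For reference, a minimal sketch of what that data image's Dockerfile could look like (the file name Dockerfile.data, and the assumption that create_db.py writes file.csv to the root of the filesystem so that COPY --from=datapipeline file.csv . finds it, are hypothetical):
# Hypothetical Dockerfile.data for the image tagged data:latest
FROM continuumio/miniconda3:4.7.12
COPY create_db.py /create_db.py
# Generate the .csv at build time so it is baked into the image;
# create_db.py is assumed to write /file.csv at the image root.
RUN python /create_db.py
It could then be built with something like docker build -t data:latest -f Dockerfile.data .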
Then I will need to create a bash file to make sure data:latest is built (and up to date) before building API1 and API2.
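That bash file could be as simple as the following sketch (image names and Dockerfile paths are assumptions):
#!/usr/bin/env bash
set -euo pipefail

# Rebuild the data image first so API1 and API2 copy an up-to-date file.csv.
docker build -t data:latest -f Dockerfile.data .

# Then build the images that COPY --from the data image.
docker build -t api1:latest -f Dockerfile.api1 .
docker build -t api2:latest -f Dockerfile.api2 .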
Pros: Data can be pulled from a repository if you are on a different machine, no need to rebuild it again.
Cons: Every time I build API1 and API2 I need to make sure that data:latest is up to date. API1 and API2 require data:latest to be used.
Second option: create a volume data/ and an image that runs create_db.py, and mount the volume so the .csv is in data/. Then mount the volume for API1 and API2. I will also need some kind of mechanism that makes sure that data/ contains the required file.
Mounting volumes sounds like the right choice when dealing with shared data, but in this case I am not sure, because my data needs "to be built" before it can be used. Should I go with the first option then?
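For comparison, the volume-based setup might look roughly like this on the command line (a sketch; the image name, paths, and the idea that create_db.py takes an output path are assumptions):
# Build the .csv into a local data/ directory using an image that contains create_db.py.
mkdir -p data
docker run --rm -v "$(pwd)/data:/data" create-db:latest python create_db.py /data/file.csv

# Mount the same directory into API1 and API2.
docker run -d -v "$(pwd)/data:/data:ro" api1:latest
docker run -d -v "$(pwd)/data:/data:ro" api2:latest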
Chosen solution, thanks to @David Maze
What I ended up doing is separating the data pipeline into its own Docker image and then COPYing from that image in API1 and API2.
To make sure that API1 and API2 always have the latest "data image" version, the data pipeline calculates the hashes of all output files and then tries to do docker pull data:<HASH>. If that fails, it means this version of the data is not in the registry, so the data image is tagged as both data:<HASH> and data:latest and pushed to the registry. This guarantees that data:latest always points to the last data pushed to the registry, and at the same time I can keep track of all the data:<HASH> versions.
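A minimal sketch of that publishing step, assuming a single output file, the data:<HASH> naming from above, and a registry the machine is already logged in to (in practice the image name would include the registry host):
#!/usr/bin/env bash
set -euo pipefail

# 1. Run the data pipeline to produce the output file(s).
python create_db.py

# 2. Hash the output to get a content-based tag (first 12 hex chars here).
HASH=$(sha256sum file.csv | cut -c1-12)

# 3. Only build, tag and push if this data version is not already in the registry.
if docker pull "data:${HASH}"; then
    echo "data:${HASH} already exists in the registry, nothing to do"
else
    docker build -t "data:${HASH}" -f Dockerfile.data .   # assumed to package the generated file.csv
    docker tag "data:${HASH}" "data:latest"
    docker push "data:${HASH}"
    docker push "data:latest"
fi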
Upvotes: 2
Views: 98
Reputation: 159697
If it’s manageable size-wise, I’d prefer baking it into the image. There’s two big reasons for this: it makes it possible to just docker run the image without any external host dependencies, and it works much better in cluster environments (Docker Swarm, Kubernetes) where sharing files can be problematic.
There’s two more changes you can make to this to improve your proposed Dockerfile. You can pass the specific version of the dataset you’re using as an ARG, which will help the situation where you need to build two copies of the image and need them to have the same dataset. You can also directly COPY --from= an image, without needing to declare it as a stage.
FROM continuumio/miniconda3:4.7.12
ARG data_version=latest
COPY --from=data:${data_version} file.csv .
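A build that pins both images to the same data version could then look like this (the tag value and Dockerfile names are hypothetical):
# Pin both API images to the same data tag.
docker build --build-arg data_version=20200101 -t api1:latest -f Dockerfile.api1 .
docker build --build-arg data_version=20200101 -t api2:latest -f Dockerfile.api2 .
# Omitting --build-arg falls back to the default, data:latest.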
I’d consider the volume approach only if the data file is really big (gigabytes). Docker images start to get unwieldy at that size, so if you have a well-defined auxiliary data set you can break out, that will help things run better. Another workable approach could be to store the datafile somewhere remote like an AWS S3 bucket, and download it at startup time (adds some risk of startup-time failure and increases the startup time, but leaves the image able to start autonomously).
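If you went the remote-storage route, the download could live in an entrypoint script along these lines (a sketch; the bucket name, target path, and the use of the AWS CLI are assumptions):
#!/usr/bin/env bash
set -euo pipefail

# Fetch the dataset at container startup; this fails fast if the bucket is unreachable.
aws s3 cp "s3://my-data-bucket/file.csv" /app/file.csv

# Hand off to the real application process.
exec "$@"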
Upvotes: 2