Reputation: 523
I am trying to use cache in my .gitlab-ci.yml file, but the pipeline time only increases (I tested by adding blank lines to trigger new pipelines). I want to cache the Python packages I install with pip. Here is the stage where I install and use these packages (the other stages use Docker):
```yaml
image: python:3.8-slim-buster

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip

stages:
  - lint
  - test
  - build
  - deploy

test-job:
  stage: test
  before_script:
    - apt-get update
    - apt-get install -y --no-install-recommends gcc
    - apt install -y default-libmysqlclient-dev
    - pip3 install -r requirements.txt
  script:
    - pytest tests/test.py
```
With each run of this pipeline, the pipeline time just increases. I was following the steps from the GitLab documentation - https://docs.gitlab.com/ee/ci/caching/#cache-python-dependencies - although I am not using venv, since it works without it. I am still not sure why the PIP_CACHE_DIR variable is needed if it is not otherwise used, but I followed the documentation.
What is the correct way to cache python dependencies? I would prefer not to use venv.
Upvotes: 12
Views: 13076
Reputation: 42582
The key to your problem lies in what you said: "I am not using venv since it works without it". It may seem to work, but it actually doesn't. In your defense, the GitLab documentation (which is generally very good) does a poor job of explaining how to properly cache Python dependencies.
The short answer is that you need to use a virtual environment (or define environment variables such as PYTHONUSERBASE to change the default site-packages path). This is because pip stores files not only in the .cache/pip directory but also in other locations. For example, the site-packages directory might live in /usr/local/lib/python3.13/site-packages, which is out of the cache's reach. In the tests I conducted, installing from the cache (.cache/pip) took just as long as a clean download.
So if your logs show something like Using cached flask-3.1.0-py3-none-any.whl.metadata, it means pip found the metadata in the cache, but it still needs to extract and install the package into site-packages (at least this is my understanding, but I am no expert in pip, so feel free to contradict me).
My understanding is that it is not easy to "convince" pip to store site-packages in the current directory so that GitLab can cache it. For that reason, one common approach is to use a virtual environment, which you can easily configure to keep all of its files in the current directory.
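A minimal sketch of this approach might look like the following (the job name, image tag, and cache key are illustrative, not taken from the question's pipeline):

```yaml
# Sketch: cache both pip's download cache and the virtual environment itself.
image: python:3.13-slim

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

test-job:
  stage: test
  cache:
    key:
      files:
        - requirements.txt    # new cache whenever dependencies change
    paths:
      - .cache/pip            # pip's download cache
      - .venv/                # the installed packages themselves
  before_script:
    - python -m venv .venv    # venv lives inside the project directory,
    - source .venv/bin/activate   # so GitLab can archive it
    - pip install -r requirements.txt
  script:
    - pytest tests/test.py
```

Because .venv/ sits inside $CI_PROJECT_DIR, it is within reach of the cache, and on a warm cache pip only has to verify that the requirements are already satisfied.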
You know that you have configured your cache correctly if you see in the logs something like:
Requirement already satisfied: Flask==3.1.0 in ./.venv/lib/python3.13/site-packages (from -r requirements.txt (line 1)) (3.1.0)
If you are looking for the long explanation, consider checking this article:
How to Cache Python Dependencies in GitLab CI/CD
Upvotes: 0
Reputation: 2462
Also: the GitLab documentation recommends defining cache on the job; in newer GitLab versions a pipeline-wide cache is meant to be set under the default: keyword rather than at the top level. This may be why your configuration does not work.
Upvotes: -1
Reputation: 585
PIP_CACHE_DIR is a pip feature that can be used to set the cache directory.
The second answer to this question explains it.
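For a quick local demonstration of what the variable does (the path here is illustrative):

```shell
# Point pip's cache at a directory inside the project so a CI runner can
# archive it between pipelines. pip also honors --cache-dir and `pip config`.
export PIP_CACHE_DIR="$PWD/.cache/pip"

# With the variable set, `pip cache dir` would report this same path.
echo "$PIP_CACHE_DIR"
```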
There may be some disagreement on this, but I think that for something like pip packages or node modules, it is often quicker to download them fresh for each pipeline. When the packages are cached by GitLab using
```yaml
cache:
  paths:
    - .cache/pip
```
the cache that GitLab creates gets zipped and stored somewhere (exactly where depends on the runner configuration). This requires zipping and uploading the cache; then, when another pipeline starts, the cache has to be downloaded and unpacked again. If using a cache slows down job execution, it might make sense to just remove the cache.
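If you want to keep a global cache but skip the download/upload for one particular job, GitLab lets you override the cache per job, for example (the job name is illustrative):

```yaml
# Sketch: a job that opts out of the globally configured cache entirely.
fast-job:
  stage: test
  cache: []          # no cache is downloaded or uploaded for this job
  script:
    - pytest tests/test.py
```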
Upvotes: 6