Dave

Reputation: 523

How to cache Python dependencies in GitLab CI/CD without using venv?

I am trying to use the cache in my .gitlab-ci.yml file, but the pipeline time only increases (I trigger new runs by adding blank lines). I want to cache the Python packages I install with pip. Here is the stage where I install and use these packages (the other stages use Docker):

image: python:3.8-slim-buster

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip

stages:
  - lint
  - test
  - build
  - deploy

test-job:
  stage: test
  before_script:
    - apt-get update
    - apt-get install -y --no-install-recommends gcc
    - apt install -y default-libmysqlclient-dev
    - pip3 install -r requirements.txt
  script:
    - pytest tests/test.py

With each run of this pipeline, the pipeline time just increases. I followed the steps in the GitLab documentation (https://docs.gitlab.com/ee/ci/caching/#cache-python-dependencies), except that I am not using venv, since it works without it. I am still not sure why the PIP_CACHE_DIR variable is needed if it is not used, but I followed the documentation.

What is the correct way to cache Python dependencies? I would prefer not to use venv.

Upvotes: 12

Views: 13076

Answers (3)

Valentin Despa

Reputation: 42582

The key to your problem lies in what you said: "I am not using venv since it works without it". It may seem to work, but it actually doesn't. In your defense, the GitLab documentation (which is generally very good) does a poor job of explaining how to properly cache Python dependencies.

The short answer is that you need to use virtual environments, or define environment variables such as PYTHONUSERBASE to change the default site-packages path. This is because pip stores files not only in the .cache/pip directory but also in other locations. For example, the site-packages directory might be located at /usr/local/lib/python3.13/site-packages, which is outside the reach of the cache. Based on the tests I conducted, installing from the cache (.cache/pip) takes just as long as doing a clean download.

So if your logs show something like Using cached flask-3.1.0-py3-none-any.whl.metadata, it means pip found the metadata in the cache, but it still needs to extract and install the package into site-packages (at least this is my understanding; I am no expert in pip, so feel free to contradict me).

My understanding is that it is not easy to "convince" pip to store the site-packages in the current directory so that GitLab can cache them. For that reason, one common approach is to use virtual environments which you can easily configure to use the current directory for all the files used.
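
For the non-venv route mentioned above, a rough, untested sketch could point PYTHONUSERBASE at a directory inside the project and install with pip's --user flag, so that the installed packages end up somewhere GitLab can cache. The .local directory name is an arbitrary choice made for this illustration, and the apt-get steps from the question are omitted for brevity:

```yaml
image: python:3.8-slim-buster

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  # Assumption: ".local" is an arbitrary directory name chosen for this sketch.
  PYTHONUSERBASE: "$CI_PROJECT_DIR/.local"

cache:
  paths:
    - .cache/pip
    - .local/

test-job:
  stage: test
  before_script:
    # --user installs into $PYTHONUSERBASE instead of /usr/local/lib/...
    - pip3 install --user -r requirements.txt
  script:
    # "python3 -m pytest" avoids relying on $PYTHONUSERBASE/bin being on PATH
    - python3 -m pytest tests/test.py
```

Whether this ends up faster than a clean install still depends on how long the runner takes to restore the cache archive.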

You know that you have configured your cache correctly if you see in the logs something like:

Requirement already satisfied: Flask==3.1.0 in ./.venv/lib/python3.13/site-packages (from -r requirements.txt (line 1)) (3.1.0)
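
A minimal sketch of the virtual-environment setup, adapted from the config in the question. The .venv directory name matches the log line above; keying the cache on requirements.txt is an optional choice added for this illustration, not something the setup requires:

```yaml
image: python:3.8-slim-buster

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  # Optional: a new cache is created whenever requirements.txt changes.
  key:
    files:
      - requirements.txt
  paths:
    - .cache/pip
    - .venv/

test-job:
  stage: test
  before_script:
    - apt-get update
    - apt-get install -y --no-install-recommends gcc default-libmysqlclient-dev
    # Create the venv inside the project directory so GitLab can cache it
    # (a no-op for already installed packages when the cache was restored).
    - python3 -m venv .venv
    - . .venv/bin/activate
    - pip install -r requirements.txt
  script:
    - pytest tests/test.py
```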

If you are looking for the long explanation, consider checking this article:

How to Cache Python Dependencies in GitLab CI/CD

Upvotes: 0

dhr_p

Reputation: 2462

Also: Gitlab documentation describes that cache should be set on the job; it cannot be set globally for the pipeline. This may cause your configuration to not work.
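
For illustration, a job-level cache along the lines this answer suggests might look like the sketch below. The key value is an assumption chosen for the example. Note that GitLab also accepts a cache defined under default:, which applies to every job, so moving the block is not guaranteed to change the behavior:

```yaml
test-job:
  stage: test
  cache:
    # Assumption: one cache per branch; any stable key would do.
    key: "pip-$CI_COMMIT_REF_SLUG"
    paths:
      - .cache/pip
  before_script:
    - pip3 install -r requirements.txt
  script:
    - pytest tests/test.py
```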

Upvotes: -1

Benjamin

Reputation: 585

PIP_CACHE_DIR is a pip feature that can be used to set the cache dir.

The second answer to this question explains it.

There may be some disagreement on this, but I think that for something like pip packages or node modules, it is quicker to download them fresh for each pipeline.

Suppose the packages are cached by GitLab using:

cache:
  paths:
    - .cache/pip

The cache that GitLab creates gets zipped and stored somewhere (where it gets stored depends on the runner config). This requires zipping and uploading the cache. Then, when another pipeline gets created, the cache needs to be downloaded and unpacked. If using a cache is slowing down job execution, it might make sense to just remove the cache.
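
If the cache turns out to cost more than it saves, one way to drop it for a single job without touching the rest of the pipeline is an empty cache list. A sketch based on the job from the question:

```yaml
test-job:
  stage: test
  # An empty list opts this job out of any cache defined at the top level.
  cache: []
  before_script:
    - pip3 install -r requirements.txt
  script:
    - pytest tests/test.py
```

Comparing the job duration with and without this line over a few runs should show whether the cache is worth keeping.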

Upvotes: 6
