Martin Peck

Reputation: 11564

How to Reduce the Install Size of a Spacy based Python Application

I'm building a Docker container that contains the Python library spacy. I'm now trying to reduce the size of this container, and spacy appears to be the main contributor to the disk size.

Without any models installed, and without any other code/dependencies etc, spacy consumes around 500MB of disk when installed! Does anyone have any useful hints/tips on installing spacy in a disk-space-friendly manner?

My repro steps are:

mkdir foo1                  # create a folder 
cd foo1                     # change directory
python3 -m venv .venv       # create virtual environment
source .venv/bin/activate   # activate virtual environment
pip install --upgrade pip   # upgrade pip
pip install spacy           # install spacy

After doing this, I then navigate into the following folder...

foo1/.venv/lib/python3.7/site-packages

... and can see that the spacy folder is very large:

$ du -sh spacy
425M    spacy

Specifically, it's the language folder that's large:

$ du -sh spacy/lang
401M spacy/lang

There are 52 languages in that folder, and for many situations I only care about one or two languages. Specifically, for my current situation, that's English.

When I look at the sizes, English is the 14th largest (only showing the top 14 in this list)...

$ du -sH spacy/lang/* | sort -n -r 

142024 spacy/lang/tr
86608 spacy/lang/pt
78368 spacy/lang/nb
76592 spacy/lang/da
74840 spacy/lang/sv
60672 spacy/lang/ca
50880 spacy/lang/es
48296 spacy/lang/fr
41688 spacy/lang/de
36960 spacy/lang/nl
34008 spacy/lang/it
32632 spacy/lang/ro
24160 spacy/lang/lt
8712 spacy/lang/en  <--- THE ONLY ONE I WANT
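
To put a number on what could be reclaimed, the listing above can be totalled for everything except the languages to keep. This is a minimal sketch, assuming a POSIX shell; reclaimable_kb is a hypothetical helper name, and it reports kilobytes via du -sk rather than the block counts shown above.

```shell
# Hypothetical helper: sum the size (in KB) of every sub-directory of
# spacy/lang except the language codes passed as keepers.
reclaimable_kb() {
    lang_dir="$1"      # e.g. .venv/lib/python3.7/site-packages/spacy/lang
    shift
    keep=" $* "        # remaining arguments: language codes to keep
    total=0
    for path in "$lang_dir"/*/; do
        [ -d "$path" ] || continue          # glob matched nothing
        name=$(basename "$path")
        case "$keep" in
            *" $name "*) continue ;;        # a language we want to keep
        esac
        kb=$(du -sk "$path" | awk '{print $1}')
        total=$((total + kb))
    done
    echo "$total"
}
```

For example, reclaimable_kb spacy/lang en would print the combined size of every language folder other than English.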

Is there a spacy-specific way of installing spacy without all of these languages?

I can hack around post-install, but is there a safer way to install fewer languages?

Versions installed, on MacOS, by the above steps are as follows:

$ pip freeze
blis==0.2.4
certifi==2019.6.16
chardet==3.0.4
cymem==2.0.2
idna==2.8
murmurhash==1.0.2
numpy==1.16.4
plac==0.9.6
preshed==2.0.1
requests==2.22.0
spacy==2.1.6
srsly==0.0.7
thinc==7.0.8
tqdm==4.32.2
urllib3==1.25.3
wasabi==0.2.2

$ python --version
Python 3.7.4

Upvotes: 3

Views: 2088

Answers (2)

interfect

Reputation: 2877

If you tack && rm -Rf foo1/.venv/lib/python3.7/site-packages/spacy/lang/tr onto the end of the RUN pip install spacy command that I presume you have in your Dockerfile, the files for that language are deleted in the same build step, so they never get saved into a layer of the Docker image.

I'm not sure if you would still have a working spacy after just ripping out the languages you didn't want, and you'd have to basically repeat the command for each language you don't want to keep, but it might work as a workaround until spacy makes itself smaller or more modular.
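
Rather than repeating the command per language, the deletion could be done in a loop. The following is only a sketch, assuming a POSIX shell: prune_spacy_langs is a hypothetical helper name (not something spacy provides), and it removes only the per-language sub-directories, leaving the plain files in spacy/lang in place. Whether spacy still imports cleanly afterwards is untested, as noted above.

```shell
# Hypothetical helper: delete every language sub-directory under spacy/lang
# except the language codes listed after the directory argument.
prune_spacy_langs() {
    lang_dir="$1"      # path to .../site-packages/spacy/lang
    shift
    keep=" $* "        # remaining arguments: language codes to keep
    for path in "$lang_dir"/*/; do
        [ -d "$path" ] || continue          # glob matched nothing
        name=$(basename "$path")
        [ "$name" = "__pycache__" ] && continue   # not a language folder
        case "$keep" in
            *" $name "*) ;;                 # keep this language
            *) rm -rf "$path" ;;            # remove everything else
        esac
    done
}
```

Calling prune_spacy_langs foo1/.venv/lib/python3.7/site-packages/spacy/lang en would keep only English. In a Dockerfile the call would need to run in the same RUN step as pip install spacy, otherwise the deleted files are still stored in the earlier layer.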

Upvotes: 1

Martin Peck

Reputation: 11564

I raised this as an issue against the spacy project on GitHub, and it looks like this is a known issue, and that there are plans to address the size of spacy installs.

https://github.com/explosion/spaCy/issues/3983

So, at this time, there isn't a supported/recommended way to reduce the size of the package install.

Upvotes: 2
