Tom Cornebize

Reputation: 1422

Installing nltk data dependencies in setup.py script

I use NLTK with WordNet in my project. I installed it manually on my PC with pip: pip3 install nltk --user in a terminal, then nltk.download() in a Python shell to download WordNet.

I want to automate these steps with a setup.py file, but I don't know a good way to install WordNet.

For now, I have this piece of code after the call to setup() ("nltk" is in the install_requires list passed to setup()):

import sys
if 'install' in sys.argv:
    import nltk
    nltk.download("wordnet")

Is there a better way to do this?

Upvotes: 16

Views: 4752

Answers (4)

Brendan Martin

Reputation: 649

This setup worked for me:

from setuptools import setup, find_packages
from setuptools.command.install import install

class InstallCommand(install):
    def run(self):
        install.run(self)
        import nltk
        nltk.download('wordnet')

setup(
    # other options...

    install_requires=['nltk'],
    setup_requires=['nltk'],
    cmdclass={
        'install': InstallCommand,
    }
)

Upvotes: 0

Álvaro H.G

Reputation: 602

As stated in this thread, external data should not be handled by setuptools in setup.py. As an alternative, I suggest adding the following lines to your package's __init__.py file (assuming you want to download the punkt and stopwords resources):

__version__ = "x.x.x"
__organization__ = "your_organization"  
import nltk 
nltk.download("stopwords") 
nltk.download("punkt")  

This way, the files are downloaded when the package is imported (i.e. on import my_package), not when it is installed.
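A minimal sketch of a guarded variant of this idea: it looks each resource up before downloading, so that importing the package only reaches for the network when something is missing. The helper name ensure_nltk_data and the REQUIRED mapping are hypothetical, and the resource paths are the usual locations for these two packages; nltk.data.find() raises LookupError when a resource cannot be found.

```python
def ensure_nltk_data(resources):
    """Download each NLTK resource only if it is not already installed.

    `resources` maps a download id (e.g. "stopwords") to the path that
    nltk.data.find() expects (e.g. "corpora/stopwords").
    Returns the list of ids that had to be downloaded.
    """
    import nltk  # imported lazily so this module loads even without nltk

    downloaded = []
    for package_id, resource_path in resources.items():
        try:
            # Raises LookupError if the resource is not on the NLTK data path.
            nltk.data.find(resource_path)
        except LookupError:
            nltk.download(package_id, quiet=True)
            downloaded.append(package_id)
    return downloaded


# Hypothetical mapping for the two resources mentioned above.
REQUIRED = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
}
```

In __init__.py you would then call ensure_nltk_data(REQUIRED) instead of the unconditional nltk.download() calls.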


As an example, pyleetspeak is a Python library that does just this.

First, install the library:

pip install -U pyleetspeak

Importing the library then downloads the NLTK files:

import pyleetspeak
pyleetspeak.__version__

Upvotes: 1

asmaier

Reputation: 11756

I managed to install the NLTK data in setup.py by overriding cmdclass with my own Install class:

from setuptools import setup, find_packages
from setuptools.command.install import install as _install


class Install(_install):
    def run(self):
        _install.do_egg_install(self)
        import nltk
        nltk.download("popular")

setup(
    # other options...
    cmdclass={'install': Install},
    install_requires=[
        'nltk',
    ],
    setup_requires=['nltk'],
)

It is important to call do_egg_install() in your run() method so that nltk gets installed before import nltk runs (see also the question "python setuptools install_requires is ignored when overriding cmdclass"). Also, don't forget to add nltk to setup_requires.

Upvotes: 14

transcranial

Reputation: 381

You can also automate the installation with a shell script. For example, after installing nltk with pip, run:

python -m nltk.downloader -d /usr/share/nltk_data wordnet
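If the rest of your tooling is in Python, the same command can be driven from a small script. This is only a sketch, assuming nltk is already installed for the interpreter that runs it; the helper names build_download_command and download_nltk_data are hypothetical.

```python
import subprocess
import sys


def build_download_command(resources, target_dir=None):
    """Build the `python -m nltk.downloader` command line as a list."""
    # sys.executable keeps the download tied to the current interpreter.
    cmd = [sys.executable, "-m", "nltk.downloader"]
    if target_dir is not None:
        cmd += ["-d", target_dir]  # same -d option as in the command above
    cmd += list(resources)
    return cmd


def download_nltk_data(resources, target_dir=None):
    """Run the downloader; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_download_command(resources, target_dir), check=True)
```

Calling download_nltk_data(["wordnet"], "/usr/share/nltk_data") runs the same command as the one shown above.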

Upvotes: 3
