vinita

Reputation: 597

understanding nltk with python

My NLTK data is in ~/nltk_data/corpora/words/ (en, en-basic, README).

According to __init__.py inside ~/lib/python2.7/site-packages/nltk/corpus, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

from nltk.corpus import brown
print brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

This __init__.py has

words = LazyCorpusLoader(
    'words', WordListCorpusReader, r'(?!README|\.).*')

  1. So when I write from nltk.corpus import words, am I importing the 'words' function from the __init__.py that resides in the directory python2.7/site-packages/nltk/corpus?

  2. Also why does this happen:

     import nltk.corpus.words
     ImportError: No module named words
     from nltk.corpus import words
     # WORKS FINE
    
  3. The "brown" corpus resides inside ~/nltk_data/corpora (and not in nltk/corpus). So why does this command work?

    from nltk.corpus import brown
    

    Shouldn't it be this?

    from nltk_data.corpora import brown
    

Upvotes: 1

Views: 1014

Answers (2)

badc0re

Reputation: 3533

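1.] Yes. Note that words is not a function: that __init__.py creates it as a LazyCorpusLoader object. The class is defined in nltk/corpus/util.py, where its docstring gives the following description: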

"""
    A proxy object which is used to stand in for a corpus object
    before the corpus is loaded.  This allows NLTK to create an object
    for each corpus, but defer the costs associated with loading those
    corpora until the first time that they're actually accessed.

    The first time this object is accessed in any way, it will load
    the corresponding corpus, and transform itself into that corpus
    (by modifying its own ``__class__`` and ``__dict__`` attributes).

    If the corpus can not be found, then accessing this object will
    raise an exception, displaying installation instructions for the
    NLTK data package.  Once they've properly installed the data
    package (or modified ``nltk.data.path`` to point to its location),
    they can then use the corpus object without restarting python.
    """

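The self-replacing behaviour described in that docstring can be sketched in a few lines. This is only a minimal illustration of the idea, not NLTK's actual code: LazyProxy, WordList and load_words are hypothetical names made up for the example.

```python
class WordList(object):
    """Hypothetical stand-in for a corpus reader; not an NLTK class."""

    def __init__(self, words):
        self._words = list(words)

    def words(self):
        return self._words


class LazyProxy(object):
    """Hypothetical proxy that builds the real object on first access."""

    def __init__(self, factory):
        self._factory = factory

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, which is the
        # case for every reader attribute until the real object is loaded.
        if name == "_factory":
            # Guard against infinite recursion if _factory is missing.
            raise AttributeError(name)
        real = self._factory()
        # Become the real object, as the docstring describes.
        self.__dict__ = real.__dict__
        self.__class__ = real.__class__
        return getattr(self, name)


load_calls = []


def load_words():
    # Record each load so we can see when (and how often) it happens.
    load_calls.append("loaded")
    return WordList(["The", "Fulton", "County"])


words = LazyProxy(load_words)
loaded_before = len(load_calls)      # creating the proxy loads nothing
first_words = words.words()          # first access triggers the load
class_after = type(words).__name__   # the proxy now reports the real class
```

After the first access the proxy is an ordinary WordList, so later calls never go through __getattr__ again and the data is loaded only once.
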
3.] nltk_data is only the folder that holds the data. That does not mean the module lives in that folder too (the corpus files are downloaded separately, as the nltk_data collection).

NLTK has built-in support for dozens of corpora and trained models, as listed below. To use these within NLTK we recommend that you use the NLTK corpus downloader, >>> nltk.download()

Upvotes: 0

viraptor

Reputation: 34205

Re. point 2: You can import either a module (import module.submodule), or an object from a module (from module.submodule import variable). While you can treat a module as a variable, because it actually is a variable in that scope (from module import submodule), it doesn't work the other way. That's why when you try doing import module.submodule.variable, it fails.
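
The same behaviour can be reproduced with the standard library alone, so it is not specific to NLTK. In this small sketch, json.decoder plays the role of nltk.corpus and JSONDecoder the role of words; the helper can_import_as_module is a hypothetical name introduced for the example.

```python
import importlib


def can_import_as_module(dotted_name):
    """Return True if `import dotted_name` would succeed."""
    try:
        importlib.import_module(dotted_name)
        return True
    except ImportError:
        return False


# json.decoder is a module, so the dotted import form works.
module_ok = can_import_as_module("json.decoder")

# JSONDecoder is a class defined in json.decoder, not a module,
# so the dotted import form raises ImportError.
object_ok = can_import_as_module("json.decoder.JSONDecoder")

# The from-import form accepts any attribute of the module.
from json.decoder import JSONDecoder
```

Here module_ok is True and object_ok is False, which matches what happens with import nltk.corpus.words versus from nltk.corpus import words.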

Re. point 3: Depends on what nltk.corpus does. Maybe it searches/loads the nltk_data for you automatically.
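
The docstring quoted in the other answer mentions nltk.data.path, which suggests that a list of data directories is searched at load time. A rough sketch of that kind of lookup, assuming a simple first-match search, might look like this. find_resource and the temporary directory layout are hypothetical and only mimic the idea; this is not NLTK's implementation.

```python
import os
import tempfile


def find_resource(resource, search_path):
    """Return the first existing location of `resource` under search_path."""
    for base in search_path:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError("Resource %r not found in %r" % (resource, search_path))


# Build a throwaway directory laid out like ~/nltk_data/corpora/brown.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "corpora", "brown"))

# The first entry does not exist, so the search falls through to data_dir.
search_path = [os.path.join(data_dir, "missing"), data_dir]
found = find_resource("corpora/brown", search_path)
```

The point is that the Python package (nltk/corpus) and the data directory (nltk_data) are located by two unrelated mechanisms: the import system finds the former, and a data search path finds the latter.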

Upvotes: 2
