njfrazie

Reputation: 91

Python 3.5: NLTK Download Default URL will not change

I've updated DEFAULT_URL in downloader.py, but I'm still getting the following error. I originally tried just nltk.download(): the file browser showed the updated URL, but when I tried to download, it still reverted to the GitHub site.

DEFAULT_URL = 'http://nltk.org/nltk_data/'


import nltk
nltk.set_proxy('proxyaddress', user=None)
dl = nltk.downloader.Downloader("http://nltk.org/nltk_data/")
dl.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    |
[nltk_data]    | Downloading package abc to C:\nltk_data...
[nltk_data]    | Error downloading 'abc' from
[nltk_data]    |     <https://raw.githubusercontent.com/nltk/nltk_data
[nltk_data]    |     /gh-pages/packages/corpora/abc.zip>:   <urlopen
[nltk_data]    |     error [Errno 11004] getaddrinfo failed>

Why is this still defaulting to raw.githubusercontent.com/nltk/nltk_data?

Upvotes: 0

Views: 3050

Answers (1)

alexis

Reputation: 50220

The problem comes from your proxy. I can't say what's wrong with your proxy configuration, but initializing a downloader with a custom download URL works as intended; there is no need to modify the nltk source in nltk/downloader.py:

dl = nltk.downloader.Downloader("http://example.com/my_corpus_data/index.xml")

Note that the custom url must resolve to an XML document describing the downloadable resources, in the format expected by the nltk; the code in your question points to the human-readable list at http://nltk.org/nltk_data, which will just result in an error. (Presumably your real code uses a different URL, and different code around the proxy settings.)
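For illustration, here is a minimal sketch of the intended usage, assuming http://example.com/my_corpus_data/index.xml is a placeholder for a server that actually hosts an index.xml in the nltk_data format:

import nltk

# Placeholder index URL -- it must serve an index.xml in the nltk_data format,
# analogous to the official index hosted on GitHub.
dl = nltk.downloader.Downloader("http://example.com/my_corpus_data/index.xml")

# Packages are then fetched from the locations listed in that index,
# not from the hard-coded DEFAULT_URL in downloader.py.
dl.download('abc')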

Anyway, the problem has to be in your proxy, or in the way you use it. The nltk's set_proxy function just calls a couple of functions from urllib.request to declare the proxy. It never comes near the nltk's downloader module, so there's no way it could affect the downloader's defaults.
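To make that concrete, set_proxy amounts to roughly the following urllib.request calls (a sketch, not the exact nltk source; the proxy address and credentials are placeholders):

from urllib.request import (ProxyHandler, ProxyBasicAuthHandler,
                            HTTPPasswordMgrWithDefaultRealm,
                            build_opener, install_opener)

proxy = 'http://proxyaddress:8080'   # placeholder proxy URL
handlers = [ProxyHandler({'http': proxy, 'https': proxy})]

user, password = None, ''            # fill in if the proxy requires authentication
if user is not None:
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, proxy, user, password)
    handlers.append(ProxyBasicAuthHandler(password_mgr))

# Install a global opener; every urllib.request call (including the
# downloader's) now goes through the proxy.
install_opener(build_opener(*handlers))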

Upvotes: 1
