Reputation: 91
I've updated the DEFAULT_URL in downloader.py and I'm still getting the following error. I originally tried just nltk.downloader() and the file browser updated, but when I tried to download, it still reverted to the GitHub site.
DEFAULT_URL = 'http://nltk.org/nltk_data/'
import nltk
nltk.set_proxy('proxyaddress', user=None)
dl = nltk.downloader.Downloader("http://nltk.org/nltk_data/")
dl.download('all')
[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to C:\nltk_data...
[nltk_data] | Error downloading 'abc' from
[nltk_data] | <https://raw.githubusercontent.com/nltk/nltk_data
[nltk_data] | /gh-pages/packages/corpora/abc.zip>: <urlopen
[nltk_data] | error [Errno 11004] getaddrinfo failed>
Why is this still defaulting to raw.githubusercontent.com/nltk/nltk_data?
Upvotes: 0
Views: 3050
Reputation: 50220
The problem comes from your proxy. I can't say what's wrong with your proxy configuration, but initializing a downloader with a custom download URL works as intended (there is no need to modify the nltk source in nltk/downloader.py):
dl = nltk.downloader.Downloader("http://example.com/my_corpus_data/index.xml")
Note that the custom URL must resolve to an XML document describing the downloadable resources, in the format expected by nltk's downloader; the code in your question points to the human-readable list at http://nltk.org/nltk_data, which will just result in an error. (Presumably your real code uses a different URL, and different code around the proxy settings.)
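If you want to sanity-check a custom server before pulling everything, here is a minimal sketch. The index URL is a hypothetical placeholder (it must serve nltk's XML package index, not an HTML page), and downloading 'abc' assumes your index defines a package with that id:
import nltk.downloader

# Hypothetical index URL; must point at an nltk-format XML package index.
dl = nltk.downloader.Downloader("http://example.com/my_corpus_data/index.xml")

# Listing the packages forces the downloader to fetch and parse the
# index, so this fails fast if the URL doesn't serve valid XML.
for pkg in dl.packages():
    print(pkg.id, pkg.url)

dl.download('abc')  # fetch a single package from the custom server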
Anyway, the problem has to be in your proxy, or in the way you use it. nltk's set_proxy function just calls a couple of functions from urllib.request to declare the proxy. It never comes near nltk's downloader module, so there's no way it could affect the downloader's defaults.
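For reference, a rough sketch of the kind of urllib.request calls set_proxy makes under the hood (the proxy address and credentials are placeholders, and the exact logic in nltk's source may differ):
from urllib.request import (ProxyHandler, ProxyBasicAuthHandler,
                            HTTPPasswordMgrWithDefaultRealm,
                            build_opener, install_opener)

# Route http and https traffic through the proxy (placeholder address).
proxy = 'http://proxyaddress:8080'
handlers = [ProxyHandler({'http': proxy, 'https': proxy})]

# Optional proxy credentials; skipped when user is None.
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, proxy, 'user', 'password')
handlers.append(ProxyBasicAuthHandler(password_mgr))

# Install a global opener: every subsequent urlopen() call, including
# the ones the downloader makes, goes through the proxy. Nothing here
# touches the downloader's default URL.
install_opener(build_opener(*handlers))
This is why a bad proxy address surfaces as a getaddrinfo failure during download: the transport is broken, but the URLs the downloader tries are unaffected.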
Upvotes: 1