Moni

Reputation: 949

BertTokenizer.from_pretrained errors out with "Connection error"

I am trying to download the BERT tokenizer from Hugging Face.

I am executing:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Error:

<Path>\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1663                         resume_download=resume_download,
   1664                         local_files_only=local_files_only,
-> 1665                         use_auth_token=use_auth_token,
   1666                     )
   1667 

<Path>\file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1140             user_agent=user_agent,
   1141             use_auth_token=use_auth_token,
-> 1142             local_files_only=local_files_only,
   1143         )
   1144     elif os.path.exists(url_or_filename):

<Path>\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   1347                 else:
   1348                     raise ValueError(
-> 1349                         "Connection error, and we cannot find the requested files in the cached path."
   1350                         " Please try again or make sure your Internet connection is on."
   1351                     )

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Based on a similar discussion on GitHub in Hugging Face's repo, I gather that the file the above call wants to download is: https://huggingface.co/bert-base-uncased/resolve/main/config.json

While I can open that JSON file perfectly well in my browser, I cannot download it via requests. The error I get is:

>>> import requests as r
>>> r.get('https://huggingface.co/bert-base-uncased/resolve/main/config.json')
...
requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-uncased/resolve/main/config.json (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Examining the certificate served for https://huggingface.co/bert-base-uncased/resolve/main/config.json, I see that it is signed by my IT department rather than by the standard CA root I would expect to find. Based on the discussion here, it seems plausible for an SSL proxy to do something like this.

My IT department's certificate is in the trusted authorities list, but requests does not seem to consult that list when deciding which certificates to trust.

Taking a cue from a Stack Overflow discussion on how to let requests trust a self-signed certificate, I also tried appending the root certificate shown for huggingface.co to cacert.pem (the file reported by curl-config --ca) and pointing REQUESTS_CA_BUNDLE at that file:

export REQUESTS_CA_BUNDLE=/mnt/<path>/wsl-anaconda/ssl/cacert.pem

But it did not help at all.

How can I tell requests that it is OK to trust my IT department's certificate?

P.S.: In case it matters, I am working on Windows and see the same problem in WSL as well.

Upvotes: 4

Views: 12573

Answers (1)

Moni

Reputation: 949

I eventually got everything to work and am sharing the solution here in case it is useful to anyone else in the future.

The solution is quite simple. It is something I had tried initially, but I had made a minor mistake along the way. Here it is:

  1. Open the URL (a huggingface.co URL in my case) in a browser and view the certificate that the site presents.
    a. In most browsers (Chrome / Firefox / Edge), you can get to it by clicking the "Lock" icon in the address bar.

  2. Save all the certificates in the chain, all the way up to the root certificate.
    a. I think that, technically, saving just the root certificate should also work, but I have not tried that. I may update this if I get around to trying it. If you happen to try it before me, please do comment.

  3. Follow the steps in this Stack Overflow answer to locate the CA bundle, open it in an editor, and append the certificates saved in the previous step.
    a. The original CA bundle file has heading lines before each certificate, naming the CA root the certificate belongs to. These are not needed for the certificates we add. I had done this step on my first attempt as well, and I guess a stray extra space, carriage return, etc. may have caused it not to work for me then.

  4. In my Python program, I set the environment variable to point to the updated CA bundle:

    import os
    os.environ['REQUESTS_CA_BUNDLE'] = 'path/cacert.crt'
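
For anyone who would rather script steps 3 and 4 than edit the bundle by hand, here is a minimal sketch of the idea, not the exact procedure I followed. It assumes the certificates from step 2 were exported in PEM (Base64) format. The function names `build_ca_bundle` and `use_ca_bundle` and all file names are made up for illustration. It writes a separate combined file instead of modifying the original bundle, and it strips stray whitespace around each certificate, which is the kind of thing I suspect tripped me up earlier.

```python
import os
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def build_ca_bundle(base_bundle, extra_certs, out_path):
    """Write the base bundle followed by each extra PEM certificate to out_path.

    The original bundle is left untouched. Leading and trailing whitespace
    around every piece is stripped so no stray blank lines end up in between.
    """
    parts = [Path(base_bundle).read_text(encoding="utf-8").strip()]
    for cert in extra_certs:
        pem = Path(cert).read_text(encoding="utf-8").strip()
        if PEM_HEADER not in pem:
            # A DER (binary) export would need converting to PEM first.
            raise ValueError(f"{cert} does not look like a PEM certificate")
        parts.append(pem)
    Path(out_path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return str(out_path)


def use_ca_bundle(bundle_path):
    """Point requests at the given bundle for the current process only."""
    if not os.path.isfile(bundle_path):
        raise FileNotFoundError(bundle_path)
    os.environ["REQUESTS_CA_BUNDLE"] = str(bundle_path)


# Example usage (all paths are hypothetical):
# bundle = build_ca_bundle("cacert.pem", ["it-root.pem"], "combined-ca.pem")
# use_ca_bundle(bundle)
```

If the certifi package is installed, `certifi.where()` returns the path of its bundle and could be passed as `base_bundle`.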

One might ask: most Python packages use "requests" to make such GET calls, and "requests" uses the certificates provided by the "certifi" package, so why not find the bundle that certifi points to and edit it directly? The issue with that is that whenever you update a package using conda, certifi may get updated as well, and your changes will be overwritten. Hence, I found setting the environment variable at runtime to be the better option.
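
Since this approach relies on an environment variable and a hand-edited file, a quick sanity check before making any requests can save some debugging. The following is a rough sketch only; `ca_bundle_problem` is a hypothetical helper, not part of requests. It uses the standard library `ssl` module to ask OpenSSL to parse the file, which should catch a missing file or a bundle with no readable certificates, though it does not prove that the right root certificate is in there.

```python
import os
import ssl


def ca_bundle_problem(env_var="REQUESTS_CA_BUNDLE"):
    """Return a short description of what is wrong, or None if the bundle loads."""
    path = os.environ.get(env_var)
    if not path:
        return f"{env_var} is not set"
    if not os.path.isfile(path):
        return f"{path} does not exist"
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # Asks OpenSSL to parse the PEM file; raises if it cannot be loaded.
        context.load_verify_locations(cafile=path)
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return f"{path} could not be loaded: {exc}"
    return None


# Example usage:
# problem = ca_bundle_problem()
# if problem:
#     print("CA bundle issue:", problem)
```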

Cheers

Upvotes: 9
