Will

Reputation: 1622

Python script to fetch URL protected by DES/Kerberos

I have a Python script that does an automatic download from a URL once a day.

Recently the authentication protecting the URL was changed. To get it to work with Internet Explorer I had to enable DES for Kerberos by setting SupportedEncryptionTypes to 0x7FFFFFFF in a registry entry somewhere. IE now prompts me for my domain/user/password when I browse to the site.

My Python code that was working before is:

  def __build_ntlm_opener(self):
    passman = HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, self.answers_url, self.ntlm_username, self.ntlm_password)

    ntlm_handler = HTTPNtlmAuthHandler(passman)

    opener = urllib.request.build_opener(ntlm_handler)
    opener.addheaders = [
        #('User-agent', 'Mozilla/5.0 (Windows NT 6.0; rv:5.0) Gecko/20100101 Firefox/5.0')
        ('User-agent', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
    ]

    return opener

Now the code is failing with a simple 401 when using the opener:

urllib.error.HTTPError: HTTP Error 401: Unauthorized
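One way to narrow this down is to look at which authentication schemes the server offers in the 401 response. The sketch below is only an illustration: `offered_auth_schemes` and `probe` are hypothetical helper names, and it assumes the server sends one `WWW-Authenticate` header per scheme (as IIS typically does) rather than a single comma-separated header.

```python
import urllib.error
import urllib.request


def offered_auth_schemes(header_values):
    """Return the scheme name (first token) of each WWW-Authenticate value."""
    schemes = []
    for value in header_values:
        parts = value.split(None, 1)  # split off the scheme from its parameters
        if parts:
            schemes.append(parts[0])
    return schemes


def probe(url):
    """Request url without credentials and list the schemes offered on a 401."""
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # get_all returns None when the header is absent
            values = err.headers.get_all("WWW-Authenticate") or []
            return offered_auth_schemes(values)
        raise
    return []  # no authentication challenge was issued
```

If the list contains only `Negotiate` and no `NTLM`, the server is probably expecting Kerberos (or NTLM wrapped in SPNEGO), which would explain why a plain NTLM handler no longer gets through.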

I don't know much about Kerberos or DES, and from what I've read so far I can't tell whether urllib supports them.

Is there any third-party library or trick I can use to get this working again?

Upvotes: 2

Views: 1897

Answers (1)

aychedee

Reputation: 25579

You could try using Selenium's WebDriver to drive a browser directly. I do that sometimes when I want to scrape sites that are dynamically generated. Here's a code example for opening a page and entering a password:

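from selenium import webdriver
from selenium.webdriver.common.by import By

b = webdriver.Chrome()
b.get('http://www.example.com')
# Fill in the login form; the element IDs and link text are placeholders.
username_field = b.find_element(By.ID, 'username')
username_field.send_keys('my_username')
password_field = b.find_element(By.ID, 'password')
password_field.send_keys('secret')
# click() returns None, so there is nothing useful to assign here.
b.find_element(By.LINK_TEXT, 'login').click()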

That would get you past a typical login screen of a web site. Then

b.page_source

will give you the source code for the page, even if it was mainly generated with JavaScript.
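Once you have the page source as a string, the standard library can pull things out of it. Here is a minimal sketch that collects the link targets from the page, for example to find the file you want to download. `LinkCollector` and `extract_links` are made-up names for illustration, and the HTML in the usage line is invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # skip anchors without an href or with an empty one
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_source):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(page_source)
    collector.close()
    return collector.links
```

You would call it as `extract_links(b.page_source)` after the login step.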

The Selenium source code is also easy to read if you want to see what the element methods do: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webelement.py

Upvotes: 1
