Marko
Marko

Reputation: 729

Error downloading HTTPS webpage

I am downloading some data from some https webpage https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html, so I get this error due to HTTPS. When I change the webpage manually to HTTP, it downloads fine. I was looking for similar examples to fix this but I did not find any. Do you have some idea what to do ?

Traceback (most recent call last):
  File "down.py", line 34, in <module>
    soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
  File "g:\python\Lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "g:\python\Lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "g:\python\Lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "g:\python\Lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "g:\python\Lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "g:\python\Lib\httplib.py", line 855, in send
    self.connect()
  File "g:\python\Lib\httplib.py", line 1274, in connect
    server_hostname=server_hostname)
  File "g:\python\Lib\ssl.py", line 352, in wrap_socket
    _context=self)
  File "g:\python\Lib\ssl.py", line 579, in __init__
    self.do_handshake()
  File "g:\python\Lib\ssl.py", line 808, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:5
90)

This is my programme:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8  
#
# DOWNLOADER
# To grab the text content of webpages and save it to TinyDB database.


import re, time, urllib, tinydb
from bs4 import BeautifulSoup 


start_time = time.time()



#Open file with urls.
with open("G:/myVE/vacancies/urls2.csv") as f:
    lines = f.readlines()

#Open file to write HTML to.
with open("G:/myVE/downloader/urls2_html.txt", 'wb') as g:

    #We parse the content of url file to get just urls without the first line and without the text.
    for line in lines[1:len(lines)]:

        #Read the url from the file.
        #url = line.split(",")[0]
        url = line

        print "test"

        #Read the HTML of url
        soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")

        print url
        #Mark of new HTML in HTML file.
        g.write("\n\nNEW HTML\n\n")

        #Write new HTML to file.
        g.write(str(soup))







print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)






"""
#We read HTML of the employment webpage that we intend to parse.
soup = BeautifulSoup(urllib.urlopen('http://www.simplybusiness.co.uk/about-us/careers/jobs/').read(), "html.parser")



#We write HTML to a file.
with open("E:/analitika/SURS/tutorial/tutorial/html.txt", 'wb') as f:
   f.write(str(soup)) 



print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)
"""

Thank you!

Upvotes: 0

Views: 1531

Answers (1)

dstudeba
dstudeba

Reputation: 9038

You should use the requests library, see http://docs.python-requests.org/en/latest/user/advanced/#ssl-cert-verification as a reference.

Updated to add

Now with your url here is an example with the requests library.

import requests

url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html" 
r = requests.get(url, verify=True)
print(r.text)

Here is an example with beautifulsoup and Python 3.3 which also seems to work.

import urllib
from bs4 import BeautifulSoup

url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html" 
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(soup)

Upvotes: 1

Related Questions