Kyle Hikalea

Reputation: 33

Python Scraper - Socket Error breaks script if target is 404'd

Encountered an error while building a web scraper that compiles data and outputs it in XLS format; when testing against a list of domains I wish to scrape, the program falters when it receives a socket error. I'm hoping for an 'if' statement that would skip parsing a broken website and continue through my while-loop. Any ideas?

workingList = xlrd.open_workbook(listSelection)
workingSheet = workingList.sheet_by_index(0)
destinationList = xlwt.Workbook()
destinationSheet = destinationList.add_sheet('Gathered')
startX = 1
startY = 0
while startX != 21:
    workingCell = workingSheet.cell(startX,startY).value
    print ''
    print ''
    print ''
    print workingCell
    #Setup
    preSite = 'http://www.'+workingCell
    theSite = urlopen(preSite).read()
    currentSite = BeautifulSoup(theSite)
    destinationSheet.write(startX,0,workingCell)

And here's the error:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    homeMenu()
  File "C:\Python27\farming.py", line 31, in homeMenu
    openList()
  File "C:\Python27\farming.py", line 79, in openList
    openList()
  File "C:\Python27\farming.py", line 83, in openList
    openList()
  File "C:\Python27\farming.py", line 86, in openList
    homeMenu()
  File "C:\Python27\farming.py", line 34, in homeMenu
    startScrape()
  File "C:\Python27\farming.py", line 112, in startScrape
    theSite = urlopen(preSite).read()
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 342, in open_http
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 811, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 773, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 754, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

Upvotes: 1

Views: 678

Answers (1)

John Machin

Reputation: 83002

Ummm, that looks like the error I get when my internet connection is down. HTTP 404 errors are what you get when you do have a connection but the URL you specify can't be found.

There's no if statement to handle exceptions; you need to "catch" them using the try/except construct.

Update: Here's a demonstration:

import urllib

def getconn(url):
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """
for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args

Output:

qwerty
failed
IOError
[Errno 2] The system cannot find the file specified: 'qwerty'
(2, 'The system cannot find the file specified')

http://www.foo.bar.net
failed
IOError
[Errno socket error] [Errno 11004] getaddrinfo failed
('socket error', gaierror(11004, 'getaddrinfo failed'))

http://www.google.com
connected; HTTP response is 200

http://www.google.com/nonesuch
connected; HTTP response is 404

Note that so far we have just opened the connection. Now what you need to do is check the HTTP response code and decide whether there is anything worth retrieving with conn.read().
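To show how this pattern folds back into the question's while-loop, here is a minimal sketch. The names scrape_all and fake_fetch are mine, not from the question: fake_fetch is a hypothetical stand-in for urlopen so the example runs without a network connection; in the real script you would pass urlopen (or a wrapper around it) as the fetch argument.

```python
def scrape_all(domains, fetch):
    """Fetch each domain, skipping any that raise IOError.

    `fetch` is whatever callable opens a URL and returns its content
    (e.g. a wrapper around urllib's urlopen). Domains that fail to
    resolve or connect are skipped instead of crashing the loop.
    """
    results = {}
    for domain in domains:
        url = 'http://www.' + domain
        try:
            page = fetch(url)
        except IOError as e:
            # Broken site: report it and move on, as the asker wanted.
            print('skipping %s: %s' % (url, e))
            continue
        results[domain] = page
    return results

def fake_fetch(url):
    """Hypothetical fetcher: fails for one domain, succeeds otherwise."""
    if 'bad' in url:
        raise IOError('getaddrinfo failed')
    return '<html>ok</html>'

print(scrape_all(['good.com', 'bad.example', 'also-good.com'], fake_fetch))
```

The bad domain is reported and skipped while the other two are collected; the same structure drops straight into the asker's while-loop over spreadsheet rows.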

Upvotes: 5
