Michael

Reputation: 522

Python urllib.urlopen IOError

I have the following lines of code in a function:

sock = urllib.urlopen(url)
html = sock.read()
sock.close()

They work fine when I call the function by hand. However, when I call the function in a loop (using the same URLs as before), I get the following error:

Traceback (most recent call last):
  File "./headlines.py", line 256, in <module>
    main(argv[1:])
  File "./headlines.py", line 37, in main
    write_articles(headline, output_folder + "articles_" + term +"/")
  File "./headlines.py", line 232, in write_articles
    print get_blogs(headline, 5)
  File "/Users/michaelnussbaum08/Documents/College/Sophmore_Year/Quarter_2/Innovation/Headlines/_code/get_content.py", line 41, in get_blogs
    sock = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 203, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 314, in open_http
    if not host: raise IOError, ('http error', 'no host given')
IOError: [Errno http error] no host given

Any ideas?

Edit: more code:

def get_blogs(term, num_results):
    search_term = term.replace(" ", "+")
    print "search_term: " + search_term
    url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'
    print "url: " +url  

    #error occurs on line below

    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()

def write_articles(headline, output_folder, num_articles=5):

    #calls get_blogs

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    output_file = output_folder+headline.strip("\n")+".txt"
    f = open(output_file, 'a')
    articles = get_articles(headline, num_articles)
    blogs = get_blogs(headline, num_articles)


    #NEW FUNCTION
    #the loop that calls write_articles
    for term in trend_list:
        if do_find_max == True:
            fill_search_term(term, output_folder)
        headlines = headline_process(term, output_folder, max_headlines, do_find_max)
        for headline in headlines:
            try:
                write_articles(headline, output_folder + "articles_" + term +"/")
            except UnicodeEncodeError:
                pass

Upvotes: 3

Views: 14769

Answers (3)

user1994702

Reputation:

I had this problem when a variable I was concatenating into the URL had a newline character at the end. In your case that variable is search_term:

url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'

So make sure you strip it first:

search_term = search_term.strip()

You might also want to do

search_term = urllib2.quote(search_term)

to make sure your string is safe to use in a URL.
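
The two steps can be combined in one small helper. This is only a minimal sketch, assuming Python 3, where the quoting functions live in urllib.parse (in the Python 2 code from the question the counterpart would be urllib.quote_plus). The helper name build_search_url and the constant BASE_URL are hypothetical; the URL itself is the one from the question.

```python
from urllib.parse import quote_plus

# Base of the feed URL used in the question
BASE_URL = "http://blogsearch.google.com/blogsearch_feeds"

def build_search_url(term):
    # Drop leading/trailing whitespace, including a trailing newline
    cleaned = term.strip()
    # quote_plus percent-encodes unsafe characters and turns spaces into '+'
    query = quote_plus(cleaned)
    return BASE_URL + "?hl=en&q=" + query + "&ie=utf-8&num=10&output=rss"

print(build_search_url("some headline\n"))
```

Because quote_plus already converts spaces to '+', the separate term.replace(" ", "+") step would no longer be needed with this approach; doing both would encode the '+' signs a second time.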

Upvotes: 6

Eddy Pronk

Reputation: 6695

Use urllib2 instead if you don't want to handle reading on a per-block basis yourself. This probably does what you expect:

import urllib2
req = urllib2.Request(url='http://stackoverflow.com/')
f = urllib2.urlopen(req)
print f.read()
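
For readers on Python 3, where urllib2 no longer exists, a rough counterpart of the snippet above might look like the following. It is a sketch that assumes urllib.request; the function name fetch and the default timeout value are hypothetical choices, not part of the original answer.

```python
from urllib.request import urlopen

def fetch(url, timeout=10):
    # The response object is a context manager, so it is closed
    # automatically even if read() raises an exception
    with urlopen(url, timeout=timeout) as response:
        # read() returns the whole body as bytes
        return response.read()
```

Calling fetch('http://stackoverflow.com/') would return the page body as bytes, which can then be decoded as needed.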

Upvotes: 1

unutbu

Reputation: 879491

In your function's loop, right before the call to urlopen, perhaps put a print statement:

print(url)
sock = urllib.urlopen(url)

This way, when you run the script and get the IOError, you will see the url which is causing the problem. The error "no host given" can be replicated if url equals something like 'http://'...
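
The same check can be done before the request is made. The sketch below assumes Python 3 and urllib.parse; the function name describe_url_problem is hypothetical. It flags a URL that has no host, and also one that contains a newline or tab, which print(url) alone would not make obvious.

```python
from urllib.parse import urlsplit

def describe_url_problem(url):
    """Return a short description of an obvious problem, or None."""
    # Stray whitespace control characters usually come from unstripped input
    if any(ch in url for ch in "\r\n\t"):
        return "contains a newline or tab"
    # netloc is the host part; it is empty for something like 'http://'
    if not urlsplit(url).netloc:
        return "no host"
    return None

for candidate in ["http://", "http://example.com/?q=a\n&b=1", "http://example.com/"]:
    # repr() shows escape sequences such as \n explicitly
    print(repr(candidate), describe_url_problem(candidate))
```

Printing repr(url) instead of url is the useful part when debugging: a newline hidden in the middle of the string shows up as \n.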

Upvotes: 1
