Jesus Christ

Reputation: 21

Image scraper: urllib2.URLError: <urlopen error no host given>

I understand that this error is raised because there is no URL (host) to request, but I can't figure out why that's happening here.
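For reference, urllib2 seems to raise this exact error whenever the URL it is handed has no host part at all. A minimal reproduction (the URL here is just a made-up example):

import urllib2

try:
    # "http:/example.jpg" has a scheme but an empty host part, which is
    # exactly the situation urllib2 complains about
    urllib2.urlopen("http:/example.jpg")
except urllib2.URLError as e:
    print e  # <urlopen error no host given>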

My code is a 4chan image scraper. It works on every board with no problem except "wg", the wallpapers/general board. For some reason, only on that board, it won't go to the next page to scrape the images, and it gives me the error "urllib2.URLError: <urlopen error no host given>".

I would really appreciate any help. I have no idea why this error only happens on wg; I'm thinking maybe it has to do with file size, but that doesn't really make sense given the error.

Here's my code (below), and also here's a link to my github: https://github.com/devinatoms/4chanScraper/blob/master/4chanScrape.py

##@author klorox


from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import collections

print"""

                )           )           (      *                   (         (                      (     
     )   (   ( /(  (     ( /(           )\ ) (  `   (              )\ )  (   )\ )   (               )\ )  
  ( /(   )\  )\()) )\    )\())         (()/( )\))(  )\ )          (()/(  )\ (()/(   )\          (  (()/(  
  )\())(((_)((_)((((_)( ((_)\           /(_)((_)()\(()/(           /(_)(((_) /(_)((((_)(  `  )  )\  /(_)) 
 ((_)\ )\___ _((_)\ _ )\ _((_)         (_)) (_()((_)/(_))_        (_)) )\___(_))  )\ _ )\ /(/( ((_)(_))   
| | (_((/ __| || (_)_\(_| \| |         |_ _||  \/  (_)) __|       / __((/ __| _ \ (_)_\(_((_)_\| __| _ \  
|_  _| | (__| __ |/ _ \ | .` |          | | | |\/| | | (_ |       \__ \| (__|   /  / _ \ | '_ \| _||   /  
  |_|   \___|_||_/_/ \_\|_|\_|         |___||_|  |_|  \___|       |___/ \___|_|_\ /_/ \_\| .__/|___|_|_\  
                                                                                         |_|              
                    written by klorox, some by Icewave                                                                                                                                                                                                                           
                                                                                         """



# Gather our HTML source code from the pages
def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'lxml')

# Main logic function, we use this to re-iterate through the pages
def main(url):
    image_name = "image"
    print url
    header = {'User-Agent': 'Mozilla/5.0'} 
    r = requests.get(url)
    html_content = r.text
    soup = BeautifulSoup(html_content, 'lxml')
    anchors = soup.findAll('a')
    links = [a['href'] for a in anchors if a.has_attr('href')]

# Grabs all the a anchors from the HTML source which contain our image links
    def get_anchors(links):
        for a in anchors:
            links.append(a['href'])
        return links

# Gather the raw links and sort them        
    raw_links = get_anchors(links)
    raw_links.sort()

# Parse out any duplicate links
    def get_duplicates(arr):
        dup_arr = arr[:]
        for i in set(arr):
            dup_arr.remove(i)       
        return list(set(dup_arr))   

# Define our list of new links and call the function to parse out duplicates
    new_elements = get_duplicates(raw_links)

# Get the image links from the raw links, make a request, then write them to a folder.
    def get_img():      
        for element in new_elements:
            if ".jpg" in str(element) or '.png' in str(element) or '.gif' in str(element):
                retries = 0
                passed = False
                while(retries < 3): 
                    try:
                        if "https:" not in element and "http:" not in element:
                            element = "http:"+element           
                        raw_img = urllib2.urlopen(element).read()
                        cntr = len([i for i in os.listdir(dirr) if image_name in i]) + 1
                        print("Saving img: " + str(cntr) + "  :      " + str(element) + " to: "+ dirr )
                        with open(dirr + image_name + "_"+ str(cntr)+".jpg", 'wb') as f:
                            f.write(raw_img)
                        passed = True
                        break
                    except urllib2.URLError, e:
                        retries += 1
                        print "Failed on", element, "(Retrying", retries, ")"
                if not passed:
                    print "Failed on ", element, "skipping..."

# Call our image writing function           
    get_img()

# Ask the user which board they would like to use
print """Boards: [a / b / c / d / e / f / g / gif / h / hr / k / m / o / p / r / s / t / u / v / vg / vr / w / wg] [i / ic] [r9k] [s4s] [cm / hm / lgbt / y] [3 / aco / adv / an / asp / biz / cgl / ck / co / diy / fa / fit / gd / hc / his / int / jp / lit / mlp / mu / n / news / out / po / pol / qst / sci / soc / sp / tg / toy / trv / tv / vp / wsg / wsr / x]""" 
print "\n"
board = raw_input("Enter the board letter (Example: b, p, w): ")
dirr = raw_input("Enter the working directory (USE DOUBLE SLASHES, e.g. C:\\\\Users\\\\Username\\\\Desktop\\\\Folder\\\\): ")
# Define our starting page number and first try value           
page = 2
firstTry = True

# Check if this is the first iteration
if firstTry == True:
    url = "http://boards.4chan.org/"+board+"/"
    firstTry = False
    main(url)
    # After first iteration, this loop changes the url after each completed page by calling our main function again each time.
    while page <= 10 and page >= 2 and firstTry == False:
        firstTry = False
        url = "http://boards.4chan.org/"+board+"/"+ str(page) +"/"
        page = page + 1
        p = page - 1
        print("Page: " + str(p))
        main(url)

Upvotes: 0

Views: 168

Answers (1)

Jesus Christ

Reputation: 21

So never mind, I fixed it with a little help.

The solution was to wrap the request in a try/except, check whether the link already starts with http or https, and rewrite the URL appropriately. The error was probably caused by the server's anti-mass-request protection, but that's just an assumption on my part.
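Roughly, the idea looks like this (a sketch rather than my exact code; the helper names are just for illustration):

import urllib2

def normalize(link):
    # Image links come back either absolute or scheme-relative
    # (e.g. "//i.4cdn.org/wg/...").  Anything else has no host for
    # urllib2 to connect to, so skip it rather than crash.
    if link.startswith("http://") or link.startswith("https://"):
        return link
    if link.startswith("//"):
        return "http:" + link
    return None

def fetch(link, retries=3):
    link = normalize(link)
    if link is None:
        return None
    for attempt in range(retries):
        try:
            return urllib2.urlopen(link).read()
        except urllib2.URLError, e:
            print "Attempt", attempt + 1, "failed on", link, ":", e
    return None

In the scraper itself, the same check can go right before the urllib2.urlopen(element) call inside get_img().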

Upvotes: 1
