Sayan

Reputation: 3

Simple web-crawler in Python

I am teaching myself Python and came up with the idea of building a simple web-crawler engine. The code is below:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

However, I am always getting an HTTP 403 error.

Upvotes: 0

Views: 653

Answers (3)

RGH

Reputation: 19

You might need to add request headers or other authentication. Try adding a User-Agent header; in some cases this helps you avoid reCAPTCHA.

Example:

    User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
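
A minimal sketch of sending such a header with urllib.request (assuming page holds the URL you are about to fetch; the User-Agent string is just an example) might look like:

    import urllib.request

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/62.0.3202.94 Safari/537.36'}
    req = urllib.request.Request(page, headers=headers)  # attach the header to this request
    intpage = urllib.request.urlopen(req).read()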

Upvotes: 1

rmq

Reputation: 1

As others have said, the error is not caused by the code itself, but you may want to try a couple of things:

  • Try adding exception handlers, and maybe even ignore the problematic pages altogether for now, to make sure the crawler is otherwise working as expected:

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                import urllib.request
                import sys  # needed for sys.exit() below
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass   # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to your request. From the urllib.request docs:

This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this might help to get around some of the 403 errors:

    headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib.request.Request(page, headers=headers)
    intpage = urllib.request.urlopen(req).read()
    openpage = str(intpage)

Upvotes: 0

Arman Ordookhani

Reputation: 6537

An HTTP 403 error is not related to your code. It means the URL being crawled is forbidden to access. Most of the time it means the page is only available to logged-in users or to a specific user.


I actually ran your code and got a 403 with a creativecommons link. The reason is that urllib does not send a Host header by default, and you should add it manually to avoid the error (most servers check the Host header and decide which content to serve). You could also use the Requests Python package instead of the builtin urllib; it sends a Host header by default and is more Pythonic IMO.

I added a try-except clause to catch and log errors, then continue crawling the other links. There are a lot of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:  # stop when there is nothing left to crawl
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                print('got http error while crawling', page)
    return crawled
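
A rough sketch of the same loop using the requests package (reusing get_all_url and union from the question, and assuming requests is installed via pip; webcrawl_requests is just an illustrative name) could look something like this:

import requests

def webcrawl_requests(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                resp = requests.get(page)
                resp.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx responses
                union(tocrawl, get_all_url(resp.text))
                crawled.append(page)
            except requests.exceptions.RequestException as ex:
                print('got an error while crawling', page, ex)
    return crawled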

Upvotes: 1
