math783625

Reputation: 123

Python - Problems creating a list of URLs using BeautifulSoup

I am trying to make a Python crawler using BeautifulSoup, but I get an error saying that I am trying to write a non-string (or other character buffer) type to a file. From examining the program output, I found that my list contains many items that are None. Besides the None items, my list also has a lot of image links and other things that are not the page links I want. How can I add only the URLs to my list?

    import urllib
    from BeautifulSoup import *

    try:
        with open('url_file', 'r') as f:
            url_list = [line.rstrip('\n') for line in f]
        with open('old_file', 'r') as x:
            old_list = [line.rstrip('\n') for line in x]
    except:
        url_list = list()
        old_list = list()
        #for Testing
        url_list.append("http://www.dinamalar.com/")


    count = 0


    for item in url_list:
        try:
            count = count + 1
            if count > 5:
                break

            html = urllib.urlopen(item).read()
            soup = BeautifulSoup(html)
            tags = soup('a')

            for tag in tags:
                href = tag.get('href', None)
                if href in old_list:
                    continue
                url_list.append(href)


            old_list.append(item)
            #for testing
            print url_list
        except:
            continue

    with open('url_file', 'w') as f:
        for s in url_list:
            f.write(s)
            f.write('\n')


    with open('old_file', 'w') as f:
        for s in old_list:
            f.write(s)
            f.write('\n')

Upvotes: 1

Views: 104

Answers (1)

Padraic Cunningham

Reputation: 180540

First off, use bs4, not the no longer maintained BeautifulSoup 3. Your error happens because not all anchors have an href, so you end up trying to write None, which causes your error. Use find_all and set href=True so you only find anchor tags that have an href attribute:

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a", href=True)
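For instance, here is a minimal sketch (assuming bs4 is installed; the HTML snippet is made up for illustration) showing that href=True skips anchors without an href attribute, so you never collect None:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two anchors with an href, one without.
html = """
<a href="http://example.com/page1">one</a>
<a name="anchor-only">no href here</a>
<a href="/relative/page2">two</a>
"""

soup = BeautifulSoup(html, "html.parser")
# href=True filters out the anchor that has no href attribute,
# so tag["href"] is always a string, never None.
links = [tag["href"] for tag in soup.find_all("a", href=True)]
print(links)  # ['http://example.com/page1', '/relative/page2']
```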

Also, never use blanket except statements; always catch the errors you expect, and at least print them when they do occur. As far as "I also have a lot of images and things that are not links" goes: if you want to filter for certain links, you have to be more specific. Either look for the tags that contain what you are interested in (if possible), use a regex with href=re.compile("some_pattern"), or use a CSS selector:

    # hrefs starting with something
    "a[href^=something]"

    # hrefs that contain something
    "a[href*=something]"

    # hrefs ending with something
    "a[href$=something]"
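These selectors can be passed to soup.select. A short sketch (the HTML and the patterns are made-up examples, not rules for your site):

```python
from bs4 import BeautifulSoup

html = """
<a href="http://www.dinamalar.com/news1">news</a>
<a href="/images/photo.jpg">photo</a>
<a href="mailto:someone@example.com">mail</a>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only absolute http(s) links, dropping mailto: and relative paths.
http_links = [a["href"] for a in soup.select('a[href^="http"]')]
# Keep only links ending in .jpg, e.g. to find image links.
jpg_links = [a["href"] for a in soup.select('a[href$=".jpg"]')]
print(http_links)  # ['http://www.dinamalar.com/news1']
print(jpg_links)   # ['/images/photo.jpg']
```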

Only you know the structure of the html and what you want, so which approach you use is completely up to you.
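The regex approach mentioned above works the same way: pass a compiled pattern as the href argument and find_all keeps only anchors whose href matches it. A sketch (the pattern and URLs are hypothetical):

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="http://site.com/tamil/news1">article</a>
<a href="http://site.com/img.png">image</a>
"""

soup = BeautifulSoup(html, "html.parser")
# Hypothetical filter: keep only links whose href contains "/tamil/".
tags = soup.find_all("a", href=re.compile(r"/tamil/"))
hrefs = [t["href"] for t in tags]
print(hrefs)  # ['http://site.com/tamil/news1']
```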

Upvotes: 1
