I am trying to make a Python crawler using BeautifulSoup, but I receive an error that I am trying to write a non-string or other character buffer type to a file. From examining the program output, I found that my list contains many items that are None. Besides the None values, my list also contains a lot of image links and other things that are not page links. How can I add only the URLs to my list?
import urllib
from BeautifulSoup import *

try:
    with open('url_file', 'r') as f:
        url_list = [line.rstrip('\n') for line in f]
    f.close()
    with open('old_file', 'r') as x:
        old_list = [line.rstrip('\n') for line in f]
    f.close()
except:
    url_list = list()
    old_list = list()
    #for Testing
    url_list.append("http://www.dinamalar.com/")

count = 0
for item in url_list:
    try:
        count = count + 1
        if count > 5:
            break
        html = urllib.urlopen(item).read()
        soup = BeautifulSoup(html)
        tags = soup('a')
        for tag in tags:
            if tag in old_list:
                continue
            else:
                url_list.append(tag.get('href', None))
        old_list.append(item)
        #for testing
        print url_list
    except:
        continue

with open('url_file', 'w') as f:
    for s in url_list:
        f.write(s)
        f.write('\n')

with open('old_file', 'w') as f:
    for s in old_list:
        f.write(s)
First off, use bs4, not the no-longer-maintained BeautifulSoup 3. Your error happens because not all anchors have an href, so you end up trying to write None to the file. Use find_all and set href=True so you only find anchor tags that have an href attribute:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("a", href=True)
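To see the difference, here is a quick sketch with some made-up HTML (assuming bs4 is installed; the URL is a placeholder):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two anchors, only one of which has an href
html = '<a href="http://example.com">ok</a><a name="top">no href</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True filters out anchors without an href attribute,
# so tag["href"] can never be None here
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['http://example.com']
```

With `soup('a')` you would get both tags back, and `tag.get('href', None)` would return None for the second one, which is exactly what ends up in your list and breaks the file write.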
Also, never use blanket except statements; always catch the errors you expect, and at least print them when they do occur. As far as "I also have a lot of images and things that are not links" goes: if you want to filter for certain links, you have to be more specific. Either look for the tags that contain what you are interested in if possible, use a regex with href=re.compile("some_pattern"),
or use a css selector:
# hrefs starting with something
"a[href^=something]"
# hrefs that contain something
"a[href*=something]"
# hrefs ending with something
"a[href$=something]"
Only you know the structure of the html and what you want, so which approach you use is completely up to you.
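As a minimal sketch of both approaches, assuming bs4 is installed and using made-up placeholder HTML and URLs:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: one real page link, one image link
html = '''<a href="http://example.com/page.html">page</a>
<a href="http://example.com/pic.jpg">pic</a>'''
soup = BeautifulSoup(html, "html.parser")

# regex filter: keep only hrefs ending in .html
by_regex = [a["href"] for a in soup.find_all("a", href=re.compile(r"\.html$"))]

# css selector: the same filter with attribute-ends-with syntax
by_css = [a["href"] for a in soup.select('a[href$=".html"]')]

print(by_regex)  # ['http://example.com/page.html']
print(by_css)    # ['http://example.com/page.html']
```

Either way, the image links (.jpg, .png, etc.) are excluded before they ever reach your list, so there is nothing to clean up afterwards.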