Reputation: 71
I am trying to strip all of the tags from the links that I got from crawling.
Here is the code:
request = urllib2.Request("http://sport.detik.com/sepakbola/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'http://sport.detik.com/sepakbola/read/' in a['href']:
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a)
Unfortunately, it does not work: it fails with "expected string or buffer" in findall(). Is the loop variable not a string? Any help will be appreciated.
Thanks.
Upvotes: 0
Views: 193
Reputation: 9657
The indentation of the code is not correct here; please fix it. Then change the last line to:
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a['href'])
Here a is of type <class 'bs4.element.Tag'>, not a string, which is why you are getting the error. Change it to a['href'], which is a <type 'str'>.
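To see the difference without crawling anything, here is a minimal sketch. FakeTag is a hypothetical stand-in for a BeautifulSoup Tag (it only mimics the dict-style a['href'] access), and the href value is a made-up example, so treat this as an illustration rather than the real library behaviour.

```python
import re

# Same pattern as in the question, written as a raw string.
URL_RE = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


class FakeTag(object):
    """Hypothetical stand-in for a Tag: supports tag['attr'], but is not a string."""

    def __init__(self, attrs):
        self.attrs = attrs

    def __getitem__(self, key):
        return self.attrs[key]


# Made-up link, used only for illustration.
a = FakeTag({'href': 'http://sport.detik.com/sepakbola/read/example'})

# Passing the object itself: re.findall() only accepts strings, so it raises TypeError.
try:
    re.findall(URL_RE, a)
    raised = False
except TypeError:
    raised = True

# Passing the attribute value: this is a plain string, so the match works.
urls = re.findall(URL_RE, a['href'])
```

Here raised ends up True for the first call, while the second call returns a list containing the URL.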
Upvotes: 2
Reputation: 181077
a in your loop is not a string; it's a dictionary-like object (specifically, a BeautifulSoup.Tag). In your if statement you correctly get the href string from it for the comparison, but you don't do the same when matching the regex.
Simply passing the string a['href'] instead of the Tag a to the regex match will fix your runtime error:
for a in soup.findAll('a'):
    if 'http://sport.detik.com/sepakbola/read/' in a['href']:
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a['href'])
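As a side note, once you have the href string it is already the URL, so the regex may not be needed at all. Below is a rough offline sketch of the same filtering idea. It swaps BeautifulSoup for Python 3's standard html.parser so it runs without network access or third-party packages. LinkCollector, SAMPLE_HTML and the links in it are made up for illustration and are not from the real page.

```python
from html.parser import HTMLParser

PREFIX = 'http://sport.detik.com/sepakbola/read/'


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags that contain PREFIX."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; .get() tolerates a missing href.
        href = dict(attrs).get('href')
        if href and PREFIX in href:
            self.urls.append(href)


# Made-up markup standing in for the crawled page.
SAMPLE_HTML = '''
<a href="http://sport.detik.com/sepakbola/read/sample-article">Article</a>
<a href="http://sport.detik.com/other/">Other</a>
<a name="no-href">Anchor</a>
'''

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
urls = collector.urls
```

Note that urls is built up as a list here. In the loop above, urls is overwritten on every iteration, so only the last match survives unless you append to a list.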
Upvotes: 0