findAll() method does not work

Question

I am trying to remove all of the tags from the links that I got from crawling.

Here is the code:

import re
import urllib2

import BeautifulSoup

request = urllib2.Request("http://sport.detik.com/sepakbola/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

for a in soup.findAll('a'):
    if 'http://sport.detik.com/sepakbola/read/' in a['href']:
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a)

Unfortunately, it does not work: it says "expected string or buffer" in findAll(). Is the loop variable from the for loop not a string? Any help will be appreciated.

Thanks

Joachim Isaksson · Accepted Answer

a in your loop is not a string; it's a dictionary-like object (specifically, a BeautifulSoup.Tag). In your if statement you correctly get the href string from it to compare with, but when matching the regex you pass the Tag itself, and re.findall() only accepts a string.

Simply using the string a['href'] instead of the Tag a in the regex match will fix your runtime error:

for a in soup.findAll('a'):
  if 'http://sport.detik.com/sepakbola/read/' in a['href']:
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a['href'])
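
To see the difference without network access or a BeautifulSoup install, here is a minimal sketch. FakeTag is a hypothetical stand-in for BeautifulSoup.Tag that only mimics the tag['href'] lookup, and the sample links are made up for illustration. On Python 3 the same mistake is reported as "expected string or bytes-like object" rather than "expected string or buffer".

```python
import re

# Same pattern as in the question, compiled once.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

PREFIX = 'http://sport.detik.com/sepakbola/read/'


class FakeTag(object):
    """Hypothetical stand-in for BeautifulSoup.Tag: supports tag['attr'] only."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)

    def __getitem__(self, key):
        return self.attrs[key]


# Made-up sample links; a real page would supply these via soup.findAll('a').
tags = [
    FakeTag({'href': PREFIX + '2013/01/01/example'}),
    FakeTag({'href': 'http://sport.detik.com/other/'}),
]

# Passing the tag object itself reproduces the error from the question.
try:
    URL_RE.findall(tags[0])
    raised = False
except TypeError:
    raised = True

# Passing the href string works; matches are collected in one list.
urls = []
for a in tags:
    href = a['href']
    if PREFIX in href:
        urls.extend(URL_RE.findall(href))

print(raised)
print(urls)
```

Note that the loop in the answer rebinds urls on every iteration, so only the matches from the last qualifying link survive. The sketch accumulates them with extend() instead, which is one possible way to keep every link.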
