Extract href from html

Question

I am given the following HTML:

 Acaryochloris_marina_MBIC11017_> Jun 12  2013        
 Acetobacter_pasteurianus_386B_u> Aug  8  2013

...and many more. I want to extract the href values from these links.

Here's my Python script (page_source contains the HTML):

soup = BeautifulSoup(page_source)

links = soup.find_all('a',attrs={'href': re.compile("^http://")})

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)

But this keeps raising the following error:

    links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable

Martijn Pieters · Accepted Answer

You are using BeautifulSoup version 3, not version 4. In version 3, soup.find_all is not interpreted as a method, but as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None, and calling None raises the TypeError.
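The mechanism can be reproduced without BeautifulSoup installed. The sketch below is a toy stand-in, not BeautifulSoup 3's actual source: the ToySoup class and its tag list are hypothetical. It only assumes that unknown attribute names are treated as a tag search that returns None when nothing matches.

```python
# Toy stand-in for attribute-as-tag-search; NOT BeautifulSoup 3's real code.
class ToySoup(object):
    def __init__(self, tag_names):
        self.tag_names = set(tag_names)

    def find(self, name):
        # Return the tag name if it exists, otherwise None (a failed search).
        return name if name in self.tag_names else None

    def __getattr__(self, name):
        # Python calls this only for attributes that are not defined normally,
        # so an unknown name such as "find_all" ends up here.
        return self.find(name)


soup = ToySoup(["html", "body", "a"])
print(soup.find_all)  # there is no <find_all> tag, so the lookup yields None

try:
    soup.find_all("a")  # calling None fails
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The message printed at the end matches the one in the question's traceback.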

Install BeautifulSoup 4 instead; the import is:

from bs4 import BeautifulSoup

BeautifulSoup 3 is instead imported as from BeautifulSoup import BeautifulSoup.

If you are sure you want to use BeautifulSoup 3 (not recommended), then use:

links = soup.findAll('a', attrs={'href': re.compile("^http://")})

As a side note, because you limit your search to tags whose href matches the pattern, there will always be a href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:

links = soup.find_all('a', attrs={'href': re.compile("^http://")})

for tag in links:
    link = tag['href']
    print(link)

BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression:

for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print(link)
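To compare the two approaches side by side, here is a minimal self-contained sketch. It assumes BeautifulSoup 4 is installed and uses the standard-library "html.parser" backend. The sample markup and the example.com URLs are invented placeholders, not the asker's actual page.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample markup; the URLs are placeholders for illustration only.
page_source = """
<a href="http://example.com/genomes/first/">first</a>
<a href="/relative/path/">relative link, should be skipped</a>
<a name="no-href">anchor without href, should be skipped</a>
<a href="http://example.com/genomes/second/">second</a>
"""

soup = BeautifulSoup(page_source, "html.parser")

# Regular-expression filter on the href attribute.
regex_links = [
    tag["href"]
    for tag in soup.find_all("a", attrs={"href": re.compile(r"^http://")})
]

# CSS attribute selector; the value is quoted because it contains ':' and '/'.
css_links = [tag["href"] for tag in soup.select('a[href^="http://"]')]

for link in css_links:
    print(link)
```

Both lists should contain the same two absolute URLs; the relative link and the anchor without a href are filtered out by either query.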
