user3739969

Reputation: 63

Extract href from html

I am given the following HTML:

<A HREF="Acaryochloris_marina_MBIC11017_uid58167/"><IMG border="0" SRC="SOMETHING" ALT="[DIR] "></A> <A HREF="Acaryochloris_marina_MBIC11017_uid58167/">Acaryochloris_marina_MBIC11017_&gt;</A> Jun 12  2013        
<A HREF="Acetobacter_pasteurianus_386B_uid214433/"><IMG border="0" SRC="SOMETHING" ALT="[DIR] "></A> <A HREF="Acetobacter_pasteurianus_386B_uid214433/">Acetobacter_pasteurianus_386B_u&gt;</A> Aug  8  2013 

...and many more lines like these. I want to extract the href values from them.

Here's my Python script (page_source contains the HTML):

soup = BeautifulSoup(page_source)

links = soup.find_all('a',attrs={'href': re.compile("^http://")})

for tag in links:
    link = tag.get('href',None)
    if link != None:
        print link

But this keeps raising the following error:

    links = soup.find_all('A',attrs={'HREF': re.compile("^http://")})
TypeError: 'NoneType' object is not callable

Upvotes: 0

Views: 1590

Answers (2)

Martijn Pieters

Reputation: 1124988

You are using BeautifulSoup version 3, not version 4. In version 3, soup.find_all is not a method; the attribute access is interpreted as a search for the first <find_all> element. Because there is no such element, soup.find_all resolves to None, and calling None raises the TypeError you see.
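The same lookup behaviour can be reproduced in BeautifulSoup 4 with any attribute name that is not a real method. A minimal sketch, assuming BeautifulSoup 4 is installed and using the standard-library html.parser backend; no_such_method is a made-up name chosen only for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="x/">x</a>', "html.parser")

# Unknown attribute names are treated as tag lookups: this searches for a
# <no_such_method> element, finds none, and evaluates to None.
missing = soup.no_such_method
print(missing)

# Calling that None value is what produces the TypeError.
try:
    soup.no_such_method("a")
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

This is the same failure as in the question, where find_all plays the role of the unknown name under version 3.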

Install BeautifulSoup 4 instead; the import is:

from bs4 import BeautifulSoup

BeautifulSoup 3, by contrast, is imported with from BeautifulSoup import BeautifulSoup.

If you are sure you wanted to use BeautifulSoup 3 (not recommended), then use:

links = soup.findAll('a', attrs={'href': re.compile("^http://")})

As a side note, because you limit your search to <a> tags whose href matches a pattern, there will always be an href attribute on the elements that are found. Using .get() and testing for None is entirely redundant. The following is equivalent:

links = soup.find_all('a', attrs={'href': re.compile("^http://")})

for tag in links:
    link = tag['href']
    print(link)  # the parenthesised form works in Python 2 and 3

BeautifulSoup 4 also supports CSS selectors, which could make your query a little more readable still, removing the need for you to specify a regular expression:

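# The attribute value is quoted because it contains ':' and '/',
# which are not valid in an unquoted CSS identifier.
for tag in soup.select('a[href^="http://"]'):
    link = tag['href']
    print(link)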
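One thing worth noting for the HTML in the question: the href values there are relative directory names, so a filter on ^http:// will match none of them. A minimal sketch of collecting every href instead, assuming BeautifulSoup 4 with the standard-library html.parser backend and Python 3; extract_hrefs and BASE_URL are hypothetical names made up for illustration, and the base URL is a placeholder, not the real address of the listing:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Sample rows copied from the question (a directory listing with relative links).
page_source = '''
<A HREF="Acaryochloris_marina_MBIC11017_uid58167/"><IMG border="0" SRC="SOMETHING" ALT="[DIR] "></A> <A HREF="Acaryochloris_marina_MBIC11017_uid58167/">Acaryochloris_marina_MBIC11017_&gt;</A> Jun 12  2013
<A HREF="Acetobacter_pasteurianus_386B_uid214433/"><IMG border="0" SRC="SOMETHING" ALT="[DIR] "></A> <A HREF="Acetobacter_pasteurianus_386B_uid214433/">Acetobacter_pasteurianus_386B_u&gt;</A> Aug  8  2013
'''

# Hypothetical base URL, used only to show how relative links can be resolved.
BASE_URL = "http://example.com/genomes/"


def extract_hrefs(html, base_url=None):
    """Return the unique href values of all <a> tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    # href=True matches any <a> tag that has an href attribute at all.
    for tag in soup.find_all("a", href=True):
        href = tag["href"]
        if base_url is not None:
            href = urljoin(base_url, href)
        # Each row links to the same target twice (icon and name), so skip repeats.
        if href not in found:
            found.append(href)
    return found


hrefs = extract_hrefs(page_source)
absolute = extract_hrefs(page_source, BASE_URL)
print(hrefs)
print(absolute)
```

If only absolute links are wanted, the prefix filter shown above can be applied to the resolved values rather than to the raw attributes.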

Upvotes: 2

Thargor

Reputation: 1872

Why not use the split command?

Iterate over all lines of the file and do something like this:

href = line.split("HREF=\"")[1].split("\"")[0]
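Note that the one-liner raises an IndexError on any line that does not contain HREF=" and only returns the first link on a line. A slightly more forgiving sketch of the same line-by-line idea, swapping split for a standard-library regular expression; hrefs_in_line and HREF_RE are hypothetical names used only for illustration, and this is still plain text matching, not real HTML parsing:

```python
import re

# Matches href="..." in any letter case and captures the quoted value.
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def hrefs_in_line(line):
    """Return every double-quoted href value found on one line (possibly none)."""
    return HREF_RE.findall(line)


lines = [
    '<A HREF="Acetobacter_pasteurianus_386B_uid214433/"><IMG border="0" SRC="SOMETHING" ALT="[DIR] "></A> <A HREF="Acetobacter_pasteurianus_386B_uid214433/">Acetobacter_pasteurianus_386B_u&gt;</A> Aug  8  2013',
    'and many more...',
]

for line in lines:
    print(hrefs_in_line(line))
```

Lines without a link simply yield an empty list, so the loop does not need a special case for them.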

Upvotes: -1
