cwal

Reputation: 3872

python/beautifulsoup to find all <a href> with specific anchor text

I am trying to use Beautiful Soup to parse HTML and find the href of every link that has a specific anchor text:

<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>

All the links I am looking for have exactly the same anchor text, in this case TEXT. I am NOT looking for the word TEXT itself; I want to use the word TEXT to find all the different href values.

To clarify, I am looking for something similar to using the class to select the links:

<a href="http://example.com" class="visible">TEXT</a>
<a href="http://example.com/link" class="visible">TEXT</a>
<a href="http://example.com/page" class="visible">TEXT</a>

and then using

findAll('a', 'visible')

except the HTML I am parsing doesn't have a class but always the same anchor text.

Upvotes: 24

Views: 58613

Answers (2)

cottontail

Reputation: 23011

Since Beautiful Soup 4.4.0, the text= argument has been superseded by string= (text= is the older, deprecated name). To find all anchor tags with a specific text, you can use the following:

[elm['href'] for elm in soup.find_all("a", string='TEXT')]
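
For a version that runs end to end, here is a minimal sketch of the same exact-match lookup. The sample HTML (including the OTHER link and the anchor without an href) is made up for illustration, and href=True is added so that anchors lacking an href are skipped rather than raising a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the links in the question.
html = """
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>
<a href="http://example.com/other">OTHER</a>
<a name="no-href">TEXT</a>
"""

soup = BeautifulSoup(html, "html.parser")

# string="TEXT" keeps only anchors whose text is exactly TEXT;
# href=True keeps only anchors that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True, string="TEXT")]
print(hrefs)
```

Only the three TEXT links that have an href end up in the list.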

The check above keeps only tags whose string matches exactly. If you need a different condition, for example that the anchor text starts with a specific string, you can pass a compiled regex or a function instead:

# filter anchor tags whose text starts with `TEXT`
import re
[elm['href'] for elm in soup.find_all("a", string=re.compile("^TEXT"))]

# or a plain string check (x can be None when the tag has no single string)
[elm['href'] for elm in soup.find_all("a", string=lambda x: x is not None and x.startswith('TEXT'))]
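
One thing to watch with the function form: an anchor that contains more than one child (say an image plus text) has no single .string, and depending on the Beautiful Soup version the function may then be called with None. The sketch below guards against that; the sample HTML and the helper name starts_with_text are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the third anchor mixes an <img> with text,
# so its .string is None.
html = """
<a href="/a">TEXT one</a>
<a href="/b">other TEXT</a>
<a href="/c"><img src="i.png"/>TEXT two</a>
"""

soup = BeautifulSoup(html, "html.parser")


def starts_with_text(s):
    """Return True only for real strings that begin with TEXT."""
    return s is not None and s.startswith("TEXT")


hrefs = [a["href"] for a in soup.find_all("a", string=starts_with_text)]
print(hrefs)
```

Here only /a is returned: /b does not start with TEXT, and /c has no single string to test.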

Finally, since .find_all and .select return a ResultSet, which is a subclass of Python's list, you can also filter the result yourself with an if clause in a list comprehension:

[elm['href'] for elm in soup.find_all("a") if elm.string == 'TEXT']
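
Filtering by hand also makes it easy to loosen the comparison. As a rough sketch (the sample HTML is made up), comparing get_text(strip=True) instead of .string also catches anchors whose text is padded with whitespace or sits next to another element such as an image:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with padded and mixed-content anchors.
html = """
<a href="/plain">TEXT</a>
<a href="/padded">  TEXT  </a>
<a href="/icon"><img src="icon.png"/>TEXT</a>
<a href="/other">OTHER</a>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", href=True)

# .string must equal TEXT exactly; it is None for the mixed-content anchor.
by_string = [a["href"] for a in anchors if a.string == "TEXT"]

# get_text(strip=True) joins all text inside the anchor and trims whitespace.
by_text = [a["href"] for a in anchors if a.get_text(strip=True) == "TEXT"]

print(by_string)
print(by_text)
```

Which comparison is right depends on how strict "the same anchor text" needs to be for the pages being parsed.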

Upvotes: 1

RocketDonkey

Reputation: 37249

Would something like this work?

In [39]: from bs4 import BeautifulSoup

In [40]: s = """\
   ....: <a href="http://example.com">TEXT</a>
   ....: <a href="http://example.com/link">TEXT</a>
   ....: <a href="http://example.com/page">TEXT</a>
   ....: <a href="http://dontmatchme.com/page">WRONGTEXT</a>"""

In [41]: soup = BeautifulSoup(s, 'html.parser')

In [42]: for link in soup.find_all('a', href=True, string='TEXT'):
   ....:     print(link['href'])
   ....:
   ....:
http://example.com
http://example.com/link
http://example.com/page

Upvotes: 47
