Reputation: 3872
I am trying to use Beautiful Soup to parse HTML and find the href of every anchor tag that has a specific anchor text:
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>
All the links I am looking for have exactly the same anchor text, in this case TEXT. I am NOT looking for the word TEXT itself; I want to use the word TEXT to find all the different href values.
To clarify, I am looking for something similar to using the class to find the links:
<a href="http://example.com" class="visible">TEXT</a>
<a href="http://example.com/link" class="visible">TEXT</a>
<a href="http://example.com/page" class="visible">TEXT</a>
and then using
findAll('a', 'visible')
except that the HTML I am parsing has no class on the links, only the same anchor text every time.
Upvotes: 24
Views: 58613
Reputation: 23011
Since BeautifulSoup 4.4.0, the text= parameter has been deprecated in favor of string=. So to find all anchor tags with a specific text, you can use the following:
[elm['href'] for elm in soup.find_all("a", string='TEXT')]
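For anyone who wants to run the one-liner above end to end, here is a minimal self-contained sketch. The sample markup is borrowed from the question, the extra anchor without an href is made up for illustration, and adding href=True is my own assumption about what is wanted (it skips anchors that have no href, so elm["href"] cannot raise KeyError).

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus two hypothetical anchors
# that should NOT be returned.
html = """
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link">TEXT</a>
<a href="http://example.com/page">TEXT</a>
<a href="http://dontmatchme.com/page">WRONGTEXT</a>
<a name="no-href">TEXT</a>
"""

soup = BeautifulSoup(html, "html.parser")

# string="TEXT" requires the anchor's .string to equal "TEXT" exactly;
# href=True keeps only anchors that actually carry an href attribute.
links = [elm["href"] for elm in soup.find_all("a", href=True, string="TEXT")]

for link in links:
    print(link)
```

This needs Beautiful Soup 4.4.0 or later because of the string= keyword.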
The above check only keeps tags whose string matches exactly. If you have other conditions, for example the anchor text has to start with a specific string, you can also pass a compiled regex or a function that filters for that:
# filter anchor tags whose text starts with `TEXT`
import re
[elm['href'] for elm in soup.find_all("a", string=re.compile("^TEXT"))]
# or a plain string check (x can be None for anchors without a single string, so guard against it)
[elm['href'] for elm in soup.find_all("a", string=lambda x: x is not None and x.startswith('TEXT'))]
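A runnable sketch of both prefix filters, under the assumption that some anchors may have no single text string (for example an image link). The markup and the helper name starts_with_text are hypothetical. The helper checks for None first, since the filter function may be handed None for such anchors.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two anchors start with TEXT, one merely contains it,
# and one has no text at all (its .string is None).
html = """
<a href="http://example.com/a">TEXT one</a>
<a href="http://example.com/b">TEXT two</a>
<a href="http://example.com/c">Other TEXT</a>
<a href="http://example.com/d"><img src="logo.png"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Regex filter: anchor text must begin with TEXT.
by_regex = [a["href"] for a in soup.find_all("a", string=re.compile(r"^TEXT"))]


def starts_with_text(s):
    """Return True only for a real string that begins with TEXT."""
    return s is not None and s.startswith("TEXT")


# Function filter: same condition, written as a None-safe predicate.
by_func = [a["href"] for a in soup.find_all("a", string=starts_with_text)]

print(by_regex)
print(by_func)
```

Both lists should contain only the /a and /b links.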
Finally, since .find_all and .select return a ResultSet object, which is essentially a Python list, you can just filter the result with an if clause in a list comprehension:
[elm['href'] for elm in soup.find_all("a") if elm.string == 'TEXT']
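One caveat with comparing elm.string: it is None when the anchor contains more than one child node, and it keeps surrounding whitespace. The sketch below contrasts that check with a looser one based on get_text(strip=True). The markup is hypothetical, and whether the looser match is what you want depends on your pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: padded text and nested markup inside the anchors.
html = """
<a href="http://example.com">TEXT</a>
<a href="http://example.com/link"> TEXT </a>
<a href="http://example.com/page"><span>TE</span>XT</a>
<a href="http://dontmatchme.com/page">WRONGTEXT</a>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", href=True)

# Strict: .string must be exactly "TEXT" (no padding, no nested tags).
strict = [a["href"] for a in anchors if a.string == "TEXT"]

# Loose: strip and concatenate all text inside the anchor, then compare.
loose = [a["href"] for a in anchors if a.get_text(strip=True) == "TEXT"]

print(strict)
print(loose)
```

Here the strict check finds only the first link, while the loose check finds the first three.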
Upvotes: 1
Reputation: 37249
Would something like this work?
In [39]: from bs4 import BeautifulSoup
In [40]: s = """\
....: <a href="http://example.com">TEXT</a>
....: <a href="http://example.com/link">TEXT</a>
....: <a href="http://example.com/page">TEXT</a>
....: <a href="http://dontmatchme.com/page">WRONGTEXT</a>"""
In [41]: soup = BeautifulSoup(s, 'html.parser')
In [42]: for link in soup.findAll('a', href=True, text='TEXT'):
....:     print(link['href'])
....:
....:
http://example.com
http://example.com/link
http://example.com/page
Upvotes: 47