K.ahmed104

Reputation: 21

Beautiful Soup finding href based on hyperlink Text

I'm having trouble getting Beautiful Soup to find an a href with specific hyperlink text and extract only the href.

I have the code below, but I can't get it to return only the href (whatever is between the opening " and the closing ") based on the hyperlink text of that link.

res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
temp_tag_href = soup.select_one("a[href*=some text]")
sometexthrefonly = temp_tag_href.attrs['href']

Effectively, I would like it to go through the entire HTML parsed into soup and return only what is between the href's opening " and closing ", chosen because that link's hyperlink text is 'some text'.

So the steps would be:

1: parse the HTML,
2: look at all the a href tags,
3: find the one whose hyperlink text is 'some text',
4: output only what is between the href " " (not including the "") for that tag

Any help would be greatly appreciated!

Upvotes: 0

Views: 564

Answers (2)

Curtis Cali

Reputation: 133

ahmed,

So after some quick refreshers on requests and researching the BeautifulSoup library, I think you'll want something like the following:

import bs4
import requests

res = requests.get(website_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# keep only anchors that have an href and whose visible text is 'some text'
matches = [a for a in soup.find_all('a', href=True) if a.get_text(strip=True) == 'some text']
print(matches[0]['href'])  # since you don't specify output to where, I'll use stdout for simplicity

As it turns out, the Beautiful Soup documentation describes a convenient way to read any attribute of an HTML element using dictionary lookup syntax, e.g. link['href']. The library also supports many other kinds of lookups.
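To make the lookup by hyperlink text concrete, here is a minimal sketch. The sample markup and the helper name href_for_link_text are made up for illustration (they are not from the question), and the sketch assumes an exact match on the visible text after stripping surrounding whitespace is what is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for res.text
html = '''
<a href="/first">other text</a>
<a name="anchor-without-href">some text</a>
<p><a href="/wanted"> some text </a></p>
'''


def href_for_link_text(soup, text):
    """Return the href of the first <a> with an href whose visible text equals `text`, else None."""
    # href=True skips anchors that have no href attribute at all
    for a in soup.find_all('a', href=True):
        # get_text(strip=True) removes surrounding whitespace before comparing
        if a.get_text(strip=True) == text:
            return a['href']
    return None


soup = BeautifulSoup(html, 'html.parser')
wanted = href_for_link_text(soup, 'some text')
missing = href_for_link_text(soup, 'no such link')
print(wanted)
print(missing)
```

Returning None when nothing matches avoids the IndexError that indexing an empty result list with [0] would raise.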

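If you are doing web scraping, it may also be useful to try switching to a library that supports XPath, which allows you to write powerful queries such as (//a[@href="some text"])[1], which will get you the first link in the document with a URL equal to "some text". The parentheses matter: without them, [1] is applied per parent element rather than to the whole result.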
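As a rough illustration of the XPath idea applied to link text rather than the URL, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset. The sample markup is hypothetical and deliberately well-formed, since ElementTree is an XML parser and real pages usually need a more forgiving HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption for this sketch)
doc = ET.fromstring(
    '<html><body>'
    '<a href="/first">other text</a>'
    '<div><a href="/wanted">some text</a></div>'
    '<p><a href="/second-match">some text</a></p>'
    '</body></html>'
)

# [.='some text'] keeps only <a> elements whose complete text is 'some text'
# (this predicate form needs Python 3.7 or later)
matches = doc.findall(".//a[.='some text']")

# Results come back in document order, so index 0 is the first match
hrefs = [a.get('href') for a in matches]
first_href = hrefs[0] if hrefs else None
print(first_href)
```

A full XPath engine would let the query return the attribute directly; with ElementTree's subset, the href is read from the matched element instead.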

Upvotes: 2

Ali Yılmaz

Reputation: 1695

This should do the job:

from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<div><a href="another_url">later</a></div>
<h3><a href="yet_another_url">later</a></h3>'''

soup = BeautifulSoup(html, 'html.parser')

# iterate over all anchors that have an href
for a in soup.find_all('a', href=True):
    print("Next HREF: %s" % a['href'])
    # compare the visible link text, not the URL
    if a.get_text(strip=True) == 'next':
        print("Found it!")

Upvotes: 0
