Amiclone

Reputation: 408

Get tag 'a' from beautiful soup

I have an HTML page parsed as soup 'a'. On that page I am interested in finding the href of every <a> tag whose text contains 'AFT' (case insensitive). When I run this:

>>> rows = a.findAll('span', attrs={'class': 'views-field views-field-title'})

The output is:

[<span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs" hreflang="en">30 October 2020: AFT’s next issuance of BTFs: Monday 02 November 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT’s next issuance of long-term OATs: Thursday 05 November 2020</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201026-issuance-btfs" hreflang="en">26 October 2020: AFT's issuance: 5.289 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs" hreflang="en">23 October 2020: AFT’s next issuance of BTFs: Monday 26 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201019-issuance-btfs" hreflang="en">19 October 2020: AFT's issuance: 5.489 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs" hreflang="en">16 October 2020: AFT’s next issuance of BTFs: Monday 19 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats" hreflang="en">15 October 2020: AFT’s issuance: 1.000 billion euros of inflation-indexed OATs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201015-issuance-oats" hreflang="en">15 October 2020: AFT’s issuance: 7.240 billion euros of medium-term OATs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201012-issuance-btfs" hreflang="en">12 October 2020: AFT's issuance: 5.288 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats" hreflang="en">09 October 2020: AFT’s next issuance of inflation-indexed OATs: Thursday 15 October 2020</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs" hreflang="en">09 October 2020: AFT’s next issuance of BTFs: Monday 12 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-oats" hreflang="en">09 October 2020: AFT’s next issuance of medium-term OATs: Thursday 15 October 2020</a>
</span></span>]

So from the above I want every href except the one in the 2nd element of the list, because its text does not contain 'AFT':

<span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT’s next issuance of long-term OATs: Thursday 05 November 2020</a>
</span></span>

Could someone help me extract the hrefs as a list, either from rows or maybe directly from a? Thanks.

Upvotes: 0

Views: 86

Answers (3)

MendelG

Reputation: 20098

To find the hrefs whose link text contains AFT, you can use the :contains(<my text>) pseudo-class in a CSS selector (a Soup Sieve extension, not standard CSS):

from bs4 import BeautifulSoup

# html_snippet is assumed to hold the page's HTML as a string
soup = BeautifulSoup(html_snippet, "html.parser")

# Select `a` tags inside elements with classes `views-field views-field-title` whose text contains `AFT`
for tag in soup.select(".views-field.views-field-title a:contains(AFT)"):
    print(tag['href'])

Output:

/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201026-issuance-btfs
/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201019-issuance-btfs
/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats
/index.php/en/publications/communiques-presse/20201015-issuance-oats
/index.php/en/publications/communiques-presse/20201012-issuance-btfs
/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats
/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201009-next-issuance-oats
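
Since the question asks for a list rather than printed lines, the same selector can feed a list comprehension. The sketch below is self-contained: html_snippet is a shortened stand-in for the page (the hrefs and titles are trimmed from the question, not the real document), and it uses :-soup-contains(), the spelling that newer Soup Sieve releases prefer over the deprecated :contains(). It assumes a reasonably recent beautifulsoup4/soupsieve install.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: same markup shape as in the question).
html_snippet = """
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-btfs" hreflang="en">30 October 2020: AFT's next issuance of BTFs</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT's next issuance of long-term OATs</a>
</span></span>
"""

soup = BeautifulSoup(html_snippet, "html.parser")

# Anchors under the title spans whose text contains "AFT" (the match is case-sensitive).
selector = '.views-field.views-field-title a:-soup-contains("AFT")'
hrefs = [tag["href"] for tag in soup.select(selector)]
print(hrefs)
```

Note that this match is case-sensitive, so on its own it does not cover the case-insensitive part of the question.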

Upvotes: 0

buran

Reputation: 14273

href = [row.find('a').get('href') for row in rows if 'AFT' in row.text]
print(href)

Output:

['/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201026-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201019-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats',
 '/index.php/en/publications/communiques-presse/20201015-issuance-oats',
 '/index.php/en/publications/communiques-presse/20201012-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-oats']
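
The question asks for a case-insensitive match, which a plain 'AFT' in row.text test does not give. One possible variant, sketched below on a shortened stand-in for the page, uses a regular expression with re.IGNORECASE. The \b word boundaries are an addition of this sketch, so that words such as "after" or "draft" do not count as matches; the snippet contents and the name aft_pattern are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: same markup shape as in the question).
html_snippet = """
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-btfs" hreflang="en">30 October 2020: AFT's next issuance of BTFs</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT's draft issued after review</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201026-issuance-btfs" hreflang="en">26 October 2020: aft's issuance of BTFs</a>
</span></span>
"""

soup = BeautifulSoup(html_snippet, "html.parser")
rows = soup.find_all("span", attrs={"class": "views-field views-field-title"})

# Whole-word "AFT" in any letter case; "after" and "draft" are not matched.
aft_pattern = re.compile(r"\bAFT\b", re.IGNORECASE)

hrefs = [
    row.find("a").get("href")
    for row in rows
    # Skip rows without a link, then test the row's visible text.
    if row.find("a") is not None and aft_pattern.search(row.get_text())
]
print(hrefs)
```

If whole-word matching is not wanted, 'aft' in row.get_text().lower() is the simpler substring equivalent.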

Upvotes: 1

sytech

Reputation: 41119

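You can write a custom filter function for your needs and pass it to find_all, which calls it with each tag and keeps the ones for which it returns a truthy value.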

def aft_tag(tag):
    return tag.get('href') and 'AFT' in tag.text

for tag in soup.find_all(aft_tag):
    print(tag.get('href'))

Another way to write this would be:

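for row in a.findAll('span', attrs={'class': 'views-field views-field-title'}):
    anchor = row.find('a')
    # Test the anchor's text; `'AFT' in anchor` would only compare against its child nodes
    if anchor is not None and 'AFT' in anchor.text:
        print(anchor.get('href'))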
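
One detail worth keeping in mind for loops like this: applying in directly to a Tag tests membership among the tag's direct children rather than searching its text, so a substring check should go through .text or get_text(). A minimal sketch on a made-up one-line snippet (the href and title are illustrative only):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">26 October 2020: AFT issuance</a>', "html.parser")
anchor = soup.find("a")

# `in` on a Tag looks through its direct children (anchor.contents),
# so this asks whether some child is equal to "AFT", not whether the text contains it.
in_children = "AFT" in anchor

# A substring test has to run on the extracted text instead.
in_text = "AFT" in anchor.get_text()

print(in_children, in_text)
```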

Upvotes: 0
