Sainita

Reputation: 362

How to remove parent element in BeautifulSoup?

Given this HTML structure:

<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong> 

I need to remove the entire element/tag if the HTML structure contains fertilizer.com.

The final result should be:

null

I learned that bs4 has a decompose() method for removing elements, but how do I apply it to the parent element, and how do I navigate to it?

Please guide me. Thanks

Upvotes: 0

Views: 661

Answers (1)

Federico Baù

Reputation: 7735

Given only the piece of HTML provided, this would be my solution:

from bs4 import BeautifulSoup

txt = '''
<strong>
    <a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''

soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs)  # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # at least one link points at the target domain
    soup.decompose()  # destroys the entire parsed tree, not just one tag
print(f'Content After decomposition:\n{soup}')
# <None></None>
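
If the goal is to remove only the element that wraps the matching link, rather than the whole document, a minimal sketch could navigate upward with .parent and call decompose() there. This assumes the link sits directly inside the element you want gone (here the first <strong>); the variable names link and remaining are only illustrative.

```python
from bs4 import BeautifulSoup

txt = '''
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
'''

soup = BeautifulSoup(txt, 'html.parser')
target = "www.fertilizer.com"

# Find the first <a> whose href contains the target domain.
# The callable receives the attribute value (or None if the tag has no href).
link = soup.find('a', href=lambda h: h is not None and target in h)

if link is not None:
    # .parent is the tag that directly contains the link (the first <strong>).
    # decompose() removes that tag and everything inside it from the tree.
    link.parent.decompose()

remaining = str(soup)
print(remaining)
```

If the wrapper is not the direct parent, link.find_parent('strong') climbs to the nearest enclosing <strong> instead. Note that the free text after the removed tag is a sibling, not a child, so it stays in the document.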

Another solution, in case you just want to get nothing back, is the following. Note that the second loop removes the free text that is not enclosed in any tag:

from bs4 import BeautifulSoup


txt = '''
<strong>
    <a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong> 
has released an employment notification for the recruitment of 
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong> 
'''

soup = BeautifulSoup(txt, 'html.parser')

print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
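if hrefs:  # at least one link points at the target domain
    # Handles tags
    for el in soup.find_all():
        el.replace_with("")  # replace_with is the current name; replaceWith is a deprecated alias
    # Handles free text like: 'has released an employment notification for the recruitment of ' (because it is not inside any tag)
    for el in soup.find_all(string=True):  # string= replaces the older text= argument
        el.replace_with("")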
print(f'Content After decomposition:\n{soup}')
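
Since the question asks for null as the final result, another option is to skip mutating the tree and simply return None when a matching link is present. The sketch below assumes each snippet is processed as a separate string; drop_if_links_to is a hypothetical helper name, not part of bs4.

```python
from typing import Optional

from bs4 import BeautifulSoup


def drop_if_links_to(html: str, target: str) -> Optional[str]:
    """Return None if any <a href> in `html` contains `target`, else `html` unchanged."""
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=True):
        if target in link['href']:
            return None
    return html


snippet = (
    '<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">'
    'Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification'
)

result = drop_if_links_to(snippet, "www.fertilizer.com")
print(result)  # None
```

A snippet without a matching link comes back untouched, so the helper can be mapped over a list of snippets and the None entries filtered out afterwards.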


Upvotes: 1
