Reputation: 362
Given this HTML structure:
<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> has released an employment notification for the recruitment of <strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy</strong>
I need to remove the entire element/tag if the HTML structure contains fertilizer.com, so that the final result is:
null
I learned that bs4 has a decompose() method for removing elements, but how do I apply it to the parent element, and how do I navigate to it?
Please guide me. Thanks
Upvotes: 0
Views: 661
Reputation: 7735
Given only the piece of HTML provided, this would be my solution:
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    soup.decompose()
print(f'Content After decomposition:\n{soup}')
# <None></None>
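If only the element wrapping the matching link should go, rather than the whole document, bs4's find_parent() can be used to walk up from the <a> tag before calling decompose(). The following is a minimal sketch, not a definitive implementation: the wrapping <p> tags, the second paragraph, and the helper name remove_enclosing are assumptions made for illustration, since the HTML in the question has no single enclosing element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> wrappers and second paragraph are assumptions,
# added so that there is an enclosing element to remove.
html = (
    '<p><strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" '
    'target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> '
    'has released an employment notification</p>'
    '<p>Unrelated paragraph</p>'
)


def remove_enclosing(soup, target, parent_name):
    """Decompose the nearest `parent_name` ancestor of each <a> whose href contains `target`.

    Hypothetical helper. The tree is searched again after every removal,
    so an already decomposed tag is never touched.
    """
    removed = 0
    while True:
        link = soup.find('a', href=lambda h: h is not None and target in h)
        if link is None:
            break
        parent = link.find_parent(parent_name)
        if parent is None:
            # No such ancestor: fall back to removing the link itself.
            parent = link
        parent.decompose()
        removed += 1
    return removed


soup = BeautifulSoup(html, 'html.parser')
removed = remove_enclosing(soup, 'fertilizer.com', 'p')
result = str(soup)
print(removed)
print(result)
```

For a one-level hop, link.parent gives the direct parent (here the <strong> tag); find_parent() is useful when the element to remove sits further up the tree.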
Another solution, in case you just want to get nothing back, is the following. Note that the second loop removes the free text that is not enclosed in any particular tag.
from bs4 import BeautifulSoup
txt = '''
<strong>
<a href="https://www.fertilizer.com/2021/07/bvfcl.html" target="_blank">Fertilizer Corporation Limited</a> (BVFCL)
</strong>
has released an employment notification for the recruitment of
<strong>11 DGM, Company Secretary, Finance Manager and Accounts Officer Vacancy
</strong>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(f'Content Before decomposition:\n{soup}')
target = "www.fertilizer.com"
hrefs = [link['href'] for link in soup.find_all('a', href=True) if target in link['href']]
print(hrefs) # ['https://www.fertilizer.com/2021/07/bvfcl.html']
if hrefs:  # Means we found it
    # Handles tags
    for el in soup.find_all():
        el.replace_with("")
    # Handles free text like: 'has released an employment notification for the recruitment of ' (because it is not in a particular tag)
    for el in soup.find_all(string=True):
        el.replace_with("")
print(f'Content After decomposition:\n{soup}')
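As a possibly simpler variant of the same idea, the two loops can be replaced by a single call to clear(), which detaches every child of the object it is called on, tags and free text alike. This is only a sketch under the assumption that an empty document is the desired outcome; mapping the empty string to None is a guess at what "null" means in the question.

```python
from bs4 import BeautifulSoup

html = (
    '<strong><a href="https://www.fertilizer.com/2021/07/bvfcl.html" '
    'target="_blank">Fertilizer Corporation Limited</a> (BVFCL)</strong> '
    'has released an employment notification'
)

soup = BeautifulSoup(html, 'html.parser')
target = 'fertilizer.com'

# Look for any <a> whose href contains the target domain.
match = soup.find('a', href=lambda h: h is not None and target in h)
if match is not None:
    # Remove all children of the soup, both tags and bare strings.
    soup.clear()

result = str(soup)
# Assumption: an empty document should be reported as None ("null").
cleaned = result or None
print(repr(result))
print(cleaned)
```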
Upvotes: 1