Reputation: 280
I am using BeautifulSoup in Python and want to remove everything from a string that is enclosed in a certain tag and contains a specific non-closing tag followed by specific text. In this example, I want to remove every document that contains a type tag with the text DOCA.
Let's say I have something like this:
<body>
<document>
<type>DOCA
<sequence>1
<filename>DOCA.htm
<description>FORM DOCA
<text>
<title>Form DOCA</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
</document>
<document>
<type>DOCB
<sequence>1
<filename>DOCB.htm
<description>FORM DOCB
<text>
<title>Form DOCB</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
</document>
</body>
What I want to do is remove every <document> that has a <type> of DOCA. I have tried the following, but it doesn't work:
>>> print(soup.find('document').find('type', text=re.compile('DOCA.*')))
None
Upvotes: 1
Views: 2619
Reputation: 15376
You can pass a lambda to the find method to select an element, e.g.:
soup.find('document').find(lambda tag : tag.name == 'type' and 'DOCA' in tag.text)
Then you can use extract or decompose to remove that element.
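As a rough sketch of how that could look end to end (the SAMPLE string is a shortened, hypothetical stand-in for the input in the question, and this version searches the whole soup rather than only the first document): with html.parser the unclosed tags end up nested inside <type>, so removing the <type> element alone would leave an empty <document> behind. Assuming the goal is to drop the whole document, find_parent can be used to climb back up before calling decompose.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, shortened from the question.
SAMPLE = """<body>
<document>
<type>DOCA
<sequence>1
<text>
<h5>Form DOCA</h5>
</document>
<document>
<type>DOCB
<sequence>1
<text>
<h5>Form DOCB</h5>
</document>
</body>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Select the first <type> element whose text contains "DOCA".
type_tag = soup.find(lambda tag: tag.name == "type" and "DOCA" in tag.text)

if type_tag is not None:
    # Climb to the enclosing <document> and remove it from the tree.
    type_tag.find_parent("document").decompose()

# Collect the type label (first text node of <type>) of each remaining document.
remaining = [doc.find("type").contents[0].strip() for doc in soup.find_all("document")]
print(remaining)
```

extract would work here as well; the difference is that extract returns the removed element, while decompose destroys it.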
Edit: use this expression to select all matching elements:
soup.find_all(lambda tag: tag.name == 'document'
              and tag.find(lambda t: t.name == 'type' and 'DOCA' in t.text))
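Putting that expression together with decompose, a minimal sketch might look like the following. The function name remove_documents and the SAMPLE string are made up for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup


def remove_documents(html, doc_type):
    """Return html without any <document> whose <type> text contains doc_type.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a plain list-like result, so it is safe to
    # modify the tree while looping over it.
    matches = soup.find_all(
        lambda tag: tag.name == "document"
        and tag.find(lambda t: t.name == "type" and doc_type in t.text) is not None
    )
    for doc in matches:
        doc.decompose()
    return str(soup)


# Hypothetical sample input, shortened from the question.
SAMPLE = """<body>
<document>
<type>DOCA
<sequence>1
<filename>DOCA.htm
</document>
<document>
<type>DOCB
<sequence>1
<filename>DOCB.htm
</document>
</body>"""

cleaned = remove_documents(SAMPLE, "DOCA")
print(cleaned)
```

Note that 'DOCA' in t.text is a substring test over everything nested under <type>, so a document of another type that merely mentions DOCA somewhere in its body would also be removed.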
Upvotes: 3
Reputation: 403248
You can query all documents and then, within each document, query all types, check whether DOCA exists in any of them, and delete the entire enclosing document if it does.
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'html.parser')

for doc in soup.find_all('document'):
    for type_tag in doc.find_all('type'):
        if 'DOCA' in type_tag.text:
            # Remove the whole enclosing document, then move on to the next one.
            doc.extract()
            break

print(soup)
Output:
<body>
<document>
<type>DOCB
<sequence>1
<filename>DOCB.htm
<description>FORM DOCB
<text>
<title>Form DOCB</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
</text></description></filename></sequence></type></document>
</body>
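The run of closing tags at the end of the output shows that html.parser nests each unclosed tag inside the previous one. That is most likely also why the text= filter in the question returns None: that filter is compared against the tag's .string, and .string is None once a tag has more than one child. A small sketch to check this, using a made-up two-tag sample and the string= spelling of the same argument:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical minimal input with the same unclosed-tag pattern.
SAMPLE = """<document>
<type>DOCA
<sequence>1
</document>"""

soup = BeautifulSoup(SAMPLE, "html.parser")
type_tag = soup.find("type")

# <sequence> is parsed as a child of <type>, so <type> holds a text node
# plus a nested tag.
print(len(type_tag.contents))

# With more than one child, .string is None, so a string filter on <type>
# has nothing to match against.
print(type_tag.string)
print(soup.find("type", string=re.compile("DOCA")))

# The first child is the text that directly follows <type>; comparing it
# exactly is stricter than a substring test on type_tag.text.
first_text = type_tag.contents[0].strip()
print(first_text == "DOCA")
```

If a substring test over the whole nested text is too loose for your data, comparing that first text node is one possible alternative.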
Upvotes: 3