cullan

Reputation: 280

Python beautifulsoup to remove all tags/content with specific tag and text following

I am using BeautifulSoup in Python and want to remove every element of a certain tag that contains a specific unclosed child tag followed by specific text. In this example, I want to remove every <document> element that contains a <type> tag whose text is DOCA.

Let's say I have something like this:

<body>
    <document>
        <type>DOCA
            <sequence>1
            <filename>DOCA.htm
            <description>FORM DOCA
            <text>
                <title>Form DOCA</title>
                <h5 align="left"><a href="#toc">Table of Contents</a></h5>
    </document>
    <document>
        <type>DOCB
        <sequence>1
        <filename>DOCB.htm
        <description>FORM DOCB
        <text>
            <title>Form DOCB</title>
            <h5 align="left"><a href="#toc">Table of Contents</a></h5>
    </document>
</body>

What I want to do is remove every <document> that has a <type> of DOCA. I tried the following, but it doesn't work:

>>> print(soup.find('document').find('type', text=re.compile('DOCA.*')))
None

Upvotes: 1

Views: 2619

Answers (2)

t.m.adam

Reputation: 15376

You can pass a lambda to the find method to select an element, e.g.:

soup.find('document').find(lambda tag: tag.name == 'type' and 'DOCA' in tag.text)

Then you can call extract() or decompose() on the result to remove that element.
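As a rough illustration of how the two removal methods differ, here is a small sketch on made-up markup (the <p> tags and ids are hypothetical and not taken from the question). extract() detaches the tag and hands it back, while decompose() detaches and destroys it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to show the two removal methods.
soup = BeautifulSoup(
    '<div><p id="a">first</p><p id="b">second</p><p id="c">third</p></div>',
    'html.parser',
)

# extract() removes the tag from the tree and returns it, so it can be reused.
removed = soup.find('p', id='a').extract()
print(removed)

# decompose() removes the tag and destroys it; it returns None.
soup.find('p', id='b').decompose()

# Only the third paragraph is left in the tree.
print(soup)
```

If the removed element is not needed afterwards, either method should do.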

Edit: use this expression to select all matching <document> elements:

soup.find_all(lambda tag: tag.name == 'document'
    and tag.find(lambda t: t.name == 'type' and 'DOCA' in t.text))
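Combining that selector with decompose, a minimal end-to-end sketch might look like this. The helper name remove_documents_of_type and the SAMPLE string are hypothetical, introduced only for illustration; the markup is a trimmed copy of the question's sample, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (illustrative only).
SAMPLE = """<body>
    <document>
        <type>DOCA
            <sequence>1
            <filename>DOCA.htm
            <description>FORM DOCA
            <text>
                <title>Form DOCA</title>
    </document>
    <document>
        <type>DOCB
        <sequence>1
        <filename>DOCB.htm
        <description>FORM DOCB
        <text>
            <title>Form DOCB</title>
    </document>
</body>"""


def remove_documents_of_type(markup, doc_type):
    """Hypothetical helper: drop every <document> whose <type> text contains doc_type."""
    soup = BeautifulSoup(markup, 'html.parser')

    def is_match(tag):
        # Only <document> tags qualify, and only if a nested <type> mentions doc_type.
        if tag.name != 'document':
            return False
        found = tag.find(lambda t: t.name == 'type' and doc_type in t.text)
        return found is not None

    # find_all() returns a list, so removing tags while looping over it is safe.
    for doc in soup.find_all(is_match):
        doc.decompose()
    return soup


cleaned = remove_documents_of_type(SAMPLE, 'DOCA')
print(cleaned)
```

One caveat: because <type> is never closed, t.text covers everything nested inside it, so the substring test also matches a document that merely mentions DOCA further down. Comparing only the first text node of <type> would be stricter.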

Upvotes: 3

cs95

Reputation: 403248

You can query all documents and then, within each document, query all types, check to see if DOCA exists in any of them, and delete the entire enclosing document if it does.

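from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'html.parser')

# find_all() returns a list, so removing documents while looping is safe.
for doc in soup.find_all('document'):
    # Named type_tag to avoid shadowing the built-in type().
    for type_tag in doc.find_all('type'):
        if 'DOCA' in type_tag.text:
            doc.extract()  # detach the whole enclosing <document>
            break

print(soup)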

Output:

<body>

<document>
<type>DOCB
        <sequence>1
        <filename>DOCB.htm
        <description>FORM DOCB
        <text>
<title>Form DOCB</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
</text></description></filename></sequence></type></document>
</body>
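The closing tags in that output also hint at why the attempt in the question returned None. The sketch below assumes html.parser and uses a trimmed, illustrative copy of the markup. Since <type> is never closed, the following tags end up nested inside it, so its .string is None, and a text filter on a tag is matched against .string. It uses string=, the newer name for the text= argument.

```python
import re

from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the question's markup.
markup = "<document><type>DOCA\n<sequence>1\n<filename>DOCA.htm\n</document>"
soup = BeautifulSoup(markup, 'html.parser')
type_tag = soup.find('type')

# <sequence> is nested inside <type>, so <type> has more than one child
# and its .string is None.
print(type_tag.string)

# The string filter is compared with .string, so nothing matches here.
print(soup.find('type', string=re.compile('DOCA')))

# .text joins all nested strings, so a substring test still succeeds.
print('DOCA' in type_tag.text)
```

This is why both answers test tag.text instead of passing a text pattern to find.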

Upvotes: 3
