Remove a portion of HTML text using Python

Question

I have a very long HTML text of the following structure:


    
        Paragraph 1 Lorem ipsum dolor... long text... 
        Paragraph 2 Lorem ipsum dolor... long text... 
        Paragraph 3 Lorem ipsum dolor... long text...

Now, let's say I want to trim the HTML text to just 1000 characters, but I still want the HTML to be valid, that is, close the tags whose closing tags were removed. What can I do to correct the trimmed HTML text using Python? Note that the HTML is not always structured as above.

I need this for an email campaign wherein a preview of the blog is sent but the recipient needs to visit the blog's URL to see the complete article.

SeniorFoffo · Accepted Answer

How about BeautifulSoup? (python-bs4)

from bs4 import BeautifulSoup

test_html = """
    
        Paragraph 1 Lorem ipsum dolor... long text... 
        Paragraph 2 Lorem ipsum dolor... long text... 
        Paragraph 3 Lorem ipsum dolor... long text... 
    
"""

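# Cut the raw string; this can leave tags unclosed.
test_html = test_html[:50]
# Parsing the fragment builds a tree in which every open tag is closed again.
soup = BeautifulSoup(test_html, 'html.parser')

print(soup.prettify())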

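The parser closes any tags left open by the slice when it builds the tree, so the output of .prettify() (or str(soup)) should be well-formed again. Keep in mind that slicing the raw string counts markup characters as well as text, and the cut can land in the middle of a tag.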
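If the limit should apply to the visible text rather than to the raw markup, one alternative is to walk the HTML and stop once enough text has been copied. The following is a minimal sketch using only the standard library's html.parser instead of BeautifulSoup; the names TruncatingParser and truncate_html, the sample markup, and the limit of 30 are all made up for illustration. It drops comments and doctype declarations and counts whitespace between tags as text, so treat it as a starting point rather than a finished solution.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copy HTML to a buffer until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit  # text characters still allowed
        self.open_tags = []     # stack of tags opened but not yet closed
        self.out = []           # output fragments
        self.done = False       # True once the limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close anything still open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        self.out.append(escape(data, quote=False))

    def result(self):
        # Append closing tags for everything still open, innermost first.
        closing = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html, limit):
    """Return `html` cut to at most `limit` text characters, tags closed."""
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result()


# Hypothetical sample markup, not the asker's actual HTML.
sample = ("<div><p>Paragraph 1 Lorem ipsum</p>"
          "<p>Paragraph 2 Lorem ipsum</p></div>")
print(truncate_html(sample, 30))
# <div><p>Paragraph 1 Lorem ipsum</p><p>Paragra</p></div>
```

For the email preview case, truncate_html(article_html, 1000) would keep roughly the first 1000 characters of readable text, whatever the tag structure looks like (article_html being whatever variable holds the post).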
