Reputation: 637
I have a very long HTML text of the following structure:
<div>
<div>
<p>Paragraph 1 Lorem ipsum dolor... long text... </p>
<p>Paragraph 2 Lorem ipsum dolor... long text... </p>
<p>Paragraph 3 Lorem ipsum dolor... long text... </p>
</div>
</div>
Now, let's say I want to trim the HTML text to just 1000 characters, but I still want the HTML to be valid, that is, close the tags whose closing tags were removed. What can I do to correct the trimmed HTML text using Python? Note that the HTML is not always structured as above.
I need this for an email campaign wherein a preview of the blog is sent but the recipient needs to visit the blog's URL to see the complete article.
Upvotes: 1
Views: 94
Reputation: 28
How about BeautifulSoup? (python-bs4)
from bs4 import BeautifulSoup
test_html = """<div>
<div>
<p>Paragraph 1 Lorem ipsum dolor... long text... </p>
<p>Paragraph 2 Lorem ipsum dolor... long text... </p>
<p>Paragraph 3 Lorem ipsum dolor... long text... </p>
</div>
</div>"""
test_html = test_html[0:50]
soup = BeautifulSoup(test_html, 'html.parser')
print(soup.prettify())
The parser closes any tags left open by the slice when it builds the tree, and .prettify() then prints the repaired HTML.
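One caveat with slicing by character count is that the cut can land in the middle of a tag (for example `</di`), which the parser may then treat as text or drop. Below is a minimal sketch of a helper that backs the cut up to just before the unfinished tag. The name `cut_outside_tag` is hypothetical, and the check assumes that `<` and `>` only appear as tag delimiters, i.e. they are escaped inside text and attribute values.

```python
def cut_outside_tag(html: str, limit: int) -> str:
    """Slice html to at most `limit` characters without ending inside a tag.

    Assumption: '<' and '>' occur only as tag delimiters.
    """
    cut = html[:limit]
    last_open = cut.rfind("<")
    last_close = cut.rfind(">")
    # A '<' that comes after the final '>' means the slice ended inside a tag,
    # so drop everything from that '<' onward.
    if last_open > last_close:
        cut = cut[:last_open]
    return cut
```

The result can be passed to BeautifulSoup in place of the plain `test_html[0:50]` slice.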
Upvotes: 1
Reputation: 442
I can show an example. If it looks like this:
<div>
<p>Long text...</p>
<p>Longer text to be trimmed</p>
</div>
And you have Python code like:
def TrimHTML(HtmlString):
    result = []
    newlinesremaining = 2  # number of leading lines to keep; pick any value
    foundlastpart = False
    for x in HtmlString:  # HtmlString is the HTML to be trimmed
        if newlinesremaining >= 1:
            # still inside the lines we keep: copy the character
            if x == '\n':
                newlinesremaining -= 1
            result.append(x)
        elif not foundlastpart:
            # skip one line, then copy everything after it
            if x == '\n':
                newlinesremaining = float('inf')
                foundlastpart = True
    return ''.join(result)
and you pass the example HTML above to the function, then it returns:
<div>
<p>Long text...</p>
</div>
I couldn't test it in the short time window I have before work.
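The function above depends on each tag sitting on its own line. As an alternative that does not, here is a rough sketch using only the standard library's `html.parser`: it copies tags through, counts only text characters against the limit, and closes whatever is still open at the end. The names `TruncatingParser` and `truncate_html` are hypothetical, and the sketch assumes that whitespace between tags counts toward the limit and that comments and doctype declarations can be dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit  # text characters still allowed
        self.out = []           # output fragments
        self.open_tags = []     # stack of tags not yet closed
        self.done = False       # True once the limit is reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # raw tag as written
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: copy it, nothing to close later
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything down to and including the matching open tag
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        # Entities arrive decoded, so escape them again on output
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True


def truncate_html(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    # Close the tags that were still open when the limit was hit
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<div><p>Hello world</p><p>Second</p></div>", 5))
# prints <div><p>Hello</p></div>
```

Because the limit applies to visible text rather than raw markup, a preview of 1000 characters stays roughly the same length no matter how many tags or attributes the source contains.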
Upvotes: 0