Reputation: 10569
I have paragraphs of user input. However, they always contain leading or trailing <br>, empty <p>, or empty <div> elements, which are meaningless and affect the formatting of the output. How do I strip them in Python cleanly and correctly?
An example of user input is below:
<br><div></div>
<div>Hello <a href="world.html">World!</a>.</div>
<br><br>
<div>Image below:<br>
<img src="abc.jpg" /><br><br></div><p></p>
And the ideal result that I want is:
<div>Hello <a href="world.html">World!</a>.</div>
<br /><br />
<div>Image below:<br />
<img src="abc.jpg" /></div>
Thank you.
Upvotes: 0
Views: 855
Reputation: 15
Try this function:
get_text('', '<br/>')
I have the same problem of text being broken into multiple lines by the '<br/>' tag. This function can at least join the lines into a single line, which removes the effect of the tag. Hope that works!
Upvotes: 0
Reputation: 11862
If I understood you correctly this time around, you could try removing empty tags, that is, tags that have no text:
>>> from bs4 import BeautifulSoup as bs  # BeautifulSoup 4; the old BeautifulSoup module is Python 2 only
>>> tags = bs('<div></div><p></p><div>Test text.</div><p></p>', 'html.parser').find_all()
>>> [tag for tag in tags if tag.text]
[<div>Test text.</div>]
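Filtering on tag.text also drops tags such as <img> and <br>, which have no text of their own, so it may remove more than the question wants. As an alternative, here is a minimal regex-only sketch using just the standard library. It is not a real HTML parser: it assumes simple markup like the example in the question (no ">" inside attribute values, no comments), and the names strip_empty_edges, EMPTY, LEADING, TRAILING, INNER_TRAILING and BR are hypothetical names chosen for illustration.

```python
import re

# One "empty" piece: whitespace, a <br> in any spelling, or a <p>/<div>
# pair containing nothing but whitespace. (Assumption: simple markup only.)
EMPTY = r"(?:\s|<br\s*/?>|<p\b[^>]*>\s*</p>|<div\b[^>]*>\s*</div>)"

LEADING = re.compile(r"^" + EMPTY + r"+", re.IGNORECASE)
TRAILING = re.compile(EMPTY + r"+$", re.IGNORECASE)
# Empty pieces sitting just before the final closing </div> or </p> tag(s).
INNER_TRAILING = re.compile(EMPTY + r"+((?:</(?:div|p)>\s*)+)$", re.IGNORECASE)
BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_empty_edges(html):
    """Strip empty elements from both ends of an HTML fragment."""
    previous = None
    # Repeat until nothing changes, because removing one piece can
    # expose another (for example <div><br></div> becomes <div></div>).
    while previous != html:
        previous = html
        html = LEADING.sub("", html)
        html = TRAILING.sub("", html)
        html = INNER_TRAILING.sub(r"\1", html)
    # Normalise the remaining line breaks to the <br /> spelling.
    return BR.sub("<br />", html)


if __name__ == "__main__":
    sample = (
        '<br><div></div>\n'
        '<div>Hello <a href="world.html">World!</a>.</div>\n'
        '<br><br>\n'
        '<div>Image below:<br>\n'
        '<img src="abc.jpg" /><br><br></div><p></p>'
    )
    print(strip_empty_edges(sample))
```

Empty elements in the middle of the fragment are left alone on purpose, since only the leading and trailing ones were described as a problem. For input that is not this regular, a proper parser is the safer choice.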
Upvotes: 2