hllau

Reputation: 10569

How to easily strip leading or trailing spaces, <br>, empty <div>, empty <p>, and the like in Python?

I have paragraphs entered by users. However, they often start or end with <br>, empty <p>, or empty <div> tags, which carry no meaning and spoil the formatting of the output. How do I strip them in Python cleanly and correctly?

An example of a user input is as below:

<br><div></div>
<div>Hello <a href="world.html">World!</a>.</div>
<br><br>
<div>Image below:<br>
<img src="abc.jpg" /><br><br></div><p></p>

And the ideal result that I want is:

<div>Hello <a href="world.html">World!</a>.</div>
<br /><br />
<div>Image below:<br />
<img src="abc.jpg" /></div>

Thank you.

Upvotes: 0

Views: 855

Answers (2)

Sunflowerandcat

Reputation: 15

Try this function:

get_text('', '<br/>')

I have the same problem: text gets broken into multiple lines by the '<br/>' tag.

This function can at least join the lines into a single line, which removes the effect of the tag. Hope that works!

Upvotes: 0

Eduardo Ivanec

Reputation: 11862

If I understood you correctly this time around, you could try removing empty tags, that is, tags which have no text:

>>> from BeautifulSoup import BeautifulSoup as bs
>>> tags = bs('<div></div><p></p><div>Test text.</div><p></p>').findAll()
>>> [ tag for tag in tags if tag.text ]
[<div>Test text.</div>]
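
Note that the filter above drops empty tags anywhere in the input, while the question only asks about the leading and trailing ones. As a rough sketch of that narrower case, something like the following might work using only the standard `re` module rather than BeautifulSoup. The names `JUNK`, `LEADING`, `TRAILING`, `INNER_TRAILING` and `strip_empty_edges` are made up for illustration, and the patterns assume reasonably simple markup (regular expressions are not a general HTML parser). It also does not rewrite `<br>` as `<br />`.

```python
import re

# One "empty" piece: a whitespace character, a <br>, or an empty <div>/<p> pair.
JUNK = r"(?:\s|<br\s*/?>|<div\b[^>]*>\s*</div>|<p\b[^>]*>\s*</p>)"

LEADING = re.compile(r"\A" + JUNK + r"+", re.IGNORECASE)
TRAILING = re.compile(JUNK + r"+\Z", re.IGNORECASE)
# Whitespace and <br> tags sitting just before the final closing </div> or </p>.
INNER_TRAILING = re.compile(r"(?:\s|<br\s*/?>)+(</(?:div|p)>)\Z", re.IGNORECASE)


def strip_empty_edges(html):
    """Drop empty markup from the start and end of an HTML fragment."""
    html = LEADING.sub("", html)
    html = TRAILING.sub("", html)
    # Also trim <br>/whitespace right before the last closing tag.
    html = INNER_TRAILING.sub(r"\1", html)
    return html


# The sample input from the question.
sample = (
    '<br><div></div>\n'
    '<div>Hello <a href="world.html">World!</a>.</div>\n'
    '<br><br>\n'
    '<div>Image below:<br>\n'
    '<img src="abc.jpg" /><br><br></div><p></p>'
)
cleaned = strip_empty_edges(sample)
print(cleaned)
```

The `<br><br>` between the two non-empty divs is left alone because it is neither at the start nor at the end of the fragment.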

Upvotes: 2
