
Reputation: 13880

How to remove unnecessary tags?

I have a field "body" in my MySQL table, and it contains a lot of entries like:

</p><p>  &nbsp;</p><p>

</p><p> 
   </p><p>

There are a lot of spaces, newlines, &nbsp; entities, etc. How can I remove them?

This does not work:

text.replace('</p><p>&nbsp;</p><p>', '</p><p>')
text.replace('</p><p>\n</p><p>', '</p><p>')

Upvotes: 1

Answers (5)

subiet

Reputation: 1399

What @Jurlie suggested is a good approach. Consider using BeautifulSoup for this purpose. It is a very mature and robust library.

Upvotes: 1

Niek de Klein

Reputation: 8834

text.strip('>&nbsp;').strip(' ').strip('\n').strip('\t')

Upvotes: 0

San4ez

Reputation: 8251

Try this regexp:

>>> import re
>>> text = '''</p><p>  &nbsp;</p><p>
... 
... </p><p> 
...    </p><p>
... '''
>>> re.sub(r'<p>(?:&nbsp;|\s|<br \/>)*?</p>\s*', '', text)
'</p><p>\n'
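
As a minimal sketch, the same pattern can be wrapped in a small function. The name drop_empty_paragraphs is hypothetical, and the pattern assumes paragraphs are written as plain <p>...</p> without attributes. Note that Python strings are immutable, so the result of re.sub (like the result of str.replace) has to be assigned back to a variable:

```python
import re

# Same pattern as above: a <p>...</p> pair that contains only whitespace,
# &nbsp; entities or <br /> tags, plus any whitespace that follows it.
EMPTY_P = re.compile(r'<p>(?:&nbsp;|\s|<br \/>)*?</p>\s*')


def drop_empty_paragraphs(html):
    """Return html with empty <p>...</p> pairs removed (hypothetical helper)."""
    return EMPTY_P.sub('', html)


text = '<p>Hello</p><p>  &nbsp;</p><p>\n\n</p><p> \n   </p><p>World</p>'
# Assign the result; calling the function alone leaves text unchanged.
text = drop_empty_paragraphs(text)
print(text)
```

Paragraphs with real content are left alone because the pattern only allows whitespace, &nbsp; and <br /> between the opening and closing tag.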

Upvotes: 0

Jurlie

Reputation: 1014

I would parse such a file into a syntax tree, remove the empty leaves from it, and then generate the HTML file again. Unfortunately, I don't work in Python, so I can't name helpful libraries for this.
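
A rough sketch of this idea in Python, using only the standard library's html.parser module. Rather than building a full tree, it buffers each <p> element and writes it back out only if something visible was found inside. The class and function names are hypothetical, and treating &nbsp; and <br> as "empty" is an assumption based on the examples in the question:

```python
from html.parser import HTMLParser


class EmptyParagraphStripper(HTMLParser):
    """Re-emit HTML, skipping <p> elements that hold no visible content."""

    def __init__(self):
        # Keep entities such as &nbsp; as separate events instead of text.
        super().__init__(convert_charrefs=False)
        self.out = []             # finished output pieces
        self.buf = None           # pieces of the currently open <p>, or None
        self.has_content = False  # True once the open <p> has visible content

    def _emit(self, piece, visible):
        if self.buf is None:
            self.out.append(piece)
        else:
            self.buf.append(piece)
            self.has_content = self.has_content or visible

    def _flush(self, keep):
        # Write the buffered paragraph to the output only if it is worth keeping.
        if self.buf is not None and keep:
            self.out.extend(self.buf)
        self.buf = None
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag == 'p':
            self._flush(self.has_content)  # an unclosed <p> ends here
            self.buf = [raw]
        else:
            self._emit(raw, visible=(tag != 'br'))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are passed through as written.
        self._emit(self.get_starttag_text(), visible=(tag != 'br'))

    def handle_endtag(self, tag):
        if tag == 'p':
            if self.buf is not None:
                self.buf.append('</p>')
                self._flush(self.has_content)
            # A stray </p> with no open <p> is dropped in this sketch.
        else:
            self._emit('</%s>' % tag, visible=False)

    def handle_data(self, data):
        self._emit(data, visible=bool(data.strip()))

    def handle_entityref(self, name):
        self._emit('&%s;' % name, visible=(name != 'nbsp'))

    def handle_charref(self, name):
        self._emit('&#%s;' % name, visible=True)


def strip_empty_paragraphs(html):
    parser = EmptyParagraphStripper()
    parser.feed(html)
    parser.close()
    parser._flush(parser.has_content)  # handle a <p> left open at the end
    return ''.join(parser.out)


sample = '<div><p>Hello</p><p>  &nbsp;</p><p>\n\n</p><p> <br /> </p><p>World</p></div>'
print(strip_empty_paragraphs(sample))
```

A dedicated HTML library would likely handle malformed markup more gracefully; this sketch only illustrates the parse, filter, and regenerate steps.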

Upvotes: 1

Roman Bodnarchuk

Reputation: 29737

text = ''.join(text.split()) removes all whitespace from the string (including the spaces between words); after that you can continue with your replacements.
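
A small sketch of what this does, with made-up sample text. It also shows a variant that joins with a single space, which is an assumption about what may be preferable when the paragraphs contain real sentences:

```python
text = '<p>Hello world</p><p>  &nbsp;</p><p> \n   </p>'

# Joining with an empty string removes every whitespace character,
# including the space between "Hello" and "world".
no_whitespace = ''.join(text.split())

# Joining with a single space collapses each whitespace run to one space.
collapsed = ' '.join(text.split())

# With the whitespace gone, exact-match replacements become reliable.
cleaned = no_whitespace.replace('<p>&nbsp;</p>', '').replace('<p></p>', '')

print(no_whitespace)
print(collapsed)
print(cleaned)
```

The results are assigned to new names because str.replace and str.join return new strings instead of changing the original.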

Upvotes: 2
