Reputation: 13880
I have a field "body" in my table (MySQL), and there are a lot of entries like:
</p><p> </p><p>
</p><p>
</p><p>
There are a lot of spaces, newlines, &nbsp;, etc. How can I remove them?
This does not work:
text.replace('</p><p> </p><p>', '</p><p>')
text.replace('</p><p>\n</p><p>', '</p><p>')
Upvotes: 1
Views: 203
Reputation: 1399
What @Jurlie suggested is a good approach. Consider using BeautifulSoup for this purpose. It is a very mature and robust library.
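A minimal sketch of how that could look, assuming the bs4 package is installed and that a paragraph counts as "empty" when it contains only whitespace, non-breaking spaces, or <br> tags. The function name strip_empty_paragraphs is just an illustration, not something from the question:

```python
from bs4 import BeautifulSoup


def strip_empty_paragraphs(html_text):
    """Return html_text without <p> elements that hold no visible content."""
    # "html.parser" is the builder bundled with Python, so no extra parser is needed.
    soup = BeautifulSoup(html_text, "html.parser")
    for p in soup.find_all("p"):
        # get_text(strip=True) drops surrounding whitespace, including
        # non-breaking spaces decoded from &nbsp;.
        has_text = p.get_text(strip=True) != ""
        # Any nested tag other than <br> (an image, a link, ...) counts as content.
        has_other_tags = any(tag.name != "br" for tag in p.find_all(True))
        if not has_text and not has_other_tags:
            p.decompose()  # remove the empty paragraph from the tree
    return str(soup)
```

You would call it on each "body" value read from the table and write the result back. Keep in mind that the parser may normalize the markup slightly, for example by writing decoded entities back as plain characters.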
Upvotes: 1
Reputation: 8251
Try this regexp:
>>> import re
>>> text = '''</p><p> </p><p>
...
... </p><p>
... </p><p>
... '''
>>> re.sub(r'<p>(?: |\s|<br \/>)*?</p>\s*', '', text)
'</p><p>\n'
Upvotes: 0
Reputation: 1014
I would parse the HTML into a syntax tree, remove the empty leaves there, and then generate the HTML again. Unfortunately I don't work in Python, so I can't point to specific libraries for this.
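For what it's worth, here is a rough sketch of that idea using only html.parser from the Python standard library. The Node and TreeBuilder classes and the remove_empty_paragraphs function are made-up names for this illustration. It assumes reasonably well-formed markup: stray closing tags are dropped and unclosed elements get closed at the end.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Minimal tree node: the raw start tag text plus a list of children."""

    def __init__(self, tag=None, start_text="", void=False):
        self.tag = tag
        self.start_text = start_text
        self.void = void          # True for tags that never get an end tag
        self.children = []        # Node instances or raw text strings

    def is_empty_paragraph(self):
        # Empty means: a <p> holding only whitespace, &nbsp; or <br> tags.
        if self.tag != "p":
            return False
        for child in self.children:
            if isinstance(child, Node):
                if child.tag != "br":
                    return False
            elif child.replace("&nbsp;", "").strip():
                return False
        return True


class TreeBuilder(HTMLParser):
    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.root = Node()
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.get_starttag_text(), void=tag in VOID_TAGS)
        self.stack[-1].children.append(node)
        if not node.void:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        node = Node(tag, self.get_starttag_text(), void=True)
        self.stack[-1].children.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        self.stack[-1].children.append(data)

    def handle_entityref(self, name):
        self.stack[-1].children.append(f"&{name};")

    def handle_charref(self, name):
        self.stack[-1].children.append(f"&#{name};")

    def handle_comment(self, data):
        self.stack[-1].children.append(f"<!--{data}-->")


def serialize(node):
    """Regenerate HTML from the tree, skipping empty paragraphs."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.is_empty_paragraph():
            continue  # prune the empty leaf
        else:
            parts.append(child.start_text)
            if not child.void:
                parts.append(serialize(child))
                parts.append(f"</{child.tag}>")
    return "".join(parts)


def remove_empty_paragraphs(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return serialize(builder.root)
```

A dedicated HTML library will handle malformed input far better than this; the point is only to show the parse, prune, regenerate steps.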
Upvotes: 1
Reputation: 29737
text = ''.join(text.split())
After that you can continue with your replacements.
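One thing to watch: ''.join(text.split()) removes every whitespace character, including the spaces between words. A possible variant, sketched below under the assumption that collapsing each run of whitespace to a single space is acceptable for your content, joins with ' ' instead and then does the replacements. The name drop_empty_paragraphs is only for illustration:

```python
def drop_empty_paragraphs(text):
    """Collapse whitespace runs to one space, then remove empty <p></p> pairs."""
    # split() with no arguments splits on any run of whitespace (spaces,
    # newlines, tabs), so joining with ' ' leaves exactly one space per run.
    normalized = ' '.join(text.split())
    # After normalizing, an empty paragraph can only look like one of these.
    for empty in ('<p> </p>', '<p></p>'):
        normalized = normalized.replace(empty, '')
    return normalized
```

Note that str.replace returns a new string rather than changing text in place, so the result has to be assigned, as done here. This variant also collapses whitespace inside elements such as <pre>, where it may matter.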
Upvotes: 2