How do I remove linebreaks ONLY if they occur inside html tags?

Question

Sorry, another Python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I would like:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how do I get rid of '\n' only if it occurs inside an HTML tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)', 'replace with what???', my_string)

Which obviously won't work, but I'm stuck.

Hairr · Accepted Answer

Try BeautifulSoup (bs4), which lets you parse HTML and XML documents and work with individual tags.

>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> soup = bs4.BeautifulSoup(my_string, 'html.parser')
>>> p = soup.p.contents[0].replace('\n ', '')
>>> print(p)
this is some fun

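This strips the newline from the text of the p tag. If the content has more than one tag, find_all(True) returns every tag, and a for loop can then rewrite the text directly inside each one. Note that str.replace returns a new string rather than changing the tree, so the result has to be written back with replace_with.

For example:

>>> tags = soup.find_all(True)
>>> for tag in tags:
...    # direct text children only, so nested tags are handled on their own turn
...    for text in tag.find_all(string=True, recursive=False):
...        # str.replace returns a new string, so write it back into the tree
...        text.replace_with(text.replace('\n ', ''))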

This might not work exactly the way you want, since web pages vary, but the code can be adapted to your needs.
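
If installing bs4 is not an option, the same idea can be roughly sketched with the standard library's html.parser instead. This is a different technique from the BeautifulSoup approach above, and only a sketch: NewlineStripper, strip_newlines_in_tags and VOID_TAGS are made-up names for illustration, it assumes every non-void start tag has a matching end tag, and it drops comments and doctype declarations rather than copying them through.

```python
from html.parser import HTMLParser

# Hypothetical helper set: tags that never get a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NewlineStripper(HTMLParser):
    """Rebuild the markup, removing '\\n ' from text that sits inside a tag."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many tags are currently open
        self.parts = []  # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy through, depth unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Only touch text while at least one tag is open.
        if self.depth > 0:
            data = data.replace("\n ", "")
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_newlines_in_tags(markup):
    parser = NewlineStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
print(repr(strip_newlines_in_tags(my_string)))
# -> '<p>this is some fun</p>And this is \n some more fun!'
```

Unlike the regex attempt in the question, this keeps track of whether a tag is open, so the newline after the closing p tag is left alone.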
