Reputation: 3928
Sorry, another Python newbie question. I have a string:
my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
I would like:
my_string = "<p>this is some fun</p>And this is \n some more fun!"
In other words, how do I get rid of '\n' only if it occurs inside an HTML tag?
I have:
my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)
Which obviously won't work, but I'm stuck.
Upvotes: 2
Views: 1872
Reputation: 1108
You should try using BeautifulSoup (bs4), which will let you parse HTML and XML tags and pages.
>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> soup = bs4.BeautifulSoup(my_string, "html.parser")
>>> p = soup.p.contents[0].replace('\n ', '')
>>> print(p)
This gives you the text of the p tag with the newline removed. If the content has more than one tag, you can pass None to find_all to get every tag, then loop over them and rewrite each tag's direct text children (string=True with recursive=False), using replace_with to write the change back into the tree.
For example:
>>> tags = soup.find_all(None)
>>> for tag in tags:
...     for text in tag.find_all(string=True, recursive=False):
...         text.replace_with(text.replace('\n ', ''))
This might not work exactly the way you want (web pages vary), but you can adapt the code to your needs.
Upvotes: 2
Reputation: 213578
Regular expressions are a bad match for HTML. Don't do it. See RegEx match open tags except XHTML self-contained tags.
Instead, use an HTML parser. Python ships with html.parser, or you can use Beautiful Soup or html5lib. All you have to do then is walk the tree and remove line breaks.
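As a rough illustration of "walk the tree and remove line breaks" using only the standard library, here is a minimal sketch built on html.parser. The class name NewlineStripper and the function strip_newlines_in_tags are made up for this example, and it assumes that replacing a newline plus its surrounding whitespace with a single space is what you want. It re-emits the markup as it goes and only touches text that sits inside at least one open tag.

```python
import re
from html.parser import HTMLParser


class NewlineStripper(HTMLParser):
    """Hypothetical helper: re-emit the input, removing line breaks
    only from text that appears inside an open tag."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.depth = 0   # number of currently open tags
        self.parts = []  # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tag such as <br/>: emit it, depth is unchanged
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth = max(self.depth - 1, 0)
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:
            # assumption: a newline and the whitespace around it become one space
            data = re.sub(r"\s*\n\s*", " ", data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_newlines_in_tags(html):
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
print(repr(strip_newlines_in_tags(my_string)))
```

This is a sketch, not a full HTML serializer: comments and doctype declarations are dropped, and void tags written without a slash (a bare <br>) would be counted as open forever. For real pages, Beautiful Soup or html5lib handle those cases for you.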
Upvotes: 5