Reputation: 2033
I want to change this string
<p><b> hello world </b></p>. I am playing <b> python </b>
to:
<bold><bold> hello world </bold></bold>. I am playing <bold> python </bold>
I used:
import re
pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')
print re.sub(pattern, r'<bold>\1</bold>', "<p><b>hello world</b></p>. I am playing <b> python</b>")
It does not output what I want; it fails with "error: unmatched group".
It works in this case:
re.sub(pattern, r'<bold>\1</bold>', "<p>hello world</p>. I am playing <p> python</p>")
<bold>hello world</bold>. I am playing <bold> python</bold>
Upvotes: 0
Views: 5688
Reputation: 47092
The problem is that, in your regexp, the first group is the one inside <p></p> and the second group is the one inside <b></b>. In your substitution, however, you always refer to the first group, and when the match came from the <b></b> alternative, that group did not take part in the match. I offer a couple of solutions.
First,
>>> pattern = re.compile(r'<(p|b)>(.*?)</\1>')
>>> print re.sub(pattern, r'<bold>\2</bold>',
...     "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><b>hello world</b></bold>. I am playing <bold> python</bold>
This will match a given pair of tags. However, as you can see, it would have to be applied twice to the string, because when it matched the <p></p> tags it skipped over the nested <b></b> tags.
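If the nesting depth is not known in advance, one way to "use it twice" is to repeat the substitution until nothing changes. The following is only a sketch, written for Python 3; the helper name replace_nested is made up for illustration. It relies on re.subn, which returns the new string together with the number of replacements made.

```python
import re

# Same pattern as above: \1 makes the closing tag match the opening one.
pattern = re.compile(r'<(p|b)>(.*?)</\1>')

def replace_nested(text):
    """Hypothetical helper: re-apply the pattern until no <p>/<b> pair is left."""
    while True:
        # subn returns (new_string, number_of_substitutions)
        text, count = pattern.subn(r'<bold>\2</bold>', text)
        if count == 0:
            return text

result = replace_nested("<p><b>hello world</b></p>. I am playing <b> python</b>")
print(result)
```

Each pass removes at least one <p> or <b> pair, so the loop stops once none remain.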
Here's the option that I would go with:
>>> pattern = re.compile(r'<(/?)[pb]>')
>>> print re.sub(pattern, r'<\1bold>',
...     "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><bold>hello world</bold></bold>. I am playing <bold> python</bold>
Upvotes: 2
Reputation: 29863
Although I don't recommend using Regex for parsing HTML (there are libraries for that purpose in almost every language), this should work:
text = "<p><b> hello world </b></p>. I am playing <b> python </b>"
import re
pattern1 = re.compile(r'\<p>(.*?)\</p>')
pattern2 = re.compile(r'\<b>(.*?)\</b>')
replaced = re.sub(pattern1, r'<bold>\1</bold>', text)
replaced = re.sub(pattern2, r'<bold>\1</bold>', replaced)
I think the problem you're having comes from how Python handles groups in an alternation: a group in the branch that did not match is None. Test the following and you'll see what I mean:
text = "<p><b> hello world </b></p>. I am playing <b> python </b>"
import re
pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')
for match in pattern.finditer(text):
    # Each match fills only the group of the alternative that matched
    print(match.groups())
You will see the following:
('<b> hello world </b>', None) # Here captured the 1st group
(None, ' python ') # Here the 2nd ;)
Also, take into account that it first matched what is between <p></p>, so it took <b> hello world </b> (something you would like to match too) as the content of the first match. Maybe changing the order of the alternatives in pattern would solve this, but then the opposite could happen (having <b><p> ... </p></b>).
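Since only one of the two groups takes part in each match, one possible workaround for the original pattern is to pass a function to re.sub instead of a replacement string and pick whichever group is not None. This is a sketch for Python 3, and the function name to_bold is just an illustrative choice. Note that it still leaves the nested <b> untouched, for the reason described above.

```python
import re

pattern = re.compile(r'<p>(.*?)</p>|<b>(.*?)</b>')

def to_bold(match):
    # Exactly one of the two groups participates in any given match;
    # the other one is None.
    inner = match.group(1) if match.group(1) is not None else match.group(2)
    return '<bold>%s</bold>' % inner

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"
result = pattern.sub(to_bold, text)
print(result)
```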
I wish I could provide more info, but I'm not very good with regex in Python. C# handles groups differently.
Edit:
I understand you might want to do this with regex for learning or testing purposes, but in production code I would go for another alternative (like the one @Senthil gave you) or just use an HTML parser.
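For what the parser route could look like, here is a rough sketch using the standard library's html.parser module in Python 3. The class name BoldRenamer and the function rename_tags are invented for this example, and the sketch deliberately ignores attributes, comments and other markup, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser

class BoldRenamer(HTMLParser):
    """Hypothetical helper: re-emit the input with <p> and <b> renamed to <bold>."""
    RENAME = {'p': 'bold', 'b': 'bold'}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped in this sketch.
        self.parts.append('<%s>' % self.RENAME.get(tag, tag))

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % self.RENAME.get(tag, tag))

    def handle_data(self, data):
        self.parts.append(data)

def rename_tags(text):
    parser = BoldRenamer()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)

result = rename_tags("<p><b> hello world </b></p>. I am playing <b> python </b>")
print(result)
```

Because the parser tracks tags itself, nesting order (<p><b> or <b><p>) does not matter here.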
Upvotes: 3
Reputation: 56951
If you choose not to use regex, then it is as simple as this:
d = {'<p>':'<bold>','</p>':'</bold>','<b>':'<bold>','</b>':'</bold>'}
s = '<p><b> hello world </b></p>. I am playing <b> python </b>'
for k, v in d.items():
    # Replace each tag literally; order does not matter because the keys do not overlap
    s = s.replace(k, v)
Upvotes: 5