chnet

Reputation: 2033

python regular expression tag

I want to change this string

<p><b> hello world </b></p>. I am playing <b> python </b>

to:

<bold><bold>hello world </bold></bold>, I am playing <bold> python </bold>

I used:

import re 

pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')

print re.sub(pattern, r'<bold>\1</bold>', "<p><b>hello world</b></p>. I am playing <b> python</b>")

It does not produce the output I want; instead it fails with error: unmatched group

It works in this case:

re.sub(pattern, r'<bold>\1</bold>', "<p>hello world</p>. I am playing <p> python</p>")

<bold> hello world </bold>. I am playing <bold> python</bold>

Upvotes: 0

Views: 5688

Answers (3)

Justin Peel

Reputation: 47092

The problem is that, in your regexp, the first group is the one inside <p></p> and the second group is the one inside <b></b>. Your substitution always refers to the first group, but when the match came from the <b></b> alternative, the first group did not take part in the match, hence the error. Here are a couple of solutions.

First, this pattern matches a given pair of tags by using a backreference to the tag name:

>>> pattern = re.compile(r'<(p|b)>(.*?)</\1>')
>>> print re.sub(pattern, r'<bold>\2</bold>', 
                 "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><b>hello world</b></bold>. I am playing <bold> python</bold>

However, as you can see, it would have to be applied twice to this string, because when it matched the <p></p> tags it consumed the nested <b></b> tags as part of the same match.
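A minimal sketch of applying that pattern repeatedly, written for Python 3. The helper name replace_until_stable is made up for illustration. It keeps substituting until a pass changes nothing, so nested tags get picked up on later passes.

```python
import re

# Same pattern as in the answer: a <p> or <b> pair, closed by the same tag name.
PAIR = re.compile(r'<(p|b)>(.*?)</\1>')

def replace_until_stable(text):
    """Hypothetical helper: re-run the substitution until no pair is left."""
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = PAIR.subn(r'<bold>\2</bold>', text)
        if count == 0:
            return text

result = replace_until_stable(
    "<p><b>hello world</b></p>. I am playing <b> python</b>")
print(result)
```

The loop ends because <bold> itself is not matched by <(p|b)>, so each pass leaves fewer <p> and <b> pairs behind.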

Here's the option that I would go with:

>>> pattern = re.compile(r'<(/?)[pb]>')
>>> print re.sub(pattern, r'<\1bold>', 
                 "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><bold>hello world</bold></bold>. I am playing <bold> python</bold>
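If more tags need the same treatment, the idea can be wrapped in a small function. This is only a sketch for Python 3; rename_tags and its parameters are names invented here, and it assumes the tags carry no attributes.

```python
import re

def rename_tags(text, old_tags, new_tag):
    """Hypothetical helper: rename every opening/closing tag in old_tags."""
    # Build an alternation such as (?:p|b); escape names in case they
    # contain regex metacharacters.
    alternation = "|".join(re.escape(tag) for tag in old_tags)
    pattern = re.compile(r"<(/?)(?:%s)>" % alternation)
    # Group 1 is "/" for a closing tag and "" for an opening tag.
    return pattern.sub(lambda m: "<%s%s>" % (m.group(1), new_tag), text)

renamed = rename_tags(
    "<p><b>hello world</b></p>. I am playing <b> python</b>",
    ["p", "b"], "bold")
print(renamed)
```

Because each tag is rewritten on its own, nesting does not matter and a single pass is enough.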

Upvotes: 2

Oscar Mederos

Reputation: 29863

Although I don't recommend using Regex for parsing HTML (there are libraries for that purpose in almost every language), this should work:

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"

import re 

pattern1 = re.compile(r'\<p>(.*?)\</p>')
pattern2 = re.compile(r'\<b>(.*?)\</b>')

replaced = re.sub(pattern1, r'<bold>\1</bold>', text)
replaced = re.sub(pattern2, r'<bold>\1</bold>', replaced)

I think the problem you're having comes from how Python handles groups that did not take part in a match. Run the following and you'll see what I mean:

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"

import re 

pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')

for match in pattern.finditer(text):
  print match.groups()

You will see the following:

('<b> hello world </b>', None) # Here captured the 1st group
(None, ' python ') # Here the 2nd ;)

Also, keep in mind that it matched what is between <p></p> first, so it consumed <b> hello world </b> (something you would like to match too) as part of the first match. Maybe changing the order of the alternatives in the pattern would solve this, but then the opposite nesting (<b><p> ... </p></b>) could fail in the same way.
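One way around the unmatched-group error, sketched here for Python 3, is to pass a function instead of a template so the replacement can pick whichever group actually matched. The name wrap_in_bold is invented for this example. (As far as I know, since Python 3.5 a template like \1 substitutes an empty string for an unmatched group instead of raising, so on newer versions the original code silently drops text rather than failing.)

```python
import re

pattern = re.compile(r'<p>(.*?)</p>|<b>(.*?)</b>')

def wrap_in_bold(match):
    # Exactly one of the two groups took part in the match; the other is None.
    inner = match.group(1) if match.group(1) is not None else match.group(2)
    return '<bold>%s</bold>' % inner

text = "<p><b>hello world</b></p>. I am playing <b> python</b>"
once = pattern.sub(wrap_in_bold, text)
print(once)
```

This removes the error, but the nested <b> inside the <p> match is still left untouched after one pass, which is the limitation described above.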

I wish I could provide more info, but I'm not very experienced with regex in Python. C# handles groups differently.

Edit:
I understand you might want to do this with regex for learning or testing purposes, but in production code I would go for another alternative (like the one @Senthil gave you) or just use an HTML parser.
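For reference, a rough sketch of the parser route using html.parser from the Python 3 standard library. The TagRenamer class is something made up for this example, not an existing API, and it assumes the tags have no attributes (any attributes would be dropped).

```python
from html.parser import HTMLParser

class TagRenamer(HTMLParser):
    """Hypothetical helper: re-emit the markup with some tag names replaced."""

    def __init__(self, mapping):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.mapping = mapping
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: attributes are ignored in this sketch.
        self.parts.append("<%s>" % self.mapping.get(tag, tag))

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % self.mapping.get(tag, tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"
renamer = TagRenamer({"p": "bold", "b": "bold"})
renamer.feed(text)
renamer.close()  # flush any text still buffered by the parser
parsed = "".join(renamer.parts)
print(parsed)
```

It is more code than the regex, but the parser decides what counts as a tag, so nesting order stops being a concern.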

Upvotes: 3

Senthil Kumaran

Reputation: 56951

If you choose not to use regex, then it is as simple as this:

d = {'<p>':'<bold>','</p>':'</bold>','<b>':'<bold>','</b>':'</bold>'}
s = '<p><b> hello world </b></p>. I am playing <b> python </b>'
for k,v in d.items():
    s = s.replace(k,v)
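A variant of the same mapping that does all the replacements in one pass, in case that is preferred. This is just a sketch for Python 3 and it does bring a regex back in, but only as an alternation of the literal keys; tag_pattern and converted are names chosen for this example.

```python
import re

d = {'<p>': '<bold>', '</p>': '</bold>', '<b>': '<bold>', '</b>': '</bold>'}
s = '<p><b> hello world </b></p>. I am playing <b> python </b>'

# Match any of the dictionary keys literally, then look up the replacement.
tag_pattern = re.compile('|'.join(re.escape(k) for k in d))
converted = tag_pattern.sub(lambda m: d[m.group(0)], s)
print(converted)
```

With a single pass, text produced by one replacement is never scanned again, so the result does not depend on the order of the dictionary entries.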

Upvotes: 5
