Reputation: 2033

python regular expression tag

I want to change this string

<p><b>hello world</b></p>. I am playing <b> python</b>

to:

<bold><bold>hello world </bold></bold>, I am playing <bold> python </bold>

I used:

import re 

pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')

print re.sub(pattern, r'<bold>\1</bold>', "<p><b>hello world</b></p>. I am playing <b> python</b>")

It does not output what I want; it fails with "error: unmatched group".

It works in this case:

re.sub(pattern, r'<bold>\1</bold>', "<p>hello world</p>. I am playing <p> python</p>")

<bold>hello world</bold>. I am playing <bold> python</bold>

Upvotes: 0

Answers (3)

Justin Peel

Reputation: 47092

The problem is that, in your regexp, the first group is the one within <p>...</p> and the second group is the one within <b>...</b>. Your substitution always refers to the first group, but when the match came from the <b> alternative, the first group did not take part in the match, so there is nothing for \1 to refer to. I offer a couple of solutions.

First,

>>> pattern = re.compile(r'<(p|b)>(.*?)</\1>')
>>> print re.sub(pattern, r'<bold>\2</bold>', 
                 "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><b>hello world</b></bold>. I am playing <bold> python</bold>

will match a given pair of tags. However, as you can see, it would have to be applied twice to the string, because when it matched the <p> tags it skipped over the nested <b> tags.
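If you do want to stay with the paired-tag pattern, one way to apply it "twice" without hard-coding the number of passes is to repeat the substitution until nothing changes. This is only a minimal sketch in Python 3 syntax; the helper name replace_pairs is made up for illustration, and it assumes the tags carry no attributes.

```python
import re

# Same paired-tag pattern as above: group 1 is the tag name, group 2 the content.
PAIR = re.compile(r'<(p|b)>(.*?)</\1>')

def replace_pairs(text):
    """Hypothetical helper: repeat the substitution until no pair is left."""
    while True:
        # subn returns the new string and the number of substitutions made.
        text, count = PAIR.subn(r'<bold>\2</bold>', text)
        if count == 0:
            return text

result = replace_pairs("<p><b>hello world</b></p>. I am playing <b> python</b>")
print(result)
```

Each pass peels off one level of nesting, so the loop runs once more than the deepest nesting level. Note that <bold> itself is not matched by <(p|b)>, so the loop terminates.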

Here's the option that I would go with:

>>> pattern = re.compile(r'<(/?)[pb]>')
>>> print re.sub(pattern, r'<\1bold>', 
                 "<p><b>hello world</b></p>. I am playing <b> python</b>")
<bold><bold>hello world</bold></bold>. I am playing <bold> python</bold>

Upvotes: 2

Oscar Mederos

Reputation: 29863

Although I don't recommend using Regex for parsing HTML (there are libraries for that purpose in almost every language), this should work:

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"

import re 

pattern1 = re.compile(r'\<p>(.*?)\</p>')
pattern2 = re.compile(r'\<b>(.*?)\</b>')

replaced = re.sub(pattern1, r'<bold>\1</bold>', text)
replaced = re.sub(pattern2, r'<bold>\1</bold>', replaced)

I think the problem you're having comes from how Python handles groups in an alternation. Test the following and you'll see what I mean:

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"

import re 

pattern = re.compile(r'\<p>(.*?)\</p>|\<b>(.*?)\</b>')

for match in pattern.finditer(text):
  print match.groups()

You will see the following:

('<b> hello world </b>', None) # Here captured the 1st group
(None, ' python ') # Here the 2nd ;)

Also, take into account that it first matched what is between <p> and </p>, so it took <b> hello world </b> (something you would like to match too) as the content of the first match. Maybe changing the order of the alternatives in the compiled pattern would solve this, but then the opposite could happen (having <b><p> ... </p></b>).
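To make the group behaviour concrete, here is a small sketch (Python 3 syntax) that keeps the original alternation but passes a function as the replacement, so it can pick whichever group actually matched. The function name to_bold is just an illustration. As far as I know, Python 3.5 and later substitute an empty string for an unmatched group in a replacement template instead of raising, which silently drops the content, so choosing the group explicitly is still worth doing there.

```python
import re

pattern = re.compile(r'<p>(.*?)</p>|<b>(.*?)</b>')

def to_bold(match):
    # Only one of the two groups takes part in any given match; the other is None.
    inner = match.group(1) if match.group(1) is not None else match.group(2)
    return '<bold>%s</bold>' % inner

text = "<p><b> hello world </b></p>. I am playing <b> python </b>"
result = pattern.sub(to_bold, text)
print(result)
# <bold><b> hello world </b></bold>. I am playing <bold> python </bold>
```

This avoids the unmatched-group problem, but the nested <b> inside the <p> is still left untouched, because it was consumed as the content of the first match.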

I wish I could provide more info, but I'm not very good at regex in Python. C# handles groups differently.

Edit:
I understand you might want to do this with regex for learning or testing purposes, but in production code I would go for another alternative (like the one @Senthil gave you) or just use an HTML parser.
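For what the HTML parser route could look like, here is a rough sketch using html.parser from the Python 3 standard library. The class name BoldRenamer and the helper rename_tags are made up for this example. It assumes the tags have no attributes (any attributes would be dropped), and with the parser's default settings character references such as &amp; come out decoded in the output.

```python
from html.parser import HTMLParser

class BoldRenamer(HTMLParser):
    """Hypothetical helper: re-emits the input with <p> and <b> renamed to <bold>."""

    RENAME = {'p': 'bold', 'b': 'bold'}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.parts.append('<%s>' % self.RENAME.get(tag, tag))

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % self.RENAME.get(tag, tag))

    def handle_data(self, data):
        self.parts.append(data)

def rename_tags(text):
    parser = BoldRenamer()
    parser.feed(text)
    parser.close()  # flush any text still buffered after the last tag
    return ''.join(parser.parts)

result = rename_tags("<p><b> hello world </b></p>. I am playing <b> python </b>")
print(result)
```

Because the parser reports each opening and closing tag separately, nesting is handled without any extra passes.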

Upvotes: 3

Senthil Kumaran

Reputation: 56951

If you choose not to use regex, then it is as simple as this:

d = {'<p>':'<bold>','</p>':'</bold>','<b>':'<bold>','</b>':'</bold>'}
s = '<p><b> hello world </b></p>. I am playing <b> python </b>'
for k,v in d.items():
    s = s.replace(k,v)

Upvotes: 5
