Ozgur Vatansever

Reputation: 52183

Python Replace an Undetermined length of Text

I have a string like this:

Hi. My name is _John_. I am _20_ years old.

and I'd like to convert it into this:

Hi. My name is <b>John</b>. I am <b>20</b> years old.

I tried something like this, but no luck:

>>> import re
>>> text = "Hi. My name is _John_. I am _20_ years old."
>>> pattern = "(.*)(\_)(.*)(\_)(.*)"
>>> re.sub(pattern, r'\1<b>\3</b>\5', text)
'Hi. My name is _John_. I am <b>20</b> years old.'

What is wrong with the pattern? Why is it not seeing the first bold text?

Any help would be appreciated. Thanks.

Upvotes: 2

Views: 429

Answers (6)

Srikar Appalaraju

Reputation: 73658

Have you tried using string Templates? They were built for something like this: simple string substitutions. A hell of a lot cleaner and more elegant than using regexes...

import string

new_style = string.Template('Hi. My name is $name. I am $age years old.')
# Template objects do not support the % operator; use substitute() instead
print(new_style.substitute(name='<b>John</b>', age='<b>20</b>'))  # produces what you want

For more on string template examples check this activeState link

Upvotes: 3

Johan Lundberg

Reputation: 27038

The problem is that the first .* in your pattern eats everything to the left of the last possible match. This is why * is said to be greedy. Use a non-greedy pattern instead:

pattern = r'_(.+?)_'
re.sub(pattern, r'<b>\1</b>', text)

? makes the match non-greedy, i.e. as short as possible. + requires at least one character between the two underscores for the text to be replaced with <b>text</b>, so __ will remain __.

If you would like __ to become <b></b> then use .*?
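A minimal sketch of the difference between the two quantifiers. The string with an empty pair of underscores is a made-up input for illustration, not something from the question:

```python
import re

text = "Hi. My name is _John_. I am _20_ years old."
empty = "a __ b"  # hypothetical input with nothing between the underscores

# .+? needs at least one character, so the empty pair is left alone
plus_result = re.sub(r'_(.+?)_', r'<b>\1</b>', empty)

# .*? also accepts zero characters, so the empty pair becomes <b></b>
star_result = re.sub(r'_(.*?)_', r'<b>\1</b>', empty)

# On the question's text both variants give the same answer
question_result = re.sub(r'_(.+?)_', r'<b>\1</b>', text)

print(plus_result)
print(star_result)
print(question_result)
```

Which one to pick probably depends on whether an empty <b></b> is acceptable in the output.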

Upvotes: 3

Burhan Khalid

Reputation: 174662

This sounds remarkably like Markdown syntax, so if your goal is to parse that, a Python library for it already exists.

Upvotes: 1

scessor

Reputation: 16115

Change to:

pattern = "_([^_]*)_"
re.sub(pattern, r'<b>\1</b>', text)

Also see this example.
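A short sketch of why the negated character class works here: [^_]* can never run past an underscore, so no non-greedy qualifier is needed. One difference that may matter is that [^_] also matches a newline, while . does not unless re.DOTALL is set. The two-line string below is an invented input, used only to show that:

```python
import re

text = "Hi. My name is _John_. I am _20_ years old."
pattern = r"_([^_]*)_"

# [^_]* stops at the next underscore, so each pair is replaced separately
result = re.sub(pattern, r"<b>\1</b>", text)

multiline = "_two\nlines_"  # hypothetical input spanning a line break

# . does not match "\n" by default, so the non-greedy dot version finds nothing
dot_result = re.sub(r"_(.*?)_", r"<b>\1</b>", multiline)

# [^_] matches "\n" too, so this version replaces across the line break
class_result = re.sub(pattern, r"<b>\1</b>", multiline)

print(result)
print(repr(dot_result))
print(repr(class_result))
```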

Upvotes: 4

jcollado

Reputation: 40414

The problem is that * is greedy and consumes as many characters as possible (including more _). To fix that, you can use the non-greedy alternative *? as follows:

>>> pattern = r'_(.*?)_'
>>> replacement = r'<b>\1</b>'
>>> re.sub(pattern, replacement, text)
'Hi. My name is <b>John</b>. I am <b>20</b> years old.'

Note that re.sub behaves like re.search instead of re.match. That is, you can use a pattern that just partially matches the input (in this case, just some text surrounded by _) instead of something that matches the whole line.
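To illustrate the search-versus-match point, here is a small sketch using the same pattern and the text from the question. The variable names are only for illustration:

```python
import re

text = "Hi. My name is _John_. I am _20_ years old."
pattern = r'_(.*?)_'

# re.match only tries at the start of the string, and text does not start with "_"
anchored = re.match(pattern, text)

# re.search scans forward and finds the first underscore pair
first = re.search(pattern, text)

# re.findall (like re.sub) visits every non-overlapping match
every = re.findall(pattern, text)

print(anchored)
print(first.group(0))
print(every)
```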

Upvotes: 4

Jakub Roztocil

Reputation: 16252

It's because the pattern is greedy and the first (.*) matches the text from the beginning all the way to the third _:

>>> re.match(pattern, text).groups()
('Hi. My name is _John_. I am ', '_', '20', '_', ' years old.')

Here is a simplified, non-greedy version:

>>> re.sub('_(.+?)_', r'<b>\1</b>', text)
'Hi. My name is <b>John</b>. I am <b>20</b> years old.'

Upvotes: 2
