foosion

Reputation: 7918

python regular expression replace

I'm trying to change a string that contains substrings such as

the</span></p>
<p><span class=font7>currency

to

the currency

The line break is CRLF.

The words before and after the markup vary. I only want to replace when the second word starts with a lower-case letter. The only part of the markup that varies is the digit after 'font'.

I tried:

p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)

but this isn't working

How should I fix this?

Upvotes: 2

Views: 342

Answers (3)

FailedDev

Reputation: 26940

This :

result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)

Applied to :

the</span></p>
<p><span class=font7>currency

Produces :

the currency

That said, I would strongly advise against using regex on XML/HTML/XHTML. This generic regex removes everything from the first tag through the last tag and captures the text before and after in groups 1 and 2.

Upvotes: 1

heltonbiker

Reputation: 27615

I think you should use the flag re.DOTALL, which makes the dot match every character, including the newline character, which it does not match by default.

So, first line of your code would become :

p = re.compile(r'</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)

(note the two unescaped dots in place of the line break).

There is also re.MULTILINE, which changes how ^ and $ match; every time I have a problem like this, one of those two flags ends up solving it.

Hope it helps.
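A small sketch of what the flag changes, assuming a made-up two-character line break between 'a' and 'b' rather than the real markup. Without re.DOTALL the dot matches '\r' but not '\n', so two dots cannot span a CRLF:

```python
import re

# Hypothetical sample: two letters separated by a CRLF line break.
text = "a\r\nb"

# Without DOTALL the second '.' cannot match '\n', so the search fails.
no_flag = re.search(r'a..b', text)

# With DOTALL both dots match, covering '\r' and '\n'.
with_flag = re.search(r'a..b', text, re.DOTALL)

print(no_flag)                # None
print(with_flag is not None)  # True
```

re.MULTILINE would not help here, since it only affects the ^ and $ anchors.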

Upvotes: 1

Peter Graham

Reputation: 11701

Use a lookahead assertion.

p = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)
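
A minimal sketch of why this works, assuming data is a string with CRLF line endings (the sample text below is made up for illustration). The lookahead checks for a lower-case letter without consuming it, so no backreference is needed. The capturing-group version from the question should also work once the replacement is a raw string, r' \1', because ' \1' in a normal string literal is the control character '\x01':

```python
import re

# Hypothetical sample: the first break is followed by a lower-case word,
# the second by a capitalised one, which should be left alone.
data = ("the</span></p>\r\n<p><span class=font7>currency and "
        "The</span></p>\r\n<p><span class=font3>Euro")

# Lookahead: match the markup only when a lower-case letter follows,
# leaving that letter in place.
lookahead = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res_lookahead = lookahead.sub(' ', data)

# Capturing group with a raw-string replacement gives the same result.
group = re.compile(r'</span></p>\r\n<p><span class=font\d>([a-z])')
res_group = group.sub(r' \1', data)

print(repr(res_lookahead))
```

Only the first break is replaced; the one before "Euro" stays because the next letter is upper case.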

Upvotes: 1
