Python regex does not match all the characters wanted

Question

I have some text files generated from PDFs and want to add some XML tags using a small Python script and regex patterns. Mostly it works fine, but sometimes an expression does not match all the characters I want. In the testing tool here it works correctly.

Here's the Python code:

import re

matches = re.finditer("[^<]+", string)
for m in matches:
    # insert text at the position where each match ends
    tagend = m.end()
    string = string[:tagend] + "" + string[tagend:]

The original string...

1. Regierungserklärung des MinisterpräsidentenMinisterpräsident Winfried Kretschmann

... should be transformed to:

1. Regierungserklärung des MinisterpräsidentenMinisterpräsident Winfried Kretschmann

but it returns

1. Regierungserklärung des MinisterpräsidentenMinisterpräsident Winfried Kretschmann

instead.

I would be glad to get a reply to that question. Jan

Felipe · Accepted Answer

I tested it using re.sub() and the result seems to be right.

 # coding: utf-8
 import re

 text = "1. Regierungserklärung des MinisterpräsidentenMinisterpräsident Winfried Kretschmann "
 # Raw strings keep the backslash in \g<1> intact; the closing tag is appended after the match.
 print(re.sub(r"([^<]+)", r"\g<1></UTop>", text))

As you said, the regex testing tool works properly too: here
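A possible explanation for the difference, offered as a guess rather than a confirmed diagnosis: re.finditer() reports offsets in the string it was originally given, but the loop inserts text into the rebuilt string, so every insertion shifts the positions that come after it. The sketch below shows that effect and two ways around it. The sample text, the `<UTop>` pattern, and the function names are assumptions made up for illustration, not taken from the original files.

```python
import re

# Hypothetical sample: tag name, pattern, and text are assumptions for illustration.
text = "<UTop>first heading<p>body</p><UTop>second heading<p>more</p>"
pattern = r"<UTop>[^<]+"
closing = "</UTop>"


def close_tags_shifting(s):
    """Offsets come from the original string but are applied to the modified one."""
    for m in re.finditer(pattern, s):
        end = m.end()
        s = s[:end] + closing + s[end:]
    return s


def close_tags_sub(s):
    """Let re.sub() do the insertion, so no manual offsets are needed."""
    return re.sub(r"(<UTop>[^<]+)", r"\g<1></UTop>", s)


def close_tags_reversed(s):
    """Insert from the last match to the first, so earlier offsets stay valid."""
    for m in reversed(list(re.finditer(pattern, s))):
        s = s[:m.end()] + closing + s[m.end():]
    return s


print(close_tags_shifting(text))  # second closing tag lands too early
print(close_tags_sub(text))
print(close_tags_reversed(text))
```

In the first function the first closing tag is placed correctly, but the second one ends up len("</UTop>") characters too early, which would look like the regex not matching all the characters. That may be why a testing tool, which only shows the matches, reports the right result.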
