Jan Seipel

Reputation: 119

Python regex does not match all the characters wanted

I have some text files generated from PDFs and want to add some XML tags using a small Python script and regex patterns. Mostly it works fine, but sometimes an expression does not match all the characters wanted. In an online regex testing tool, the same pattern works correctly.

Here's the Python code:

matchs = re.finditer("<UTop>[^<]+", string)
for m in matchs:
    tagend = m.end()
    string = string[:tagend] + "</UTop>" + string[tagend:]

The original string...

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>

... should be transformed to:

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </UTop></Top>

but it returns

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Krets</UTop>chmann </Top>

instead.

I would be glad to get a reply to that question. Jan

Upvotes: 0

Views: 66

Answers (2)

Felipe

Reputation: 213

I tested it using re.sub() and the result seems to be right.

 # coding: utf-8
 import re

 text = "<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>"
 # \g<1> re-inserts the matched text; the closing tag is appended right after it
 print(re.sub(r"(<UTop>[^<]+)", r"\g<1></UTop>", text))

As you said, the regex testing tools work properly too.
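
One possible explanation for the output in the question, offered as an assumption since the full input is not shown: if the real string contains more than one <UTop> tag, every m.end() returned by re.finditer refers to the original string, while each insertion makes the working copy 7 characters longer. The sketch below reproduces that shift with a made-up two-tag string and shows that re.sub() avoids it. The helper names and the sample string are hypothetical.

```python
import re

PATTERN = r"<UTop>[^<]+"
CLOSE = "</UTop>"

# Hypothetical input with two <UTop> tags (the question shows only one line).
sample = (
    "<Top>A<UTop>B </Top>"
    "<Top>1. Regierungserklärung des Ministerpräsidenten"
    "<UTop>Ministerpräsident Winfried Kretschmann </Top>"
)

def close_with_finditer(text):
    # Mirrors the loop from the question: the offsets come from the original
    # string, but the insertions are applied to the growing copy.
    for m in re.finditer(PATTERN, text):
        end = m.end()
        text = text[:end] + CLOSE + text[end:]
    return text

def close_with_sub(text):
    # re.sub builds the result in one pass, so no offsets can go stale.
    return re.sub("(" + PATTERN + ")", r"\g<1>" + CLOSE, text)

shifted = close_with_finditer(sample)
fixed = close_with_sub(sample)
print(shifted)
print(fixed)
```

If this is what happens, the second closing tag lands 7 characters too early, which would match the "Krets</UTop>chmann" result in the question.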

Upvotes: 1

Jan

Reputation: 43169

Use the Unicode flag:

matchs = re.finditer("<UTop>[^<]+",string,re.UNICODE)

For HTML consider using BeautifulSoup instead.
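
A quick way to check whether the flag makes a difference on a given interpreter, as a rough sketch that assumes Python 3 (the question does not state the version). There, str patterns are Unicode-aware by default, so the two searches below are expected to return the same text. On Python 2 the result may differ.

```python
import re

text = (
    "<Top>1. Regierungserklärung des Ministerpräsidenten"
    "<UTop>Ministerpräsident Winfried Kretschmann </Top>"
)

# Same pattern, once without and once with the explicit flag.
plain = re.search("<UTop>[^<]+", text)
flagged = re.search("<UTop>[^<]+", text, re.UNICODE)

# On Python 3, a compiled str pattern already carries re.UNICODE.
default_is_unicode = bool(re.compile("<UTop>[^<]+").flags & re.UNICODE)

print(plain.group(0) == flagged.group(0))
print(default_is_unicode)
```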

Upvotes: 1
