Takkun

Reputation: 6371

How to remove the marker characters from this regex match?

import re
str = "<test>0</test>"
print(re.search("<.*?>", str).group())
print(re.search(">.*?<", str).group())
>> <test>
>> >0<

How can I get just "test" and "0" as the results, without the two characters I used as markers in the regex?

Upvotes: 0

Views: 115

Answers (2)

Andrew Clark

Reputation: 208665

You shouldn't be using regex to parse XML/HTML; see murgatroid99's comment.
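
For comparison, here is a minimal sketch of the parser route, assuming the input is well-formed XML and that the standard library's xml.etree.ElementTree is acceptable (the names root, tag_name and text are just for illustration):

```python
import xml.etree.ElementTree as ET

s = "<test>0</test>"

# Parse the string into an Element; this raises ParseError on malformed XML.
root = ET.fromstring(s)

tag_name = root.tag   # name of the element, without the angle brackets
text = root.text      # text content between the opening and closing tags

print(tag_name)  # test
print(text)      # 0
```

This only works when the whole string is a single well-formed element, which is true for the example but may not hold for arbitrary HTML.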

That being said, here is how you can get the results you want for this example using regex. Use a capturing group:

>>> import re
>>> s = "<test>0</test>"
>>> print(re.search(r"<(.*?)>", s).group(1))
test
>>> print(re.search(r">(.*?)<", s).group(1))
0

Note that you shouldn't use str as a variable name, as it will mask the built-in type.
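
A small sketch of what the masking looks like in practice. The function demo_shadowing is hypothetical and exists only to keep the bad name local:

```python
def demo_shadowing():
    # Inside this function, the name "str" now refers to a string object,
    # not the built-in type.
    str = "<test>0</test>"
    try:
        str(42)  # attempts to call the string object
    except TypeError as exc:
        return type(exc).__name__
    return None

result = demo_shadowing()
print(result)  # TypeError
```

Outside the function the built-in str is unaffected, but at module level the same assignment would mask it for the rest of the script.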

An alternative to a capturing group would be a lookbehind and lookahead:

>>> print(re.search(r"(?<=<).*?(?=>)", s).group())
test
>>> print(re.search(r"(?<=>).*?(?=<)", s).group())
0

Using raw string literals (r"...") isn't necessary for these in particular, but it is good to get into the habit of using them when writing regular expressions to make sure that backslashes are handled properly.
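
To illustrate where the raw string does matter, here is a sketch using \b, which is not part of the patterns above. The sample text is made up for the demonstration:

```python
import re

text = "cat concat"

# In a normal string literal, "\b" is a backspace character, so the regex
# engine looks for literal backspaces around "cat" and finds none.
plain = re.findall("\bcat\b", text)

# In a raw string, the backslash reaches the regex engine intact and
# \b means "word boundary".
raw = re.findall(r"\bcat\b", text)

print(plain)  # []
print(raw)    # ['cat']
```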

Upvotes: 4

Aaron

Reputation: 846

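Place the text you want in a capturing group, then use re.sub to replace the whole match with a backreference to that group.

By the way, you can do this in one regex:

"<([^>]*)>"

I didn't test it, but it should work: just replace the match with the backreference (\1). Note that Python's re module uses unescaped parentheses for groups; \( and \) match literal parentheses.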
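
A minimal sketch of the substitution idea, assuming the pattern is written in Python's re syntax, where unescaped parentheses form the group and the * sits inside it so the whole tag name is captured:

```python
import re

s = "<test>0</test>"

# Group 1 captures everything between "<" and ">"; the replacement r"\1"
# keeps only that captured text, dropping the angle brackets.
stripped = re.sub(r"<([^>]*)>", r"\1", s)

print(stripped)  # test0/test
```

Since re.sub rewrites every tag in place, the result is one joined string rather than separate "test" and "0" values, so a capturing group with re.search is the better fit for the question as asked.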

Edit: My apologies, I didn't realise you wanted the text in the tag too.

Upvotes: 0
