Reputation: 6371
import re
str = "<test>0</test>"
print re.search("<.*?>", str).group()
print re.search(">.*?<", str).group()
>> <test>
>> >0<
How can I get just "test" and "0" as the result, without the two characters I used as markers in the regex?
Upvotes: 0
Views: 115
Reputation: 208665
You shouldn't be using regex to parse XML/HTML; see murgatroid99's comment.
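For comparison, here is a minimal sketch of the parser route, assuming the input is well-formed XML and that the standard library's xml.etree.ElementTree is acceptable. The variable names tag_name and content are just for illustration.

```python
import xml.etree.ElementTree as ET

s = "<test>0</test>"

# Parse the string into an Element; this raises ParseError on malformed input.
root = ET.fromstring(s)

tag_name = root.tag    # the element name, without the angle brackets
content = root.text    # the text between the opening and closing tags

print(tag_name)
print(content)
```

This avoids having to think about delimiters at all, at the cost of requiring input that actually parses as XML.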
That being said, here is how you can get the results you want for this example using regex. Use a capturing group:
>>> import re
>>> s = "<test>0</test>"
>>> print re.search(r"<(.*?)>", s).group(1)
test
>>> print re.search(r">(.*?)<", s).group(1)
0
Note that you shouldn't use str as a variable name, as it will mask the built-in type.
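To see why the masking matters, here is a small sketch. The function shadowing_demo is a hypothetical helper; it exists only to keep the shadowing local so the rest of the session is unaffected.

```python
def shadowing_demo():
    # Inside this function, the name "str" now refers to a string object,
    # not to the built-in type.
    str = "<test>0</test>"
    try:
        str(42)  # would normally convert 42 to "42"
    except TypeError as exc:
        return exc


error = shadowing_demo()
print(type(error).__name__)
```

The call fails because a string object is not callable, which is the kind of confusing error that reusing a built-in name tends to produce later on.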
An alternative to a capturing group would be a lookbehind and lookahead:
>>> print re.search(r"(?<=<).*?(?=>)", s).group()
test
>>> print re.search(r"(?<=>).*?(?=<)", s).group()
0
Using raw string literals (r"...") isn't necessary for these in particular, but it is good to get into the habit of using them when writing regular expressions to make sure that backslashes are handled properly.
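A case where the raw string does make a difference, as a rough illustration: in an ordinary string literal "\b" is the backspace character, while in a raw string it reaches the regex engine as the word-boundary escape. The sample text is made up for the example.

```python
import re

text = "a cat sat"

# Ordinary literal: "\b" becomes a backspace character before re ever sees it,
# so the pattern looks for backspace + "cat" + backspace.
plain = re.search("\bcat\b", text)

# Raw literal: the backslash is preserved and re treats \b as a word boundary.
raw = re.search(r"\bcat\b", text)

print(plain)
print(raw.group())
```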
Upvotes: 4
Reputation: 846
You should put the text you want in a capturing group; you can then use re.sub to replace the whole match with a backreference to that group.
By the way, you can do this in one regex:
"<([^>]*)>"
I didn't test it, but it should work; just replace the match with the backreference (\1).
Edit: my apologies, I didn't realise you wanted the text in the tag too.
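A small sketch of this idea using the standard re module, assuming the capturing parentheses are written unescaped with the * inside the group, as Python's regex syntax requires. The second group for the text between the tags is an addition for illustration, not part of the pattern above.

```python
import re

s = "<test>0</test>"

# One pattern, two groups: the tag name and the text that follows it.
match = re.search(r"<([^>]*)>([^<]*)<", s)
tag_name, content = match.groups()

# re.sub with a backreference: every <...> is replaced by what was inside it.
stripped = re.sub(r"<([^>]*)>", r"\1", s)

print(tag_name)
print(content)
print(stripped)
```

Note that the substitution also rewrites the closing tag, so it strips the angle brackets everywhere rather than extracting a single value.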
Upvotes: 0