Reputation: 11
For example, I have markup like this:
<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>
What I want is to turn it into this:
<tag1 blablablah>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext</tag1>
I am using this regex for searching (it works in Notepad++ and with Python's re.compile too):
(<tag1[^>]*>.*?)(<[^>]*>.*?)(.*?</tag1>)
And this for replacing (it works with re.sub too):
\1<XXX>\2</XXX>\3
But it finds and changes only the first occurrence, not all of them:
<tag1 blablablah>sometext<XXX><i></XXX>sometext</i>sometext<i>sometext</i>sometext</tag1>
Can anyone help me with this?
Upvotes: 1
Views: 181
Reputation: 2548
Try changing your pattern like this:
(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)
Upvotes: 0
Reputation: 1538
The problem is avoiding the first and the last tags. If you split them up, then it's pretty simple:
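import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
# Split at the end of the opening outer tag and the start of the closing outer tag
start, end = s.find('>') + 1, s.rfind('<')
s_list = [s[:start], s[start:end], s[end:]]
# Wrap every tag in the middle part only
s_list[1] = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', s_list[1])
print(''.join(s_list))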
It's not a one-liner, though.
Alternatively, you can do this:
print(re.sub(r'([^(^<)])(<[^>]*>(?!$))', r'\1<XXX>\2</XXX>', s))
Note that this only works if your outermost tags are at the start and the end of the string.
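If the outer tags are not at the very start and end of the string, one possible way around that restriction is to pass a function to re.sub: match each outer element first, then wrap the tags inside its body. This is only a sketch, not a general HTML/XML parser. The names wrap_inner_tags, outer and wrapper are made up for illustration, and it assumes the outer element is not nested inside itself.

```python
import re

def wrap_inner_tags(text, outer='tag1', wrapper='XXX'):
    """Wrap every tag found inside each <outer ...>...</outer> element."""
    name = re.escape(outer)
    # Group 1: opening outer tag, group 2: body, group 3: closing outer tag
    outer_re = re.compile(rf'(<{name}\b[^>]*>)(.*?)(</{name}>)', re.DOTALL)
    inner_re = re.compile(r'<[^>]*>')

    def fix(match):
        opening, body, closing = match.groups()
        # Only the body is rewritten; the outer tags are left untouched
        body = inner_re.sub(
            lambda m: f'<{wrapper}>{m.group(0)}</{wrapper}>', body)
        return opening + body + closing

    return outer_re.sub(fix, text)

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
result = wrap_inner_tags(s)
print(result)
```

Because the inner substitution runs on the body alone, it replaces every inner tag rather than just the first one, and text around the outer element is passed through unchanged.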
Upvotes: 0
Reputation: 11182
Try this:
<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>
Explanation
"
< # Match the character “<” literally
( # Match the regular expression below and capture its match into backreference number 1
(?: # Match the regular expression below
[a-z] # Match a single character in the range between “a” and “z”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
: # Match the character “:” literally
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[a-z] # Match a single character in the range between “a” and “z”
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
[^<>] # Match a single character NOT present in the list “<>”
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
> # Match the character “>” literally
( # Match the regular expression below and capture its match into backreference number 2
. # Match any single character that is not a line break character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
</ # Match the characters “</” literally
\1 # Match the same text as most recently matched by capturing group number 1
> # Match the character “>” literally
"
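A rough sketch of how this pattern might be applied in Python: it only isolates the content between the outer tags (group 2), so a second substitution is still needed to wrap the inner tags. The second step below is borrowed from the split-based answer and is an assumption about how the two would be combined. Note that, as written, the pattern needs at least one character after the tag name, so a bare <tag1> without attributes would not match.

```python
import re

# Pattern from the answer: group 1 is the tag name, group 2 the inner content
pattern = re.compile(r'<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>')

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'

m = pattern.search(s)
if m is None:
    raise ValueError('no outer element found')

# Wrap every tag inside the captured inner content
wrapped = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', m.group(2))

# Splice the rewritten content back between the untouched outer tags
result = s[:m.start(2)] + wrapped + s[m.end(2):]
print(result)
```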
Upvotes: 2