user1421819

Reputation: 11

How to catch all tags inside a specific tag with regex?

For example, given markup like this:

<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>

I want to wrap every tag inside tag1, so that it becomes:

<tag1 blablablah>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext</tag1>

I am using this regex for searching (it works in Notepad++ and with Python's re.compile too):

(<tag1[^>]*>.*?)(<[^>]*>.*?)(.*?</tag1>)

And this for replacing (it works with re.sub too):

\1<XXX>\2</XXX>\3

But it finds and changes only the first occurrence, not all of them:

<tag1 blablablah>sometext<XXX><i></XXX>sometext</i>sometext<i>sometext</i>sometext</tag1>

Can anyone help me with this?

Upvotes: 1

Views: 181

Answers (3)

Darshana

Reputation: 2548

Try changing your pattern like this:

(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)

Upvotes: 0

Kiet Tran

Reputation: 1538

The problem is avoiding the first and the last tags. If you split them up, then it's pretty simple:

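import re
s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
# Split into the opening tag, the inner content, and the closing tag.
start, end = s.find('>') + 1, s.rfind('<')
s_list = [s[:start], s[start:end], s[end:]]
# Wrap every tag found in the inner content only.
s_list[1] = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', s_list[1])
print(''.join(s_list))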

It's not a one-liner, though.

Alternatively, you can do this:

print(re.sub(r'([^<])(<[^>]*>(?!$))', r'\1<XXX>\2</XXX>', s))

Note that this only works if your outermost tags are at the start and the end of the string.
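If the outer tag can appear anywhere in the string, one possible variation is to pass a function to re.sub: match each tag1 element as a whole, then wrap the tags in its content with a second substitution. This is only a sketch. The names OUTER, INNER, wrap_inner_tags and wrap_all are made up for illustration, and it assumes tag1 elements are not nested inside each other.

```python
import re

# Hypothetical names for illustration; "tag1" and "XXX" come from the question.
# Assumes tag1 elements are not nested inside one another.
OUTER = re.compile(r'(<tag1[^>]*>)(.*?)(</tag1>)', re.DOTALL)
INNER = re.compile(r'<[^>]*>')

def wrap_inner_tags(match):
    # Keep the opening and closing tag1 untouched, wrap every tag in between.
    opening, body, closing = match.groups()
    return opening + INNER.sub(r'<XXX>\g<0></XXX>', body) + closing

def wrap_all(text):
    # re.sub calls wrap_inner_tags once per tag1 element found in text.
    return OUTER.sub(wrap_inner_tags, text)

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
print(wrap_all(s))
```

Tags outside of tag1 are left alone, because the inner substitution only ever sees the text captured between the opening and closing tag.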

Upvotes: 0

Cylian

Reputation: 11182

Try this pattern:

<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>

Explanation

"
<              # Match the character “<” literally
(              # Match the regular expression below and capture its match into backreference number 1
   (?:            # Match the regular expression below
      [a-z]          # Match a single character in the range between “a” and “z”
         +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      :              # Match the character “:” literally
   )?             # Between zero and one times, as many times as possible, giving back as needed (greedy)
   [a-z]          # Match a single character in the range between “a” and “z”
   \w             # Match a single character that is a “word character” (letters, digits, and underscores)
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary
[^<>]          # Match a single character NOT present in the list “<>”
   +?             # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
>              # Match the character “>” literally
(              # Match the regular expression below and capture its match into backreference number 2
   .              # Match any single character that is not a line break character
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
</             # Match the characters “</” literally
\1             # Match the same text as most recently matched by capturing group number 1
>              # Match the character “>” literally
"
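The pattern above only captures the element's name and content; it does not do the wrapping by itself. A rough sketch of how it might be combined with re.sub in Python follows. The names ELEMENT and wrap_content_tags are hypothetical, and the sketch assumes the input is on a single line, since "." does not match line breaks here.

```python
import re

# The pattern from this answer: group 1 is the tag name, group 2 the content.
ELEMENT = re.compile(r'<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>')

def wrap_content_tags(text, wrapper='XXX'):
    # Hypothetical helper: wraps every tag found in group 2 of each match.
    def repl(m):
        inner = re.sub(
            r'<[^>]*>',
            lambda t: '<%s>%s</%s>' % (wrapper, t.group(0), wrapper),
            m.group(2),
        )
        # Rebuild the element: original opening tag, new content, closing tag.
        return m.string[m.start():m.start(2)] + inner + m.string[m.end(2):m.end()]
    return ELEMENT.sub(repl, text)

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
print(wrap_content_tags(s))
```

Note that this pattern is not tied to tag1: it matches any element whose name has at least two characters and whose opening tag carries at least one character after the name, and because ".+" is greedy the content runs up to the last matching closing tag on the line.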

Upvotes: 2
