Kristen

Reputation: 4291

Remove inner nested string with RegEx

I have a string that is formed from tag-substitution, which also results in parts of the string being marked for deletion, for example:

Keep1
{/*DELETE}
Delete1a
    {/*DELETE}
    Delete2
    {DELETE*/}
Delete1b
{DELETE*/}
Keep2
{/*DELETE}
Delete3
{DELETE*/}
Keep3

Am I correct that a RegEx cannot be used to select only the innermost blocks (Delete2 and Delete3), remove those, and then repeat to remove Delete1a/b, until no further matches are found?

The RegEx I am passing to my replace function is

\{\/\*DELETE\}([\s\S]*?)\{DELETE\*\/\}

This matches

{/*DELETE}
Delete1a
    {/*DELETE}
    Delete2
    {DELETE*/}

If this is the only RegEx match that I can make, I could [suppress the leading {/*DELETE} and] call the replace function recursively, which, I think, would enable me to remove the nested {TAGS}.

Is there a better way?

I am using the RegEx in VBScript

EDIT: In case it helps I can change the {/*DELETE} and {DELETE*/} tags, even to a single character

EDIT2: I could use a single-character as the Start/End delete marker - if, for example, that would be faster for a RegEx expression to resolve e.g. by being less complex

e.g. if the Start-Delete is [ and then end delete is ]

Keep1
[
Delete1a
    [
    Delete2
    ]
Delete1b
]
Keep2
[
Delete3
]
Keep3

These characters were chosen for appearance in this example. In practice they would occur within my real-world data, but I expect I could choose two ASCII values which do not appear in my data at all.

Clarification: The {DELETE} tags will not always appear on a line by themselves, so this style of string formation will also exist:

Keep1{/*DELETE}Delete1a
    {/*DELETE}Delete2{DELETE*/}
Delete1b{DELETE*/}Keep2a
Keep2b{/*DELETE}Delete3{DELETE*/}Keep3

or with single-character delete-tags:

Keep1[Delete1a
    [Delete2]
Delete1b]Keep2a
Keep2b[Delete3]Keep3

Upvotes: 2

Views: 216

Answers (1)

Wiktor Stribiżew

Reputation: 626861

Multicharacter delimiters

If your delimiters are multicharacter tags, you may use a tempered greedy token:

\{\/\*DELETE}((?:(?!\{\/\*DELETE})[\s\S])*?)\{DELETE\*\/}
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

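The marked part matches any char, 0+ times, as long as that char is not the starting point of a {/*DELETE} char sequence, so each match is an innermost block with no nested opening tag inside it. Run this regex replace repeatedly until nothing matches any more; see the Iteration 1 and Iteration 2 demos.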
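A minimal sketch of that repeat-until-stable loop, written in Python only for illustration (the pattern is pasted unchanged; `strip_deleted` and `sample` are hypothetical names, and the sample is the inline-tag string from the question). In VBScript the same idea would presumably be a `Do While regExp.Test(s)` loop around `regExp.Replace` with `Global = True`.

```python
import re

# The tempered greedy token pattern from above, unchanged.
INNERMOST = re.compile(r"\{\/\*DELETE}((?:(?!\{\/\*DELETE})[\s\S])*?)\{DELETE\*\/}")

def strip_deleted(text):
    """Remove innermost DELETE blocks pass by pass until a pass removes nothing."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

sample = (
    "Keep1{/*DELETE}Delete1a\n"
    "    {/*DELETE}Delete2{DELETE*/}\n"
    "Delete1b{DELETE*/}Keep2a\n"
    "Keep2b{/*DELETE}Delete3{DELETE*/}Keep3"
)

print(strip_deleted(sample))
# Keep1Keep2a
# Keep2bKeep3
```

The first pass removes the Delete2 and Delete3 blocks, the second removes the now-flat Delete1a/b block, and the third finds nothing and ends the loop.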

NOTE that if you have these delimiters inside comments or string literals, this won't work correctly.

To make it safe, you may define that the delimiting tags only appear as single entities on a line:

^\s*\{\/\*DELETE}(\s*(?:\r?\n(?!\s*\{(?:\/\*DELETE|DELETE\*\/)}).*)*)\r?\n\s*\{DELETE\*\/}\s*$

See Iteration 1 and Iteration 2 demos (here, you will need to set regExp.Multiline = True so that ^ and $ match at line boundaries).
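The line-anchored variant can be tried the same way. This is only a sketch in Python, where `re.MULTILINE` plays the role of `Multiline = True`; `strip_line_blocks` and `sample` are hypothetical names, and the sample is the first example from the question with every tag on its own line.

```python
import re

# Line-anchored pattern from above, unchanged; MULTILINE lets ^ and $ match per line.
LINE_BLOCK = re.compile(
    r"^\s*\{\/\*DELETE}(\s*(?:\r?\n(?!\s*\{(?:\/\*DELETE|DELETE\*\/)}).*)*)\r?\n\s*\{DELETE\*\/}\s*$",
    re.MULTILINE,
)

def strip_line_blocks(text):
    """Remove innermost tag-on-its-own-line blocks until a pass removes nothing."""
    while True:
        text, count = LINE_BLOCK.subn("", text)
        if count == 0:
            return text

sample = "\n".join([
    "Keep1",
    "{/*DELETE}",
    "Delete1a",
    "    {/*DELETE}",
    "    Delete2",
    "    {DELETE*/}",
    "Delete1b",
    "{DELETE*/}",
    "Keep2",
    "{/*DELETE}",
    "Delete3",
    "{DELETE*/}",
    "Keep3",
])

print(strip_line_blocks(sample))
```

With this sample, the line break in front of each removed block is not part of the match, so an empty line is left between Keep1, Keep2 and Keep3; strip those afterwards if they matter.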

Single char delimiters

This is by far the easiest scenario: match the starting delimiter char, then any 0+ chars other than the starting and ending delimiter chars using a negated character class, and then the ending delimiter char.

If the starting delimiter char is [ and the ending delimiter char is ], the regex is the well-known

\[[^\][]*\]

See the regex demo: Iteration 1 and Iteration 2.
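To see what each iteration does, here is a small Python sketch (illustration only; `passes` and `sample` are hypothetical names) that records the string after every replace pass, using the single-character sample from the question.

```python
import re

# Negated character class pattern from above, unchanged.
INNERMOST = re.compile(r"\[[^\][]*\]")

def passes(text):
    """Return the text after each replace pass, stopping once nothing matches."""
    results = []
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return results
        results.append(text)

sample = (
    "Keep1[Delete1a\n"
    "    [Delete2]\n"
    "Delete1b]Keep2a\n"
    "Keep2b[Delete3]Keep3"
)

for number, result in enumerate(passes(sample), start=1):
    print(f"Iteration {number}: {result!r}")
# Iteration 1: 'Keep1[Delete1a\n    \nDelete1b]Keep2a\nKeep2bKeep3'
# Iteration 2: 'Keep1Keep2a\nKeep2bKeep3'
```

The number of passes equals the deepest nesting level, since every pass peels off one level across the whole string.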

Note that [ and ] usually are part of the data you need, so you may want to use some fancier paired characters, like ⦅ (U+2985 LEFT WHITE PARENTHESIS) and ⦆ (U+2986 RIGHT WHITE PARENTHESIS):

\u2985[^\u2985\u2986]*\u2986

See another regex demo.
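Since the pattern has the same shape for any pair of single-character delimiters, it can be built from the two chars. A hedged Python sketch follows (`make_stripper` is a hypothetical helper, not part of any library); `re.escape` is applied so that delimiters such as [ or ^ stay literal both outside and inside the character class.

```python
import re

def make_stripper(open_ch, close_ch):
    """Build a function that removes nested open_ch...close_ch blocks."""
    o, c = re.escape(open_ch), re.escape(close_ch)
    # open, then any chars that are neither delimiter, then close
    innermost = re.compile(f"{o}[^{o}{c}]*{c}")

    def strip(text):
        # Peel off one nesting level per pass until nothing matches.
        while True:
            text, count = innermost.subn("", text)
            if count == 0:
                return text

    return strip

strip_white_parens = make_stripper("\u2985", "\u2986")
print(strip_white_parens("Keep1\u2985Delete1a \u2985Delete2\u2986 Delete1b\u2986Keep2"))
# Keep1Keep2
```

The sketch assumes both delimiters are single characters that differ from each other and never occur in the data to keep.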

Upvotes: 2
