ha9u63a7

Reputation: 6814

Capturing a regex group nested/enclosed by special character

I am trying to match words that are enclosed in tilde (~) characters.

 e.g. ~albert~ is a ~good~ boy.

I know that this is possible with ~.+?~, and it already works for me. But there are special cases where I need to match a nested tilde sentence.

 e.g. ~The ~spectacle~~ was ~broken~

In the example above, I have to capture 'The spectacle', 'spectacle', and 'broken' separately. These will be translated either word by word or together with the accompanying article (An, The, etc.). The reason is that in my system:

1) 'The spectacle' requires a separate translation in specific cases.
2) 'Spectacle' also needs translation in specific cases.
3) IF a translation exists for 'The spectacle', we will use that, ELSE 
   we will use 

Another example to explain this is:

 ~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ 
  that was given to ~me~.

In the example above, I will have translations for:

 1) 'The spectacle' (because a translation case exists for 'The spectacle'; otherwise I would have translated only 'spectacle' on its own)
 2) 'broken'
 3) 'spectacle'
 4) 'me'
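
The fallback rule described above (use the phrase translation when one exists, otherwise translate the inner word on its own) could be sketched roughly as follows. The class name, the method name and the dictionary contents are all hypothetical and only serve as an illustration; the real system presumably looks translations up differently.

```java
import java.util.HashMap;
import java.util.Map;

final class TranslationLookup {
    // Prefers the phrase-level entry; falls back to the inner word otherwise.
    // Returns null when neither entry exists in the (hypothetical) dictionary.
    static String translate(Map<String, String> dictionary, String phrase, String innerWord) {
        String phraseTranslation = dictionary.get(phrase);
        if (phraseTranslation != null) {
            return phraseTranslation;
        }
        return dictionary.get(innerWord);
    }

    public static void main(String[] args) {
        // Placeholder entries, not real translations.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("spectacle", "<word translation>");
        System.out.println(translate(dictionary, "The spectacle", "spectacle"));
        dictionary.put("The spectacle", "<phrase translation>");
        System.out.println(translate(dictionary, "The spectacle", "spectacle"));
    }
}
```

This only works if the outer phrase and the inner word are both captured, which is why the nested match matters.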

I am having trouble putting together an expression that captures all of this. The one I have managed to get working so far is '~.+?~'. But I expect that some form of lookahead or lookbehind could get this working. Could anyone help me with this?
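
To make the problem concrete, here is a small sketch of what '~.+?~' returns for the flat and the nested examples, assuming Java's java.util.regex. The class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlatTildeDemo {
    private static final Pattern FLAT = Pattern.compile("~.+?~");

    // Returns every non-overlapping match of ~.+?~, scanning left to right.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = FLAT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("~albert~ is a ~good~ boy."));
        // prints [~albert~, ~good~]
        System.out.println(findAll("~The ~spectacle~~ was ~broken~"));
        // prints [~The ~, ~~ was ~]
    }
}
```

In the nested case the lazy match pairs each tilde with the very next one, so none of 'The spectacle', 'spectacle' or 'broken' comes out as a clean match.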

The most important aspect is regression-proofing: the existing matches must not break. If I manage to get it right, I will post it.

N.B. If it helps, for now only one level of nesting needs decomposition, so ~The ~spectacle~~ is the deepest level (until I need more!).

Upvotes: 4

Views: 338

Answers (2)

Downgoat

Reputation: 14361

I wrote something like this a while ago; I haven't tested it much, though:

(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))

or

(~(?(?=.*?~[A-Za-z]*?~.*?~).*?~.*?~.*?~|[^~]+?~))

RegEx101

Another alternative

(~(?:.*?~.*?~){0,2}.*?~)
                 ^^ change to max depth

Use whichever works best.

To support deeper nesting, add a few extra sets of .*?~ in the two places where they are repeated.
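
One caveat, assuming the target is Java's java.util.regex (the snippet further down is Java): the first two patterns use the conditional construct (?(condition)yes|no), which java.util.regex.Pattern does not support, so only the third alternative compiles there. A minimal check, with a made-up class name:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ConditionalSupportCheck {
    // Returns true if java.util.regex accepts the pattern, false if it rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Conditional pattern from above: rejected by java.util.regex.
        System.out.println(compiles("(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))")); // prints false
        // Bounded-repetition alternative: plain syntax, accepted.
        System.out.println(compiles("(~(?:.*?~.*?~){0,2}.*?~)")); // prints true
    }
}
```

This only checks that the patterns compile; it says nothing about whether the matches are the ones you want.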

The main problem

If we allow unlimited nesting, how would we know where a group begins and where it ends? A clumsy diagram:

~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |_________|      |         |
|                          |_______________________________|         |
|____________________________________________________________________|

or:

~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |         |      |_________|
|                          |______________|         |
|___________________________________________________|

The regex engine would have no idea which one to choose.

For your sentence

~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |         ||_____|      |                            |         |
|    |         |_____________|                            |         |
|    |____________________________________________________|         |
|___________________________________________________________________|

or:

~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |_________||     |______|                            |_________|                   |__|
|_______________|

What should I do?

Use distinct opening and closing characters (as @tbraun suggested) so the regex engine knows where each group starts and ends:

{This text can be {properly {nested}} without problems} because {the compiler {can {see {the}}} start and end points} easily.

Or write a small parser:

Note: I don't do Java much so some code might be incorrect

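import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.StringUtils; // Apache Commons Lang 3

public class TopLevelGroups {
    // Collects every top-level {...} group of myString, braces included.
    static List<String> find(String myString) {
        String[] chars = myString.split(""); // one element per character (Java 8+)
        int depth = 0;
        int lastIndex = 0; // position of the '{' that opened the current top-level group
        List<String> results = new ArrayList<String>();

        for (int i = 0; i < chars.length; i += 1) {
            if (chars[i].equals("{")) {
                depth += 1;
                if (depth == 1) {
                    lastIndex = i;
                }
            }
            if (chars[i].equals("}")) {
                depth -= 1;
                if (depth == 0) {
                    results.add(StringUtils.join(Arrays.copyOfRange(chars, lastIndex, i + 1), ""));
                }
                if (depth < 0) {
                    // Balancing problem: a '}' without a matching '{'
                    throw new IllegalArgumentException("Unbalanced '}' at index " + i);
                }
            }
        }
        return results;
    }
}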

This uses StringUtils
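
The loop above keeps only the top-level groups. If the nested groups are needed as well (the question wants both 'The spectacle' and 'spectacle'), a stack of opening positions is one way to get them. This is a rough sketch using only the standard library; the class and method names are invented, and stripping the inner braces from each group is my assumption about the desired output.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class NestedGroups {
    // Returns the text of every {...} group at any depth, in the order the
    // groups close, with all braces removed from the returned text.
    static List<String> allGroups(String input) {
        Deque<Integer> openPositions = new ArrayDeque<>();
        List<String> groups = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                openPositions.push(i);
            } else if (c == '}') {
                if (openPositions.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced '}' at index " + i);
                }
                int start = openPositions.pop();
                // Text between the matching braces, minus any inner braces.
                groups.add(input.substring(start + 1, i).replace("{", "").replace("}", ""));
            }
        }
        if (!openPositions.isEmpty()) {
            throw new IllegalArgumentException("Unclosed '{' at index " + openPositions.peek());
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(allGroups("{The {spectacle}} was {broken}"));
        // prints [spectacle, The spectacle, broken]
    }
}
```

Because every '}' is paired with the most recent unmatched '{', this handles any nesting depth without changing the code.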

Upvotes: 2

tbraun

Reputation: 2666

You'll need something to differentiate the start and end delimiters, e.g. { and }.

Then you can use the pattern \{[^{]*?\} to match only groups that contain no nested {:

{The {spectacle}} was {broken}

First iteration

{spectacle}
{broken}

Second iteration (after stripping the braces from the first iteration's matches)

{The spectacle}
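
A minimal sketch of these iterations in Java, assuming the braces of each match are stripped before the next pass (the answer does not spell that step out). The pattern is the one above with a capturing group added around the inner text; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IterativeBraces {
    // Pattern from the answer, plus a capturing group for the inner text.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{]*?)\\}");

    // Returns one list of captured texts per pass, innermost groups first.
    static List<List<String>> extractByIteration(String text) {
        List<List<String>> passes = new ArrayList<>();
        String current = text;
        while (true) {
            Matcher m = INNERMOST.matcher(current);
            List<String> found = new ArrayList<>();
            StringBuffer stripped = new StringBuffer();
            while (m.find()) {
                found.add(m.group(1));
                // Replace "{inner}" with "inner" so the enclosing group can match next pass.
                m.appendReplacement(stripped, Matcher.quoteReplacement(m.group(1)));
            }
            if (found.isEmpty()) {
                break;
            }
            m.appendTail(stripped);
            passes.add(found);
            current = stripped.toString();
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println(extractByIteration("{The {spectacle}} was {broken}"));
        // prints [[spectacle, broken], [The spectacle]]
    }
}
```

The loop stops as soon as a pass finds no group, so the number of passes equals the deepest nesting level in the input.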

Upvotes: -1
