Reputation: 6814
I am trying to search for words that appear within tilde (~) sign borders.
e.g. ~albert~ is a ~good~ boy.
I know that this is possible by using ~.+?~, and it already works for me. But there are special cases when I need to match a nested tilde sentence.
e.g. ~The ~spectacle~~ was ~broken~
In the example above, I have to capture 'The spectacle', 'spectacle', and 'broken' separately. These will be translated either word by word or together with the accompanying article (A, An, The, and so on). The reason is that in my system:
1) 'The spectacle' requires a separate translation in specific cases.
2) 'spectacle' also needs translation in specific cases.
3) IF a translation exists for 'The spectacle', we will use that, ELSE we will use the translation for 'spectacle' alone.
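The fallback rule in the list above can be sketched in Java roughly as follows. This is only an illustration: the class name PhraseTranslator, the translate method, and the dictionary contents are hypothetical and not part of my real system.

```java
import java.util.HashMap;
import java.util.Map;

class PhraseTranslator {
    // Hypothetical dictionary; the caller supplies the entries.
    private final Map<String, String> dictionary = new HashMap<>();

    PhraseTranslator(Map<String, String> entries) {
        dictionary.putAll(entries);
    }

    // Prefer the translation of the whole phrase (e.g. "The spectacle").
    // If there is none, fall back to the inner word (e.g. "spectacle").
    // Returns null when neither has an entry.
    String translate(String phrase, String innerWord) {
        String whole = dictionary.get(phrase);
        return whole != null ? whole : dictionary.get(innerWord);
    }
}
```

This is why both the outer phrase and the inner word have to be captured: the lookup needs both candidates.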
Another example to explain this is:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
In the example above, I will have translations for:
1) 'The spectacle' (because a translation exists for 'The spectacle'; otherwise I would have translated only 'spectacle' on its own)
2) 'broken'
3) 'spectacle'
4) 'me'
I am having trouble composing an expression that captures all of this. The one I have working so far is '~.+?~', but I know that with some form of lookahead or lookbehind I can get this working. Could anyone help me with this?
The most important aspect of this is regression-proofing: the existing matches must not break. If I manage to get it right, I will post it.
N.B. If it helps, currently only one level of nesting requires decomposition, so ~The ~spectacle~~ is the deepest level (until I need more!).
Upvotes: 4
Views: 338
Reputation: 14361
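I wrote something like this a while ago, though I haven't tested it much. Note that the first two patterns use a conditional, (?(?=...)yes|no), which PCRE and .NET support but java.util.regex does not: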
(~(?(?=.*?~~.*?~).*?~.*?~.*?~|[^~]+?~))
or
(~(?(?=.*?~[A-Za-z]*?~.*?~).*?~.*?~.*?~|[^~]+?~))
Another alternative
(~(?:.*?~.*?~){0,2}.*?~)
Change the 2 in {0,2} to the maximum depth. Use whichever works best.
To allow more levels, add a few extra sets of .*?~ in the two places where you see a run of them.
If we allow unlimited nesting, how would we know where a group begins and where it ends? A clumsy diagram:
~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |_________|      |         |
|                          |_______________________________|         |
|____________________________________________________________________|
or:
~This text could be nested ~ so could this~ and this~ this ~Also this~
|                          |              |         |      |_________|
|                          |______________|         |
|___________________________________________________|
The parser would have no idea which to choose. The same problem applies to your example:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |         ||_____|      |                            |         |
|    |         |_____________|                            |         |
|    |____________________________________________________|         |
|___________________________________________________________________|
or:
~The ~spectacle~~ was ~broken~, but that was not the same ~spectacle~ that was given to ~me~.
|    |_________||     |______|                            |_________|                   |__|
|_______________|
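To see the ambiguity in practice, here is a small Java sketch that runs the flat pattern from the question, ~.+?~, over both a flat and a nested sentence. The class and method names (FlatTildeDemo, matches) are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlatTildeDemo {
    // The non-nested pattern from the question.
    private static final Pattern FLAT = Pattern.compile("~.+?~");

    // Returns every match of the flat pattern, in order of appearance.
    static List<String> matches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FLAT.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Flat input: each pair of tildes is found as expected.
        System.out.println(matches("~albert~ is a ~good~ boy."));
        // prints [~albert~, ~good~]

        // Nested input: the tildes get paired in the wrong way.
        System.out.println(matches("~The ~spectacle~~ was ~broken~"));
        // prints [~The ~, ~~ was ~]
    }
}
```

With identical opening and closing characters the engine simply pairs each tilde with the next one it can reach, so none of the three wanted groups comes out of the nested input.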
Use distinct opening and closing characters (as @tbraun suggested) so the parser knows where each group starts and ends:
{This text can be {properly {nested}} without problems} because {the compiler {can {see {the}}} start and end points} easily.
Or write a small parser:
Note: I don't do Java much so some code might be incorrect
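import java.util.ArrayList;
import java.util.List;

public class BraceGroups {
    // Returns every top-level {...} group in the input, braces included.
    static List<String> topLevelGroups(String input) {
        List<String> results = new ArrayList<>();
        int depth = 0;
        int lastIndex = 0; // position of the '{' that opened the current top-level group
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                depth++;
                if (depth == 1) {
                    lastIndex = i;
                }
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    // Balancing problem: a '}' without a matching '{'
                    throw new IllegalArgumentException("Unbalanced '}' at index " + i);
                }
                if (depth == 0) {
                    results.add(input.substring(lastIndex, i + 1));
                }
            }
        }
        return results;
    }
}
This uses only java.util and String.substring.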
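The loop above only collects top-level groups. Since the question needs the inner word as well as the outer phrase, a variation that records every group at every depth might look like the sketch below. It keeps a stack of open-brace positions. The names NestedGroups and allGroups are hypothetical, and stripping the nested braces from the reported text is my assumption about what the translation step wants.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedGroups {
    // Collects the text of every {...} group at every depth.
    // A group is recorded when its closing brace is reached, so a nested
    // group always appears before the group that encloses it.
    static List<String> allGroups(String input) {
        List<String> groups = new ArrayList<>();
        Deque<Integer> openPositions = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                openPositions.push(i);
            } else if (c == '}') {
                if (openPositions.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced '}' at index " + i);
                }
                int start = openPositions.pop();
                String inner = input.substring(start + 1, i);
                // Assumption: callers want the plain words, without nested braces.
                groups.add(inner.replace("{", "").replace("}", ""));
            }
        }
        if (!openPositions.isEmpty()) {
            throw new IllegalArgumentException("Unclosed '{' at index " + openPositions.peek());
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(allGroups("{The {spectacle}} was {broken}"));
        // prints [spectacle, The spectacle, broken]
    }
}
```

Because this is a single pass with an explicit stack, it is not limited to one level of nesting.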
Upvotes: 2
Reputation: 2666
You'll need something to differentiate the start and end delimiters, e.g. { and }.
Then you can use the pattern \{[^{]*?\} which excludes any inner {, so it matches only the innermost groups:
{The {spectacle}} was {broken}
First iteration
{spectacle}
{broken}
Second iteration
{The spectacle}
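The two iterations above could be driven by a loop like the following Java sketch. It uses the same pattern with a capture group added, and after each pass it replaces every match with its contents so the next pass can see the enclosing group. The names IterativeBraces and iterations are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IterativeBraces {
    // Same pattern as above, plus a capture group around the contents.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{]*?)\\}");

    // Returns the matches found in each pass, innermost groups first.
    static List<List<String>> iterations(String input) {
        List<List<String>> passes = new ArrayList<>();
        String current = input;
        while (true) {
            List<String> found = new ArrayList<>();
            Matcher m = INNERMOST.matcher(current);
            while (m.find()) {
                found.add(m.group());
            }
            if (found.isEmpty()) {
                break; // nothing left to unwrap
            }
            passes.add(found);
            // Drop the braces of the groups just found; "$1" is their contents.
            current = INNERMOST.matcher(current).replaceAll("$1");
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println(iterations("{The {spectacle}} was {broken}"));
        // prints [[{spectacle}, {broken}], [{The spectacle}]]
    }
}
```

Each pass removes at least one pair of braces, so the loop stops once no complete group is left.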
Upvotes: -1