Gary

Reputation: 2937

Regular Expressions: Using a negative look ahead in place of the unsupported negative look behind, and capturing the look-behind characters upon split

I'm struggling again with regular expressions. I've been trying to add the use of an escape character to escape a custom tag such as <1> to <57> and </1> to </57>. With the help of Georg, here, the following expression produces the desired result prior to attempting an escape method.

('This is a <21>test</21> again.').split(/(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a ', '<21>', 'test', '</21>', ' again.'
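
As a quick sanity check on the numeric part of that pattern, the sketch below tests the alternation against the tag numbers 0 through 60. The anchored regex `tagOnly` and the helper `acceptedNumbers` are names made up for this illustration; the alternation itself is copied from the expression above.

```javascript
// Anchored copy of the tag pattern so that it must match a whole string.
const tagOnly = /^<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>$/;

// Collect every n in [from, to] for which "<n>" is accepted as a tag.
function acceptedNumbers(from, to) {
  const out = [];
  for (let n = from; n <= to; n++) {
    if (tagOnly.test(`<${n}>`)) out.push(n);
  }
  return out;
}

const ok = acceptedNumbers(0, 60);
console.log(ok[0], ok[ok.length - 1], ok.length); // 1 57 57
console.log(tagOnly.test('</57>'), tagOnly.test('<58>')); // true false
```

So the alternation accepts exactly 1 through 57, with or without the closing slash.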

This question has one suggestion of using a negative look ahead and an OR to approximate the unsupported negative look behind. I modified that example for what I thought was my simpler problem; however, I'm stumped again.

('This is a <21>test</21> again.').split(/(?:(?!\\).|^)(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a', '<21>', 'tes', '</21>', ' again.' So, it does not include the character just before <21> or </21> when that character is not a \. I see why, since I used ?: to make the group non-capturing.

However, if it's removed then:

('This is a <21>test</21> again.').split(/((?!\\).|^)(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a', ' ', '<21>', 'tes', 't', '</21>', ' again.' And the previous character generates a separate split.

Apart from this problem, the escaping works such that when the previous character is a \ the tag doesn't generate a split of the string.

Could you please let me know if there is a way to capture the previous character but include it with the text of the previous string rather than as its own split, and possibly to exclude it only when it is a \?

When the string is 'This is a <21>test</21> again.', the desired result is 'This is a ', '<21>', 'test', '</21>', ' again.'

And when it is 'This is a \<21>test</21> again.', the desired result is 'This is a <21>', 'test', '</21>', ' again.'

Thank you.

Addition: After recently learning from this MDN document that a function can be passed as a parameter to a replace operation that uses a regular expression, I started to wonder whether something similar could be done here. I don't know anything about measuring performance. However, three things motivated me to experiment with another approach: the complexity of the regular expression provided by Revo below; his answer to my comment about efficiency, stating that a negative look behind would be a significant improvement in efficiency and less work for the RegExp engine; and the fact that RegExp is something of a black-box mystery to me. The new approach takes a couple more lines of code but produces the same result with a much shorter regular expression. All it really does is match every tag, both with and without an escape character, rather than trying to exclude those escaped with a \, and then ignore the escaped ones while building the array. Snippet below.

I don't know if the times shown in the console log are indicative of performance but, if so, in the examples I ran, the interval between logging 'start' and logging the result of a.split is considerably longer, as a percentage, than the interval between that log and the final log of array a built by the exec approach.

Also, the innermost if block within the while statement is there to prevent a "" from being saved in the array when a tag is at the beginning or end of the string, or when there is no text between two tags.

I'd appreciate any insight you may be able to provide concerning why or why not to use one approach over the other, or introducing a better method for the case of not having access to a true negative look behind. Thank you.

// p holds the index just past the last unescaped tag; it starts at 0.
let a, i = 0, p = 0, r,
    x = /\\?<\/?(?:[1-9]|[1-4]\d|5[0-7])>/g,
    T = '<1>This is a <21>test<21> of \\<22>escaped and \\> </ unescaped tags.<5>';

console.log('start');

// Approach 1: split on a pattern that matches both the plain text and the tags.
a = T.split(/((?:[^<\\]+|\\+.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/).filter(Boolean);

console.log(a);

// Approach 2: match every tag, escaped or not, and skip the escaped ones.
a = [];
while ((r = x.exec(T)) !== null) {
  if (r[0].charAt(0) !== '\\') {
    if (r.index === 0 || r.index === p) {
      // No text before this tag, so avoid saving an empty string.
      a[i] = r[0];
      i = i + 1;
    } else {
      a[i] = T.substring(p, r.index);
      a[i + 1] = r[0];
      i = i + 2;
    }
    p = x.lastIndex;
  }
}

// Save any text that follows the last unescaped tag.
if (p !== T.length) a[i] = T.substring(p);
console.log(a);
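
For comparing the two approaches side by side, the exec loop can also be wrapped in a function. This is only a sketch under the same assumptions as the snippet above; the name `splitTagsExec` is made up for illustration, and `push` replaces the manual index bookkeeping.

```javascript
// Split `text` into plain-text pieces and unescaped tags <1>..<57> / </1>..</57>.
// A tag preceded by a backslash is treated as escaped and left inside the text.
function splitTagsExec(text) {
  const tag = /\\?<\/?(?:[1-9]|[1-4]\d|5[0-7])>/g;
  const parts = [];
  let prev = 0; // index just past the last unescaped tag
  let m;
  while ((m = tag.exec(text)) !== null) {
    if (m[0][0] === '\\') continue;          // escaped tag: not a split point
    if (m.index > prev) parts.push(text.slice(prev, m.index)); // skip empty text
    parts.push(m[0]);
    prev = tag.lastIndex;
  }
  if (prev < text.length) parts.push(text.slice(prev)); // trailing text, if any
  return parts;
}

console.log(splitTagsExec('This is a <21>test</21> again.'));
// [ 'This is a ', '<21>', 'test', '</21>', ' again.' ]
console.log(splitTagsExec('This is a \\<21>test</21> again.'));
// [ 'This is a \\<21>test', '</21>', ' again.' ]
```

Note that the escaping backslash stays in the text; removing it would be a separate step.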

Upvotes: 3

Views: 147

Answers (1)

revo

Reputation: 48761

You are splitting on the desired sub-strings and using a capturing group to keep them in the output. The same can be done for the undesired sub-strings: match them and enclose them in a capturing group so that they appear in the output as well. The regex would be:

(undesired-part|desired-part)

The regex for undesired sub-strings should come first because a desired one can be found inside an undesired one; e.g. <21> is contained in \<21>, so the latter should be matched first.

You already wrote the desired part, so it is known:

(undesired-part|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)

So what about undesired? Here it is:

(?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+

Let's break it down:

  • (?: Start of non-capturing group (NCG)
    • [^<\\]+ Match one or more characters other than < and \
    • | Or
    • \\.? Match a backslash and, if present, the character it escapes
    • | Or
    • <(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>) Match a < that does not start a desired tag
  • )+ End of NCG, repeat as many times as possible and at least once
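
To see what this part consumes on its own, here is a small sketch that anchors the undesired pattern at the start of a string and returns whatever it matches. The `^` anchor and the helper name `leadingText` are additions for illustration only; the pattern is otherwise the one broken down above.

```javascript
// The undesired part, anchored at the start of the input for this demo.
const undesired = /^(?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+/;

// Return the run of "undesired" text at the start of s ('' if there is none).
function leadingText(s) {
  const m = undesired.exec(s);
  return m ? m[0] : '';
}

console.log(leadingText('plain text <21>rest')); // 'plain text '
console.log(leadingText('a \\<21>b</21>'));      // 'a \\<21>b' (escaped tag is swallowed)
console.log(leadingText('ag<ain\\.'));           // 'ag<ain\\.' (a lone < is ordinary text)
console.log(leadingText('<21>x'));               // '' (a real tag is never consumed)
```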

Overall it is:

((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)

JS code:

console.log(
  'This is a \\<21>test</21> ag<ain\\.'.split(/((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/).filter(Boolean)
);
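
As a check against the two inputs from the question, the same expression can be wrapped in a helper and run on each. This is a sketch only; `TAG_SPLIT` and `splitOnTags` are names chosen here, not part of any library.

```javascript
// Undesired runs first, then the desired tags; both inside one capturing group
// so that split() keeps every piece in its output.
const TAG_SPLIT =
  /((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/;

// Because every character is matched, split() leaves only empty strings between
// the captures; filter(Boolean) drops them.
function splitOnTags(text) {
  return text.split(TAG_SPLIT).filter(Boolean);
}

console.log(splitOnTags('This is a <21>test</21> again.'));
// [ 'This is a ', '<21>', 'test', '</21>', ' again.' ]
console.log(splitOnTags('This is a \\<21>test</21> again.'));
// [ 'This is a \\<21>test', '</21>', ' again.' ]
```

The escaping backslash remains in the output; stripping it, if wanted, would be a separate step after the split.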

Upvotes: 2
