Eghbal

Reputation: 3783

Why does MATLAB's regexp only return first match token string?

Suppose that we have this code in MATLAB:

ax = 'aa+bb+cc+dd';
middle_part = regexp(ax, '\+(\w+)\+','tokens');

Why does MATLAB only return 'bb' as output, and not 'bb' and 'cc'?

Upvotes: 3

Views: 380

Answers (1)

Wiktor Stribiżew

Reputation: 626952

You need to place the second + inside a lookahead so that the regex engine does not consume it. I explain how lookaheads work in another answer of mine.

Here is a code snippet:

ax = 'aa+bb+cc+dd';
middle_part = regexp(ax, '\+(\w+)(?=\+)','tokens');
disp(middle_part)

Result:

{
  [1,1] =
  {
    [1,1] = bb
  }
  [1,2] =
  {
    [1,1] = cc
  }
}

So, in short, here is what is going on: \+(\w+)\+ matches +bb+ and moves the index to just after the + that follows bb. Only cc+dd is left to be tested, and no match is found there because the pattern requires a + on both sides of one or more word characters.
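To see the consumption for yourself, you can ask regexp for the matched text and its indices as well as the tokens. This is only a small sketch; the variable names tok, mat, s, e and rest are my own choices for illustration.

```matlab
ax = 'aa+bb+cc+dd';

% Request the tokens, the full matched text, and the start/end indices.
% Outputs come back in the same order as the option strings.
[tok, mat, s, e] = regexp(ax, '\+(\w+)\+', 'tokens', 'match', 'start', 'end');

% mat is {'+bb+'}: the single match includes BOTH plus signs.
% s is 3 and e is 6, so the second + (index 6) has been consumed.

% The engine resumes scanning right after the end of the match.
rest = ax(e(1)+1:end);   % 'cc+dd', which has no leading + left to match
```

Because the + at index 6 belongs to the first match, it cannot also serve as the opening + of a second match.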

With the lookahead version, \+(\w+)(?=\+), the engine matches +bb, which is immediately followed by a +, and moves the index to just after the second b. The remaining string is +cc+dd, so +cc is matched as well.
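The same kind of check works for the lookahead pattern, and it also shows one way to turn the nested token cells into a flat list. Again, this is a sketch; tok, mat, s, e and flat are names I picked for illustration.

```matlab
ax = 'aa+bb+cc+dd';

% The lookahead (?=\+) checks for a following + without consuming it.
[tok, mat, s, e] = regexp(ax, '\+(\w+)(?=\+)', 'tokens', 'match', 'start', 'end');

% mat is {'+bb', '+cc'}: the trailing + is not part of either match.
% s is [3 6] and e is [5 8], so the + at index 6 ends up opening the second match.

% 'tokens' returns one cell per match, each holding that match's tokens.
% With a single capture group, concatenating them gives a flat cell array.
flat = [tok{:}];   % {'bb', 'cc'}
```

Note that +dd is not matched, because no + follows it, so the lookahead fails there.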

Upvotes: 2
