amorino

Reputation: 402

Regex ignore if empty

My regex has two alternatives (I'm using it in PHP):

(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))

When I test the first alternative against the following line, the matches land in groups 1, 2, 3 and 4:

BIOLOGIQUES                                                                                          47     131002 / 4302

Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6

But when I test the second alternative, the matches land in groups 5, 6, 7 and 8:

   Dossier N°       :     47     131002 / 4302

The second condition here : http://www.rubular.com/r/eYzBJq1rIW

Is there a way to always get the matches in groups 1, 2, 3 and 4, including when the second alternative matches?

Upvotes: 1

Views: 224

Answers (2)

user557597

I assume your regex is written in compressed form (no free-spacing). If the dot after N is meant to be a literal abbreviation period, it should be escaped. The suggestion below factors out the common part, as Barmar's answer does. If you don't want to capture the two different prefixes, remove the parentheses around them.

Sorry, it looks like you intend the dot to be a metacharacter. In that case, just remove the \ in front of it.

 # (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))

 (?:
      ( BIOLOGIQUES )                 # (1)
   |  ( Dossier\ N \. \s+ : )         # (2)
 )
 \s+ 
 (                               # (3 start)
      ( \d+ )                         # (4)
      \s+ 
      ( \d+ )                         # (5)
      \s+ \/ \s+ 
      ( \d+ )                         # (6)
 )                               # (3 end)
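
As a rough sketch of how the expanded layout above could be used directly from PHP, the x flag makes PCRE ignore unescaped whitespace and # comments inside the pattern. The ~ delimiter, the u flag, the unescaped dot after N and the sample lines (with the long runs of spaces shortened) are assumptions made for this illustration.

```php
<?php
// Free-spacing version of the factored pattern.
// x: ignore unescaped whitespace and allow # comments inside the pattern.
// u: treat pattern and subject as UTF-8 so the dot can match the degree sign.
$pattern = '~
    (?:
         ( BIOLOGIQUES )              # (1)
      |  ( Dossier\ N . \s+ : )       # (2)
    )
    \s+
    (                                 # (3) all three numbers
         ( \d+ )                      # (4)
         \s+
         ( \d+ )                      # (5)
         \s+ / \s+
         ( \d+ )                      # (6)
    )
~xu';

// Sample lines modelled on the question; \u{00B0} is the degree sign.
$lines = [
    "BIOLOGIQUES      47     131002 / 4302",
    "   Dossier N\u{00B0}       :     47     131002 / 4302",
];

$numbers = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        // The numbers are always in groups 4, 5 and 6, whichever prefix matched.
        $numbers[] = [$m[4], $m[5], $m[6]];
    }
}
print_r($numbers);
```

Whichever prefix matches, the three numbers stay in groups 4, 5 and 6, so the calling code does not need to know which alternative was taken.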

Edit: the regex should be factored, but if the alternatives become too different to factor, you can reuse the same capture group numbers with a branch reset group, (?|...).
Here is your original regex, annotated, using branch reset.

 (?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))

      (?|
 br 1      (                               # (1 start)
                BIOLOGIQUES \s+ 
      2         ( \d+ )                         # (2)
                \s+ 
      3         ( \d+ )                         # (3)
                \s+ \/ \s+ 
      4         ( \d+ )                         # (4)
    1      )                               # (1 end)
        |  
 br 1      (                               # (1 start)
                Dossier\ N . \s+ : \s+ 
      2         ( \d+ )                         # (2)
                \s+ 
      3         ( \d+ )                         # (3)
                \s+ \/ \s+ 
      4         ( \d+ )                         # (4)
    1      )                               # (1 end)
      )
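
To check the numbering, here is a small sketch that runs the compressed branch reset pattern through preg_match on both kinds of line. The u flag and the shortened sample strings are assumptions added for the example; the pattern itself is the one shown above.

```php
<?php
// Branch reset: (?| ... | ... ) restarts group numbering in each alternative,
// so both branches fill groups 1 to 4.
$pattern = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))'
         . '|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';

// Sample lines modelled on the question; \u{00B0} is the degree sign.
$first  = "BIOLOGIQUES      47     131002 / 4302";
$second = "   Dossier N\u{00B0}       :     47     131002 / 4302";

preg_match($pattern, $first, $a);
preg_match($pattern, $second, $b);

// Each result holds the whole match plus exactly four groups.
echo count($a), ' ', count($b), "\n";            // prints "5 5"
echo implode(',', array_slice($a, 2)), "\n";     // prints "47,131002,4302"
echo implode(',', array_slice($b, 2)), "\n";     // prints "47,131002,4302"
```

Group 1 holds the whole matched record in both cases, and groups 2 to 4 hold the numbers, which is the layout asked for in the question.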

Or, you could factor it AND use branch reset.

 # (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))

      (?|
 br 1      ( BIOLOGIQUES \s+ )             # (1)
        |  
 br 1      ( Dossier\ N . \s+ : \s+ )      # (1)
      )
      (?:
 2         ( \d+ )                         # (2)
           \s+ 
 3         ( \d+ )                         # (3)
           \s+ \/ \s+ 
 4         ( \d+ )                         # (4)
      )

Upvotes: 0

Barmar

Reputation: 780842

Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:

preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);

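Use the u modifier so that the pattern and the subject are treated as UTF-8. Without it, the . matches only the first byte of the two-byte ° character, and the following \s+ then fails.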
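
A minimal sketch of the effect of the u modifier, assuming the input is UTF-8 encoded. The variable names and the shortened sample lines are made up for the illustration; the second pattern is the one from the preg_match call above.

```php
<?php
$degree = "\u{00B0}";   // degree sign U+00B0, two bytes (0xC2 0xB0) in UTF-8

// Without u the dot consumes a single byte, so one byte is left before the end.
$withoutU = preg_match('/^N.$/', 'N' . $degree);
// With u the dot consumes the whole two-byte character.
$withU = preg_match('/^N.$/u', 'N' . $degree);

echo $withoutU, ' ', $withU, "\n";   // prints "0 1"

// The factored pattern gives the same groups for both kinds of line.
$pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';
$lines = [
    "BIOLOGIQUES      47     131002 / 4302",
    "   Dossier N" . $degree . "       :     47     131002 / 4302",
];

$results = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $match)) {
        // $match[1] is the whole record, $match[2..4] are the numbers.
        $results[] = array_slice($match, 2);
    }
}
print_r($results);
```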

Upvotes: 3
