Martin Edlman

Reputation: 753

Perl regex repetition match

I'm seeing strange behaviour with the '?' regexp quantifier. I'm processing a log file and searching for specific HTTP error responses, e.g. 401. A line may or may not contain the body of the response, so I want to match both cases. I have the following code.

#!/usr/bin/perl
$match = 'response 401';
$line = '2021-04-08 07:15:01 |  INFO | [http-nio-8080-exec-11] | rId:123456789 | ip:127.0.0.1 | activationId: abcdefg | user: admin | response 401: headers: Cache-Control: [no-cache, no-store, max-age=0, must-revalidate] / Content-Length: [60] / Content-Type: [application/json;charset=UTF-8] / Date: [Thu, 08 Apr 2021 05:15:01 GMT] / Expires: [0] / Pragma: [no-cache] | body: {"errors":[{"message":"Bad credentials","repeatable":true}]}';
    
my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[(.*?)\].*?$match.*?(?:body\:\s+({.*}))?/;
if($body) {
  print "body: $body\n";
}

This prints nothing. I expected it to print the body, since .*?$match.*? should match the smallest possible part of the line and leave enough for the body pattern, but it doesn't. When I remove the ? from the body group and make it mandatory, the line matches.

my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[(.*?)\].*?$match.*?(?:body\:\s+({.*}))/;

But this won't match lines that have no body. What's wrong with the regexp? I suspect the non-greedy .*? preceding the (?:body...)? group is the cause, since the match can succeed while the optional body matches nothing.

How do I write the correct regexp?

Upvotes: 4

Views: 154

Answers (2)

The fourth bird

Reputation: 163632

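You could make the part containing group 4 optional and assert the end of the string. The end anchor forces the lazy .*? to keep expanding until either the optional part matches or the line ends.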

^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+({.*}))?$
  • ^ Start of string
  • (.*?)\s+\| Capture group 1, match any char as few times as possible, then match whitespace and |
  • \s+(\w+)\s+\| Match whitespace, capture 1+ word chars in group 2, then match whitespace and |
  • \s+\[([^][]*)\] Match whitespace and capture everything between [...] in group 3
  • .*? Match any char as few times as possible
  • (?:\s+\|\s+body:\s+({.*}))? Optionally match | between whitespace, then body: and capture everything between {...} in group 4
  • $ End of string
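
As a minimal sketch of why the end-of-string assertion matters, the two patterns below differ only in the trailing $. The shortened sample line is a made-up stand-in for a real log line, and the { is escaped as \{ only to stay safe across Perl versions.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample line (not taken from the real log).
my $line = 'response 401: headers: X | body: {"a":1}';

# No anchor: the lazy .*? matches zero characters, the optional group
# matches nothing, and the overall match still succeeds.
my ($unanchored) = $line =~ /response 401.*?(?:body:\s+(\{.*\}))?/;

# With $: the lazy .*? has to expand until the rest of the pattern can
# reach the end of the string, so the optional group is tried at "body:".
my ($anchored) = $line =~ /response 401.*?(?:body:\s+(\{.*\}))?$/;

print defined $unanchored ? "unanchored: $unanchored\n" : "unanchored: (no body captured)\n";
print defined $anchored   ? "anchored: $anchored\n"     : "anchored: (no body captured)\n";
```

The first print reports that no body was captured, while the second prints anchored: {"a":1}.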


Using the example code:

$match = 'response 401';
$line = '2021-04-08 07:15:01 |  INFO | [http-nio-8080-exec-11] | rId:123456789 | ip:127.0.0.1 | activationId: abcdefg | user: admin | response 401: headers: Cache-Control: [no-cache, no-store, max-age=0, must-revalidate] / Content-Length: [60] / Content-Type: [application/json;charset=UTF-8] / Date: [Thu, 08 Apr 2021 05:15:01 GMT] / Expires: [0] / Pragma: [no-cache] | body: {"errors":[{"message":"Bad credentials","repeatable":true}]}';

my($tstamp, $level, $thread, $body) = $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+({.*}))?$/;
if($body) {
  print "body: $body\n";
}

Output

body: {"errors":[{"message":"Bad credentials","repeatable":true}]}

If there is no body, you can still get the values of $tstamp, $level and $thread.
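
A quick way to check the no-body case is to run the same pattern against a line that ends after the headers. The line below is a hypothetical, shortened variant of the sample from the question, and the { is written as \{ here purely as a precaution.

```perl
use strict;
use warnings;

# Hypothetical log line with no "| body: ..." part.
my $line = '2021-04-08 07:15:01 |  INFO | [http-nio-8080-exec-11] | rId:123456789 | user: admin | response 401: headers: Expires: [0] / Pragma: [no-cache]';

my ($tstamp, $level, $thread, $body) =
    $line =~ m/^(.*?)\s+\|\s+(\w+)\s+\|\s+\[([^][]*)\].*?(?:\s+\|\s+body:\s+(\{.*\}))?$/;

print "tstamp: $tstamp\n";
print "level: $level\n";
print "thread: $thread\n";
# $body is undef when the optional group did not take part in the match.
print defined $body ? "body: $body\n" : "no body\n";
```

Note that this pattern does not interpolate $match, so if only lines containing 'response 401' should be processed, a separate check such as $line =~ /\Q$match\E/ would be needed first.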

Upvotes: 5

RavinderSingh13

Reputation: 133770

With the samples you have shown, please try the following.

^(\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2})\s+\|\s+(\S+)\s+\|\s+\[([^]]*)\].*?(body.*)?$


Explanation: a detailed breakdown of the above regex.

^                                          ##Matching the start of the value with the caret.
(\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2})   ##Creating 1st capturing group to match the timestamp.
\s+\|\s+                                   ##Matching spaces, pipe, spaces (one or more spaces each).
(\S+)                                      ##Creating 2nd capturing group, which matches non-space characters and will hold INFO/WARN/ERROR etc.
\s+\|\s+\[                                 ##Matching spaces, pipe, spaces (one or more spaces each), then a literal [.
([^]]*)                                    ##Creating 3rd capturing group, which holds everything up to the first ].
\].*?                                      ##Matching the literal ], followed by a lazy match of any characters.
(body.*)?$                                 ##Creating 4th capturing group, which matches from body to the end of the line; it is optional and anchored at the end of the line/value.
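
In Perl, the same breakdown could be kept next to the pattern by using the /x modifier, which ignores whitespace in the pattern and allows # comments. This is only a sketch: the sample line is a hypothetical, shortened version of the one in the question.

```perl
use strict;
use warnings;

# /x lets the pattern be laid out with whitespace and comments.
my $re = qr/
    ^
    (\d{4}-\d{2}-\d{2}\s*(?:\d{2}:){2}\d{2})   # group 1: timestamp
    \s+\|\s+
    (\S+)                                      # group 2: level
    \s+\|\s+\[
    ([^]]*)                                    # group 3: thread, up to the first ]
    \].*?
    (body.*)?                                  # group 4: optional, from "body" to end
    $
/x;

# Hypothetical, shortened sample line.
my $line = '2021-04-08 07:15:01 |  INFO | [http-nio-8080-exec-11] | user: admin | response 401: headers: Pragma: [no-cache] | body: {"errors":[]}';

my ($tstamp, $level, $thread, $body) = $line =~ $re;

print "tstamp: $tstamp\n";
print "level: $level\n";
print "thread: $thread\n";
print defined $body ? "group 4: $body\n" : "no body\n";
```

With this pattern, group 4 includes the body: prefix itself, so for the sample line it holds body: {"errors":[]} rather than just the JSON part.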

Upvotes: 6
