Rashmi Choudhary

Reputation: 53

create a generic regex for a string in perl

I have tried to create a regex for the string below:

STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert

REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/

I am unable to figure out the mistake here.

I created the above by adapting another regex, which works fine and is given below together with a string it matches:

/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/

sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag

Can someone please help me? Any leads would be highly appreciated.

Upvotes: 0

Views: 74

Answers (2)

Valdi_Bo

Reputation: 30971

Let's start with a brief look at your source text (the first one that you included).

It is composed of "sections" separated with the | char.

This char (|) must be matched by \|. Don't forget the backslash: a "bare" | is the alternation operator (you used it that way in one place).

And now take a look at each section (between |):

  • Some of them contain only a sequence of word chars (and can be matched by \w+).
  • Other sections, however, also contain other chars, e.g. slashes, a backslash, braces and dots, so each such section is actually a sequence of chars other than "|" and must be matched by [^|]+ (here, between [ and ], the vertical bar may be unescaped).

Now let's write each section and its "type":

  1. sou_u02_..._FW_ALERT - word chars.
  2. /opt/app/.../UnifiedLogging - other chars (because of slashes).
  3. UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8} and dots).
  4. FATAL|red|1h - 3 sections composed of word chars.
  5. An empty section, between 2 consecutive | chars.
  6. fw_alert - word chars.
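The classification above can be checked mechanically. The following is only a minimal sketch: it splits the sample line from the question on | and labels each field. The variable names are illustrative, and note that it prints 8 fields, because point 4 of the list covers three of them.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep \d{8} as literal text.
my $line = 'sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert';

# The pipe must be escaped in the split pattern; the limit -1 keeps empty fields.
my @fields = split /\|/, $line, -1;

for my $i (0 .. $#fields) {
    my $f = $fields[$i];
    my $type = $f eq ''      ? 'empty'
             : $f =~ /^\w+$/ ? 'word chars'
             :                 'other chars';
    printf "%d. %-11s %s\n", $i + 1, $type, $f;
}
```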

And now, how to match these groups, and the separating |:

  • Point 1: \w+\| - word chars and (escaped) vertical bar.
  • Points 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing group - (?:...), containing a sequence of "other" chars - [^|]+ and a vertical bar - \|, occurring 2 times {2}.
  • Point 4 (three "word char" groups): (?:\w+\|){3} - similar to the previous point.
  • Point 5: Just as in your solution - ([^|]+|)\|, a capturing group - (...), with 2 alternatives ...|.... The first alternative is [^|]+ (a sequence of "other" chars), and the second alternative is empty. After the capturing group there is \| to match the vertical bar.
  • Point 6: \w+ - word chars. This time no \|, as this is the last section.

The regex assembled so far must be:

  • prepended with a ^ (start of string) and
  • appended with a $ (end of string).

So the whole regex, matching your source text can be:

^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$

Actually, the only capturing group can be written another way, as ([^|]*) - without alternatives, but with * as the repetition count, allowing also empty content. Your choice, which variant to apply.
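As a quick check, here is a small sketch that runs the assembled regex against the sample line from the question. It uses the ([^|]*) variant of the capturing group and the /x flag so that each part can carry a comment; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample line from the question; single quotes keep \d{8} as literal text.
my $line = 'sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert';

# Same regex as above, spread over several lines with /x.
my $re = qr{
    ^ \w+ \|              # point 1: word chars
    (?: [^|]+ \| ){2}     # points 2 and 3: anything except a pipe
    (?: \w+ \| ){3}       # point 4: three word-char sections
    ( [^|]* ) \|          # point 5: possibly empty section, captured
    \w+ $                 # point 6: last section, no trailing pipe
}x;

if ($line =~ $re) {
    print "matched, section 7 = '$1'\n";
} else {
    print "no match\n";
}
```

For this sample line the captured section is the empty string.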

Upvotes: 2

Borodin

Reputation: 126722

The third field

UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv

contains a backslash \, braces { } and dots. None of these can be matched by \w.

Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.
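A short sketch to confirm this on the third field alone (the variable names are just for illustration):

```perl
use strict;
use warnings;

# Third field of the sample line, kept literal by the single quotes.
my $field = 'UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv';

my $word_ok  = $field =~ /^\w+$/   ? 'yes' : 'no';
my $other_ok = $field =~ /^[^|]+$/ ? 'yes' : 'no';

print "\\w+ covers the whole field:   $word_ok\n";
print "[^|]+ covers the whole field: $other_ok\n";

# Collect each distinct character that \w cannot match.
my %seen;
my @non_word = grep { !$seen{$_}++ } $field =~ /(\W)/g;
print "characters outside \\w: @non_word\n";
```

The first check reports no and the second yes, which is why the third \w+ in the question's regex has to become [^|]+.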

Upvotes: 0
