Funtboy

Reputation: 89

Parsing - Adding a capturing group

I am attempting to use a fairly complex regular expression (see the regex101 demos below), which I amended slightly from one created by an expert on this site. It parses specific patterns of log events:

These log sequences always begin with a selection of EXE_IN or EXE_CO events, each preceded by a sequence number. There can be any number of these events, in any order. In this case we have just two EXE events, but there could be 200, or just 1. Note that each event has a sequence number and we need to capture it.

The second part of the sequence is always a series of digit-prefixed CONTENT_ACCESS events, again one or more with no upper limit.

The following demo shows a working example and probably conveys the concept better than I can: Demo 1

It nicely captures a full match, sequence number, and event in separate groups.

I need to add a timestamp to the pattern (after the sequence number, with a preceding underscore), and then parse this event log e.g.

I need to capture the timestamps as well.

I attempted to adjust the regex, with mixed results. Please see this demo: Demo 2

Ideally I'd like to see something like this for each event:

Match n
Full match  266-308 `2_12/08/2014 09:17CONTENT_ACCESS`
Group 1. 266-267    `2`
Group 2. 268-284    `12/08/2014 09:17`
Group 3. 284-308    `CONTENT_ACCESS`

I hope you can help me. A solution that works with the PCRE flavour on regex101 is sufficient (for the record, I am using the Perl-compatible str_match_all_perl function in R).

Many thanks in advance.

Upvotes: 0

Views: 76

Answers (1)

CrafterKolyan

Reputation: 1052

(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/1
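
As a rough illustration of what this first pattern captures, here is a minimal Python sketch. Python is used only for demonstration (the question uses R), and `simple_log` is a made-up sample written in the format of the desired output shown in the question, not the asker's real log. Note that this pattern matches every event it finds, regardless of whether the EXE events are followed by CONTENT_ACCESS events.

```python
import re

# The simple pattern from this answer: sequence number, lazy timestamp, event name.
SIMPLE_EVENT = re.compile(r"(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)")

# Hypothetical sample log (assumption: events are concatenated with no separator).
simple_log = "1_12/08/2014 09:15EXE_IN2_12/08/2014 09:17CONTENT_ACCESS"


def extract_events(log):
    """Return a list of (sequence, timestamp, event) tuples found in the log."""
    return SIMPLE_EVENT.findall(log)


for seq, timestamp, event in extract_events(simple_log):
    print(seq, timestamp, event)
```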

Following the comments, it was changed to: (?:\G(?!^)(?(?=\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))(?<!\d_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS))|(?=(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}(?:EXE_CO|EXE_IN))+(?:\d+_\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2}CONTENT_ACCESS)+))(\d+)_(\d{2}\/\d{2}\/\d{4}\s\d{2}\:\d{2})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/3

And another version, which is shorter: (?:\G(?!^)(?(?=\d+_.{16}(?:EXE_CO|EXE_IN))(?<!\d_.{16}CONTENT_ACCESS))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+(?:\d+_.{16}CONTENT_ACCESS)+))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/4

And an even shorter one: (?:\G(?!^)(?(?=\d+_.{16}E)(?<!S))|(?=(?:\d+_.{16}(?:EXE_CO|EXE_IN))+\d+_.{16}C))(\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/5

And a super short one: (?:\G|(?=\d+_.{16}E.*CON))(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/8
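
The longer patterns above rely on `\G` and a lookahead conditional to accept events only when they belong to a run of one or more EXE events followed by one or more CONTENT_ACCESS events. Python's `re` module supports neither `\G` nor lookahead conditionals, so the sketch below swaps in a different technique, a two-pass approach: first find each whole run, then split it into events. It is not a translation of the regexes above and may differ from them in edge cases. The names `RUN`, `STRICT_EVENT`, `parse_runs` and the sample `mixed_log` are made up for illustration.

```python
import re

# Timestamp shape taken from the question's desired output, e.g. "12/08/2014 09:17".
TS = r"\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}"

# Pass 1: a whole run = one or more EXE events, then one or more CONTENT_ACCESS events.
RUN = re.compile(
    r"(?:\d+_" + TS + r"(?:EXE_CO|EXE_IN))+(?:\d+_" + TS + r"CONTENT_ACCESS)+"
)

# Pass 2: split a run into (sequence, timestamp, event) groups.
STRICT_EVENT = re.compile(r"(\d+)_(" + TS + r")(EXE_CO|EXE_IN|CONTENT_ACCESS)")

# Hypothetical log: a valid run of four events, then a trailing EXE_IN
# that is not followed by any CONTENT_ACCESS event.
mixed_log = (
    "1_12/08/2014 09:15EXE_IN"
    "2_12/08/2014 09:16EXE_CO"
    "3_12/08/2014 09:17CONTENT_ACCESS"
    "4_12/08/2014 09:18CONTENT_ACCESS"
    "5_12/08/2014 09:20EXE_IN"
)


def parse_runs(log):
    """Return one list of (sequence, timestamp, event) tuples per valid run."""
    return [STRICT_EVENT.findall(run.group(0)) for run in RUN.finditer(log)]


for run in parse_runs(mixed_log):
    for seq, timestamp, event in run:
        print(seq, timestamp, event)
```

With this sample, the trailing event numbered 5 is skipped because no CONTENT_ACCESS event follows it. In R the PCRE patterns above can be used directly, so this two-pass version is mainly useful when the regex engine lacks `\G`.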

Upvotes: 1
