BM100

Reputation: 25

Java Regex Lookaround Query - I am struggling

So I have been asked to write a script that takes a large IIS log as input and processes it for logging purposes. The IIS logs contain a lot of information that is useless to me, plus a few fields that show when a user accesses something. These are in the format domain\identity.

I have the capture group:

(DOMAIN\\[a-z]\d+)

This matches the domain name and the identity, which starts with a single letter followed by a variable number of digits. Examples: test\t123456 or test\b213.

I was hoping someone better at Java regex than me could help me figure out how to match everything APART from that capture group. I want to run a replacement that deletes everything that isn't that.

Because I have that capture group, I could always just write the matches to a new file and achieve the same output. But the tool I use (Apache NiFi) makes it easy to replace things, whereas building a new output from the matches would be more fiddly (e.g. using an actual script).

I know there are probably countless easier ways of doing what I want, but having wasted 20 minutes playing on regex101 in vain, I was hoping someone could enlighten me. An example line in the log looks like this:

testingtesting123 test\t12345 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123

Upvotes: 0

Views: 48

Answers (1)

Patrick Janser

Reputation: 4244

What about matching the whole log entry, including your capturing group and the trailing newline, and then replacing it with just the capturing group and the newline?

Then add an alternative that matches a full log entry without the identity. Nothing from it is captured, so the substitution replaces it with nothing and the entry is dropped.

The commented regex, Java flavour, with the m and x flags:

^ # Start of a log entry (assuming each entry starts on a new line).
(?: # Two variants:
  # A) A line to keep, from which we extract the domain\user.
  # Anything, lazily, to avoid "eating" the domain\user.
  .*?
  # Word boundary, domain\user, word boundary.
  \b(?<domainUser>DOMAIN\\[a-z]\d+)\b
  # The rest of the line and the captured newline, to use in the replacement.
  .*(?<newLine>\R|\z)
|
  # B) A log entry without the interesting domain\user.
  .*\R
)

The substitution would be ${domainUser}${newLine}.
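
To try this outside NiFi, here is a minimal sketch in plain Java, assuming the whole log is available as a String. The class name DomainUserFilter, the method keepDomainUsers and the sample input are made up for illustration. The pattern is the one above, compiled with Pattern.MULTILINE and Pattern.COMMENTS (the m and x flags).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of NiFi or of the original answer.
class DomainUserFilter {
    // Same regex as above. COMMENTS (x) ignores whitespace and "# ..." comments,
    // MULTILINE (m) lets ^ match at the start of every line.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^                               # start of a log entry\n" +
        "(?:\n" +
        "  .*?                           # lazily skip up to the identity\n" +
        "  \\b(?<domainUser>DOMAIN\\\\[a-z]\\d+)\\b\n" +
        "  .*(?<newLine>\\R|\\z)         # rest of the line and its terminator\n" +
        "|\n" +
        "  .*\\R                         # entry without the identity\n" +
        ")",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Keeps only the identity (and line terminator) of matching lines.
    // For lines matched by the second alternative both groups are unset,
    // so Java substitutes nothing and the line disappears.
    static String keepDomainUsers(String log) {
        return LOG_LINE.matcher(log).replaceAll("${domainUser}${newLine}");
    }

    public static void main(String[] args) {
        // Assumed sample input: two lines to keep, one to drop.
        String log = "a DOMAIN\\b213 x\nb test\\t1 y\nc DOMAIN\\z5 z";
        System.out.println(keepDomainUsers(log));
    }
}
```

Note that, as written, variant B requires a line terminator, so a final line without the identity and without a trailing newline would be left in place.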

In action: https://regex101.com/r/nkUSR3/1

If you can't use the x flag for comments, or can't use named groups, it can be condensed (at the cost of readability) like this:

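// Group 1 = domain\user, group 2 = line terminator (or end of input).
// Lines without the identity are consumed by the second alternative.
const regex = /^(?:.*?\b(DOMAIN\\[a-z]\d+)\b.*(\r?\n|$)|.*\r?\n)/gm;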

const input = `testingtesting123 DOMAIN\\b213 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 test\\t12345 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 domain2\\t54321 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 domain2-testing testing54321
testingtesting123 DOMAIN\\z5642 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 test\\t12345 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 test\\b145 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 test\\z24592 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 test\\p345 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123
testingtesting123 DOMAIN\\y452 512.1235.212.321 Apples+Test/9.9.9+(Product:+129+10.492.29) - 400 testing testing123`;
// Unset groups expand to an empty string, so non-matching lines are removed.
const substitution = `$1$2`;

console.log('Substitution result: ');
console.log(input.replace(regex, substitution));
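
Since the question is about the Java flavour, the same condensed pattern could be sketched in Java roughly as follows. The class and method names are hypothetical. The inline (?m) stands in for the m flag, and the numbered references $1 and $2 work the same way in Java's replaceAll.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class SimplifiedDomainUserFilter {
    // Condensed regex: group 1 = domain\user, group 2 = line terminator or end of line.
    private static final Pattern LOG_LINE = Pattern.compile(
        "(?m)^(?:.*?\\b(DOMAIN\\\\[a-z]\\d+)\\b.*(\\r?\\n|$)|.*\\r?\\n)");

    // Replaces each matched line by the identity plus its terminator;
    // lines matched by the second alternative are replaced by nothing.
    static String keepDomainUsers(String log) {
        return LOG_LINE.matcher(log).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        // Assumed sample input with Windows line endings, as IIS logs often have.
        String log = "a DOMAIN\\b213 x\r\nb test\\t1 y\r\nc DOMAIN\\z5 z\r\n";
        System.out.print(keepDomainUsers(log));
    }
}
```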

Upvotes: 0
