Pankaj Agrawal

Reputation: 57

Extract text without conditional lookahead in Java

I would like to extract the text in bold using Java regex support.

I could get it working using conditional lookahead, with the regex being

(\d{2})(\d{1,2})(\d{1,2})\s+(\d{1,2}):(\d{1,2}):(\d{1,2})\s+(\S+)\s+(?(?=.*\d{4}-\d{1,2}-\d{1,2})([^\d{4}]*)|(.*))

However, Java's Pattern class doesn't support conditional lookaheads. Is there a way to rewrite the regex so that it works with Pattern?

160203 03:24:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql2016-02-03 03:24:25 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).2016-02-03 03:24:25 0 [Note] /opt/devenv/mysql/mysql-5.6.27-linux-glibc2.5-x86_64/bin/mysqld (mysqld 5.6.27) starting as process 29491 ...2016-02-03 03:24:25 29491 [Note] IPv6 is available.

160203 21:33:17 mysqld_safe Number of processes running now: 0

160203 21:33:17 mysqld_safe mysqld restarted2016-02-03 21:33:18 1125 [Note] Server hostname (bind-address): '*'; port: 33062016-02-03 21:33:18 1125 [Note] IPv6 is available.

Upvotes: 0

Views: 102

Answers (1)

Alan Moore

Reputation: 75222

What you're looking for is a tempered lookahead:

(?:(?!\d{4}-\d{1,2}-\d{1,2}).)*

This matches everything up to (but not including) the next thing that looks like a date, or the next line end, whichever comes first. It does this by checking each character before it's consumed to make sure it's not the first character of a date. To use this in Java:

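import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile(
    "(?m)^(\\d{2})(\\d{1,2})(\\d{1,2})\\s+(\\d{1,2}):(\\d{1,2}):(\\d{1,2})\\s+(\\S+)\\s+((?:(?!\\d{4}-\\d{1,2}-\\d{1,2}).)*)");
Matcher m = p.matcher(s);  // s is the log text
while (m.find()) {
    // whole match: m.group()
    // text up to the next date: m.group(8)
}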

The (?m)^ makes sure each match starts at the beginning of a line.
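
For reference, here is a self-contained sketch of the snippet above, run against two of the sample lines from the question (the second one abbreviated). The class and method names (TemperedLookaheadDemo, extractMessages) are made up for illustration, and it assumes the text you want is the eighth capture group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedLookaheadDemo {
    // Same regex as above; group 8 is the tempered part that stops before the next date.
    private static final Pattern LINE = Pattern.compile(
        "(?m)^(\\d{2})(\\d{1,2})(\\d{1,2})\\s+(\\d{1,2}):(\\d{1,2}):(\\d{1,2})\\s+(\\S+)\\s+"
        + "((?:(?!\\d{4}-\\d{1,2}-\\d{1,2}).)*)");

    // Hypothetical helper: collects group 8 from every match in the input.
    static List<String> extractMessages(String log) {
        List<String> messages = new ArrayList<>();
        Matcher m = LINE.matcher(log);
        while (m.find()) {
            messages.add(m.group(8));
        }
        return messages;
    }

    public static void main(String[] args) {
        String log =
            "160203 21:33:17 mysqld_safe Number of processes running now: 0\n"
            + "160203 21:33:17 mysqld_safe mysqld restarted2016-02-03 21:33:18 1125 [Note] IPv6 is available.";
        for (String message : extractMessages(log)) {
            System.out.println(message);
        }
        // prints:
        // Number of processes running now: 0
        // mysqld restarted
    }
}
```

The first line has no date after the message, so the match runs to the end of the line; the second stops right before 2016-02-03.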

I should note that this is not equivalent to your conditional, but I think it's what you really wanted. To see the difference, take this hypothetical input:

160203 21:33:17 mysqld_safe process1 restarted2016-02-03 21:33:18 1125

...your regex stops before the 1 in process1.

The [^\d{4}]* in your regex is apparently meant to stop at the next four-digit sequence, but it's really a negated character class: it matches any character that's not one of {, }, or a digit, so it stops at the first digit or brace it meets. Of course, it only does that after the lookahead has determined that there's a date up ahead.
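
If you want to see the difference for yourself, the sketch below compares the two approaches on the tail of the hypothetical line. Java has no conditional syntax, so the EMULATED_CONDITIONAL pattern is my own approximation of your conditional using a positive and a negative lookahead; treat it as an assumption rather than an exact port. The class, constant, and helper names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionalVsTemperedDemo {
    static final String DATE = "\\d{4}-\\d{1,2}-\\d{1,2}";

    // Approximation of (?(?=.*DATE)[^\d{4}]*|.*): first branch if a date lies ahead, else second.
    static final String EMULATED_CONDITIONAL =
        "(?:(?=.*" + DATE + ")[^\\d{4}]*|(?!.*" + DATE + ").*)";

    // The tempered lookahead from this answer.
    static final String TEMPERED = "(?:(?!" + DATE + ").)*";

    // Hypothetical helper: returns what the regex matches starting at the first character.
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tail = "process1 restarted2016-02-03 21:33:18 1125";

        // The character class gives up at the first digit, the 1 in process1.
        System.out.println(matchAtStart(EMULATED_CONDITIONAL, tail)); // prints "process"

        // The tempered lookahead only stops where a whole date begins.
        System.out.println(matchAtStart(TEMPERED, tail)); // prints "process1 restarted"

        // Braces are ordinary members of the class, not a quantifier.
        System.out.println(matchAtStart("[^\\d{4}]*", "ab{cd")); // prints "ab"
    }
}
```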

Upvotes: 1
