Guillaume

Reputation: 5557

java regex: capture multiline sequence between tokens

I'm struggling with a regex for splitting log files into log sequences, so that I can then match patterns inside those sequences. The log format is:

timestamp fieldA fieldB fieldn log message1 
timestamp fieldA fieldB fieldn log message2
log message2bis
timestamp fieldA fieldB fieldn log message3 

The timestamp regex is known.

I want to extract every log sequence (potentially multiline) between timestamps, and I want to keep the timestamp.

At the same time, I want to keep an exact count of lines.

What I need to know is how to decorate the timestamp pattern so that it splits my log file into log sequences. I cannot split the whole file as a String, since the file content is provided in a CharBuffer.

Here is a sample method that will use this log sequence matcher:

private void matches(File f, CharBuffer cb) {
    Matcher sequenceBreak = sequencePattern.matcher(cb);    // sequence matcher
    int lines = 1;
    int sequences = 0;

    while (sequenceBreak.find()) {
        sequences++;

        String sequence = sequenceBreak.group();
        if (filter.accept(sequence)) {
            System.out.println(f + ":" + lines + ":" + sequence);                
        }

        //count lines
        Matcher lineBreak = LINE_PATTERN.matcher(sequence);
        while (lineBreak.find()) {
            lines++;
        }

        if (sequenceBreak.end() == cb.limit()) {
            break;
        }
    }        
}

Upvotes: 1

Views: 1985

Answers (3)

Jan Goyvaerts

Reputation: 22009

If I understand your question correctly, you want to split a file using a regular expression, but you can't use Java's built-in split() method. In that case, just write your own split() method.

Iterate over all the regex matches. For the first match, store the timestamp and the ending position of the match. For subsequent matches, take the text between the stored ending position of the previous match and the starting position of the present match and associate that with the previous match. Then store the timestamp and ending position of the present match. After the loop, take the text between the stored ending position of the last match and the end of the file and associate that with the last match.

Using a regex that matches just the timestamps and using a bit of procedural code to get the text between the timestamps will be (far) more efficient than trying to come up with a regex that matches the timestamp and everything up to the next timestamp.
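The loop described above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the timestamp format (`yyyy-MM-dd HH:mm:ss`) and the class and method names are assumptions, and the real timestamp regex should be substituted. As a small variation, it remembers the start of each timestamp match rather than its end, so the timestamp stays part of the extracted sequence. It works on any CharSequence, so a CharBuffer can be passed directly.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampSplitter {
    // Hypothetical timestamp format (e.g. "2010-03-01 12:00:00"); replace with the real regex.
    static final Pattern TIMESTAMP =
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}", Pattern.MULTILINE);

    /** Returns one string per log sequence, each beginning with its timestamp. */
    static List<String> split(CharSequence input) {
        List<String> sequences = new ArrayList<>();
        Matcher m = TIMESTAMP.matcher(input);
        int start = -1; // start offset of the sequence currently being collected
        while (m.find()) {
            if (start >= 0) {
                // Everything from the previous timestamp up to this one is one sequence.
                sequences.add(input.subSequence(start, m.start()).toString());
            }
            start = m.start();
        }
        if (start >= 0) {
            // The last sequence runs to the end of the input.
            sequences.add(input.subSequence(start, input.length()).toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        CharBuffer cb = CharBuffer.wrap(
            "2010-03-01 12:00:00 A B message1\n"
          + "2010-03-01 12:00:01 A B message2\n"
          + "message2bis\n"
          + "2010-03-01 12:00:02 A B message3\n");
        for (String sequence : split(cb)) {
            System.out.print("[" + sequence + "]");
        }
    }
}
```

Because each sequence keeps its line separators, the line count can still be obtained by counting line breaks inside each returned string.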

Upvotes: 1

Alan Moore

Reputation: 75272

It sounds like you want the regex to match the entire log sequence, from the timestamp to the end of the last line, including the line separator. Assuming every log sequence but the last one is followed immediately by another log sequence, you should be able to use a lookahead for a timestamp to find the end of the sequence.

Pattern sequencePattern = Pattern.compile(
    "^timestamp.*?(?=timestamp|\\z)",
    Pattern.DOTALL | Pattern.MULTILINE);

If that's not fast or accurate enough, this should work better:

Pattern sequencePattern = Pattern.compile(
    "^timestamp.*+(?:(?:\r\n|[\r\n])(?!timestamp).*+)*+(?:\r\n|[\r\n])?",
    Pattern.MULTILINE);

Of course, I'm assuming you'll replace timestamp with the real timestamp regex. Just out of curiosity, have you considered using Scanner's findWithinHorizon method for this? Seems to me it could save you a lot of work.
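For illustration, here is a sketch of how the second pattern might be plugged into a find() loop that also tracks line numbers. The timestamp regex (`yyyy-MM-dd HH:mm:ss`) and all class, constant, and method names are assumptions made for the example. It also assumes the input starts with a timestamp and that sequences follow one another without gaps; otherwise the line count would drift.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SequenceLineCounter {
    // Hypothetical timestamp regex; replace with the real one.
    static final String TS = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}";
    // A single line separator: CRLF, CR, or LF.
    static final String EOL = "(?:\\r\\n|[\\r\\n])";

    // Timestamp line, then any following lines that do not start with a timestamp,
    // then the trailing line separator if there is one.
    static final Pattern SEQUENCE = Pattern.compile(
        "^" + TS + ".*+(?:" + EOL + "(?!" + TS + ").*+)*+" + EOL + "?",
        Pattern.MULTILINE);
    static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|[\\r\\n]");

    /** Returns the 1-based line number on which each log sequence starts. */
    static List<Integer> startLines(CharSequence input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = SEQUENCE.matcher(input);
        int line = 1;
        while (m.find()) {
            result.add(line);
            // Advance the line counter by the number of line breaks in this sequence.
            Matcher lineBreak = LINE_BREAK.matcher(m.group());
            while (lineBreak.find()) {
                line++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String log =
            "2010-03-01 12:00:00 A B message1\n"
          + "2010-03-01 12:00:01 A B message2\n"
          + "message2bis\n"
          + "2010-03-01 12:00:02 A B message3\n";
        System.out.println(startLines(log)); // prints [1, 2, 4]
    }
}
```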

Upvotes: 1

Bozho

Reputation: 597422

I don't see any regex in your code, but here's a hint:

By default the dot . in a regex matches everything except a line terminator. If you want it to match a new line as well, you need to pass Pattern.DOTALL as the flags argument to Pattern.compile(str, flags).

Another way to match new-lines is to use the predefined character class \s, which matches [ \t\n\x0B\f\r]
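A small sketch of the difference, using a throwaway pattern and input chosen only for the demonstration (the class and method names are hypothetical). Note that matches() requires the whole input to match, and that \s has to be written as "\\s" inside a Java string literal.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Default mode: '.' does not match a line terminator.
    static boolean dotDefault(String s) {
        return Pattern.compile("a.b").matcher(s).matches();
    }

    // DOTALL mode: '.' matches any character, including line terminators.
    static boolean dotAll(String s) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(s).matches();
    }

    // \s matches whitespace, which includes '\n' and '\r'.
    static boolean whitespace(String s) {
        return Pattern.compile("a\\sb").matcher(s).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        System.out.println(dotDefault(input)); // prints false
        System.out.println(dotAll(input));     // prints true
        System.out.println(whitespace(input)); // prints true
    }
}
```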

Upvotes: 0
