gitmorty

Reputation: 273

Filtering logs with regex in Java

The description is quite long, so please bear with me:
I have log files ranging from 300 MB to 1.5 GB in size, which need to be filtered given a search key.

The format of the logs is something like this:

24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,828 [INFO] 567890 (Blah : Blah1) Service-name:: Content( May span multiple lines)
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[ ID1=fac-adasd ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
4 May 2017 17:00:06,831 [INFO] 567890 (Blah : Blah2) Service-name:: Content( May span multiple lines)

Given the search key 123456, I need to fetch the following:

24 May 2017 17:00:06,827 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content
24 May 2017 17:00:06,829 [INFO] 123456 (Blah : Blah2) Service-name: Multiple line content. Printing Object[ ID1=fac-adasd ID2=123231
ID3=123108 Status=Unknown
Code=530007 Dest=CA
]
24 May 2017 17:00:06,830 [INFO] 123456 (Blah : Blah1) Service-name:: Single line content

The following awk script gets the job done (very slowly):

gawk '/([0-9]{1}|[0-9]{2})\s\w+\s[0-9]{4}/{n=0}/123456/{n=1} n'

It takes around 8 minutes to search a 1 GB log file, and I need to do this for many such files. To top it off, I have multiple search keys, which makes the whole task practically impossible.

My initial solution was to use multithreading. I used a fixed thread pool executor and submitted a task for each file that needs to be filtered. Inside each task, I spawned a new process using Java's Runtime, which executes the gawk script through bash and writes the output to a file; afterwards I merged all the output files.
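To make that concrete, here is a simplified sketch of my current setup (file names here are placeholders, and the final merge step is omitted):

import java.util.*;
import java.util.concurrent.*;

// Simplified sketch: a fixed thread pool with one task per log file; each task
// runs the gawk script through bash, redirecting the matches to a per-file
// output which gets concatenated afterwards.
public class GawkPerFileFilter {
    public static void main(String[] args) throws Exception {
        List<String> logFiles = Arrays.asList("service-a.log", "service-b.log");
        String gawk = "gawk '/([0-9]{1}|[0-9]{2})\\s\\w+\\s[0-9]{4}/{n=0}/123456/{n=1} n'";

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<>();

        for (String log : logFiles) {
            results.add(pool.submit(() -> {
                // run gawk on this file and let the shell redirect the matches
                String cmd = gawk + " " + log + " > " + log + ".filtered";
                Process p = Runtime.getRuntime().exec(new String[]{"bash", "-c", cmd});
                return p.waitFor();                // exit code of the gawk process
            }));
        }
        for (Future<Integer> r : results) r.get(); // wait for every file to finish
        pool.shutdown();
        // ...then merge the *.filtered files into a single result (omitted here)
    }
}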

Although that might seem like a poor way to go about it, since the filtering is I/O-bound rather than CPU-bound, it did give me a speedup compared to executing the script on each file sequentially.

But it still isn't sufficient, as the whole thing takes 2 hours for a single search key across 27 GB of log files. On average, I have 4 such search keys and need to fetch all of their results and put them together.

My method isn't efficient because:

A) It accesses each log file multiple times when multiple search keys are given and causes even more I/O overhead.
B) It incurs the overhead of creating a process inside each thread.

A simple solution to all of this is to move away from awk and do the whole thing in Java, using some regex library. The question is: which regex library could give me the desired output?
With awk I have the /pattern/{action} syntax, which lets me capture a range of multiple lines (as seen above). How can I do the same in Java?
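To spell out the behaviour I'm trying to reproduce, the gawk one-liner is essentially this flag-toggling loop (the class name and the hard-coded key are just for illustration):

import java.io.*;
import java.util.regex.Pattern;

// Illustration only: reset the "printing" flag at the start of every new log
// record (a line beginning with a date), set it when the key is seen, and
// print all lines while it is set - exactly what the gawk script does.
public class AwkLikeFilter {
    private static final Pattern RECORD_START = Pattern.compile("^\\d{1,2}\\s\\w+\\s\\d{4}");

    public static void main(String[] args) throws IOException {
        String key = "123456"; // hard-coded search key for the example
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            boolean printing = false;
            String line;
            while ((line = br.readLine()) != null) {
                if (RECORD_START.matcher(line).find()) printing = false; // /date/{n=0}
                if (line.contains(key)) printing = true;                 // /123456/{n=1}
                if (printing) System.out.println(line);                  // n
            }
        }
    }
}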

I'm open to all kinds of suggestions. For example, an extreme option would be to store the log files in shared storage such as S3 and process them using multiple machines.

I'm new to Stack Overflow and I don't even know if I can post this here, but I've been working on this for the past week and I need someone with expertise to guide me. Thanks in advance.

Upvotes: 8

Views: 985

Answers (2)

Leo Aso

Reputation: 12463

Switching to Java might not be the best option if you're looking to speed up your execution time, but if you're considering it, I wrote a Java class that might help.

You can use it to search one or more keys in a file simultaneously. Since you are reading a log file, it is safe to assume that all lines follow the proper format without errors. So instead of regex format-checking the whole line, it simply skips to where the key should be (the digits after the first ]), and compares it to the required value (assuming it is always a number).

Use it this way:

Set<Integer> keys = new HashSet<>();
keys.add(123456);
keys.add(314159);
/* synchronously (omitting the 3rd argument prints to stdout) */
new KeySearch("path/to/file.log", keys).run();

/* asynchronously (note: new PrintStream(String) throws FileNotFoundException, so handle or declare it) */
PrintStream ps1 = new PrintStream("lines-found1.log");
PrintStream ps2 = new PrintStream("lines-found2.log");
new Thread(new KeySearch('path/to/1.log', keys, ps1::println)).start();
new Thread(new KeySearch('path/to/2.log', keys, ps2::println)).start();

The third argument is a custom interface, KeySearch.Callback, which receives lines as they are found. I use a method reference as an example, but it can be anything you want. Here is the class (requires at least Java 8).

import java.io.*;
import java.util.*;

public class KeySearch implements Runnable {
    public interface Callback { 
        void lineFound(String line); 
    }

    private final Set<Integer> keys;
    private final Callback callback;
    private final String name;

    public KeySearch(String fileName, Collection<Integer> keys) {
        this(fileName, keys, System.out::println);
    }

    public KeySearch(String fileName, Collection<Integer> keys, Callback call) {
        this.keys = new HashSet<>(keys);
        this.name = fileName;
        this.callback = call;
    }

    @Override
    public void run() {
        String s;
        try(FileReader fr = new FileReader(name); 
                BufferedReader br = new BufferedReader(fr)) {
            while ((s = readLine(br)) != null)
                if (matches(s)) callback.lineFound(s);
        } catch (IOException e) {
            System.err.println("Error reading " + name);
            throw new RuntimeException(e);
        }
    }

    private boolean matches(String line) {
        return keys.contains(getKeyOf(line));
    }

    private String readLine(BufferedReader reader) throws IOException {
        StringBuilder line = new StringBuilder();
        String next;
        int open = 0; // unmatched '[' carried across continuation lines

        do {
            next = reader.readLine();
            if (next == null) // EOF: return any partial record instead of dropping it
                return line.length() == 0 ? null : line.toString();
            line.append(next).append(System.lineSeparator());
            for (int i = 0; i < next.length(); i++) {
                if (next.charAt(i) == '[') open++;
                else if (next.charAt(i) == ']') open--;
            }
        } while (open > 0); // keep reading until every '[' has been closed

        return line.toString();
    }

    private boolean isDigit(CharSequence s, int i) {
        char c = s.charAt(i);
        return c >= '0' && c <= '9';
    }

    private int getKeyOf(String line) {
        // find the first ] (e.g. at the end of [INFO])
        // and read the first number after it
        int start = line.indexOf(']');
        while (!isDigit(line, start)) start++;

        int end = start;
        while (isDigit(line, end)) end++;

        return Integer.parseInt(line.substring(start, end));
    }
}

Upvotes: 0

Dinu Sorin

Reputation: 215

You have a few options.

The best one, in my opinion, would be to use an inverted index: for every key that appears in at least one of the logs, you store a reference to all the logs that contain it. But since you have already spent a week on this task, I'd advise using something that already exists and does exactly that: Elasticsearch. You can actually use the full ELK stack (Elasticsearch, Logstash, Kibana, designed mainly for logs) and even have it parse the logs for you, since you can just put a regex in the config file. You only need to index the files once, and after that searches take only a few milliseconds.
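To illustrate the inverted-index idea in Java terms (the class, method names and the key-extracting regex below are made up for this sketch, not part of any library): build, in a single pass over each log, a map from every key to the set of files that contain it; a lookup is then just a map access.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

// Toy inverted index: key -> set of log files containing at least one entry
// for that key. Each file is read only once, no matter how many keys you
// later want to look up.
public class LogInvertedIndex {
    private final Map<String, Set<Path>> index = new HashMap<>();
    // the key is assumed to be the run of digits right after the [LEVEL] field
    private static final Pattern KEY = Pattern.compile("\\]\\s+(\\d+)");

    public void indexFile(Path logFile) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(logFile)) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = KEY.matcher(line);
                if (m.find()) {
                    index.computeIfAbsent(m.group(1), k -> new HashSet<>()).add(logFile);
                }
            }
        }
    }

    public Set<Path> filesContaining(String key) {
        return index.getOrDefault(key, Collections.emptySet());
    }
}

Elasticsearch builds and maintains this kind of index for you (plus full-text search and storage), which is why it is the more practical choice here.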

If you really want to waste energy and not go for the best solution, you can use MapReduce on Hadoop to filter the logs. But this is not a task where MapReduce is optimal; it would be more of a hack.

Upvotes: 1
