karthick

Reputation: 6158

Extracting the log using regex - java?

I have the following method for splitting a log line into fields. The log format is exactly as shown below, but the values may change:

29-11-2013 19:18:53 192.2.2.22 66 192.2.2.22 8080 GET 402 103 103 HTTP/1.1 192.2.2.22 http://in.sample.com/parties/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13

Code as follows:

     String regex = "^([0-9-]*)\\s([0-9:]*)\\s([0-9\\\\.]*)\\s([0-9]*|-)\\s([0-9\\\\.]*)\\s([0-9]*)\\s(GET|POST)\\s([0-9]*)\\s([0-9]*)\\s([0-9]*)\\s([a-zA-Z0-9\\\\./]*)\\s([a-zA-Z0-9:./]*)\\s(.*)\\s(.*)";
     String pattern = "$1~~$2~~$3~~$4~~$5~~$6~~$7~~$8~~$9~~$10~~$11~~$12~~$13~~$14";
     String values = "29-11-2013 19:18:53 192.2.2.22 66 192.2.2.22 8080 GET 402 103 103 HTTP/1.1 192.2.2.22 http://in.sample.com/parties/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13";
     List<Object> params = new ArrayList<Object>();
     String formattedString = values.replaceAll(regex, pattern);
     String[] fields = formattedString.split("~~");
     for (String field : fields) {
         params.add(field);
     }
     System.out.println(params);

Problem:

It is not splitting the log correctly.

The split goes wrong after the URL http://in.sample.com/parties/.

The user agent contains spaces, so the log separation does not work as expected.

Output:

[29-11-2013, 19:18:53, 192.2.2.22, 66, 192.2.2.22, 8080, GET, 402, 103, 103, HTTP/1.1, 192.2.2.22, http://in.sample.com/parties/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29, Safari/525.13]

Required Output:

[29-11-2013, 19:18:53, 192.2.2.22, 66, 192.2.2.22, 8080, GET, 402, 103, 103, HTTP/1.1, 192.2.2.22, http://in.sample.com/parties/, Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML like Gecko) Chrome/0.2.149.29 Safari/525.13]

Any help will be great.

Upvotes: 0

Views: 96

Answers (2)

Casimir et Hippolyte

Reputation: 89547

You don't need a regex to do that. Since your log always contains 14 fields and the problematic spaces are in the last field, all you need is the split method with its second parameter (limit):

String[] fields = values.split(" ", 14);
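For illustration, here is a minimal runnable sketch of that split-with-limit approach using the sample line from the question. The class name `LogSplitExample`, the helper `splitLog`, and the constant `FIELD_COUNT` are made up for this example; it assumes the first 13 fields never contain spaces and are separated by exactly one space each.

```java
import java.util.Arrays;
import java.util.List;

class LogSplitExample {
    // Number of fields in the log format from the question (assumed fixed).
    private static final int FIELD_COUNT = 14;

    // Splits on single spaces but produces at most FIELD_COUNT parts, so the
    // last part (the user agent) keeps its embedded spaces.
    static List<String> splitLog(String line) {
        return Arrays.asList(line.split(" ", FIELD_COUNT));
    }

    public static void main(String[] args) {
        String values = "29-11-2013 19:18:53 192.2.2.22 66 192.2.2.22 8080 GET 402 103 103 "
                + "HTTP/1.1 192.2.2.22 http://in.sample.com/parties/ "
                + "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 "
                + "(KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13";

        List<String> fields = splitLog(values);
        System.out.println(fields.size());   // prints 14
        System.out.println(fields.get(12));  // prints http://in.sample.com/parties/
        System.out.println(fields.get(13));  // prints the full user agent string
    }
}
```

Note that this does no validation: a line with a missing field or doubled spaces would shift values into the wrong positions, so a length check on the result is probably worth adding in real code.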

Upvotes: 1

anubhava

Reputation: 784918

I believe you're missing a match for the HTTP/1.1 part. Try this regex:

String regex = "(?i)^([0-9-]*)\\s([0-9:]*)\\s([0-9.]*)\\s([0-9]*|-)\\s([0-9.]*)\\s([0-9]*)\\s(GET|POST)\\s([0-9]*)\\s([0-9]*)\\s([0-9]*)\\s(HTTP/1\\.[01])\\s([A-Z0-9./]*)\\s([A-Z0-9:./]*)\\s(.*)";

It gives:

["29-11-2013 19:18:53 192.2.2.22 66 192.2.2.22 8080 GET 402 103 103 HTTP/1.1 192.2.2.22 http://in.sample.com/parties/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13", "29-11-2013", "19:18:53", "192.2.2.22", "66", "192.2.2.22", "8080", "GET", "402", "103", "103", "HTTP/1.1", "192.2.2.22", "http://in.sample.com/parties/", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13"]
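As a rough sketch of how a regex along these lines could be used in Java, the capture groups can be read directly with `Pattern` and `Matcher` rather than going through `replaceAll` and a `~~` delimiter. The class name `LogRegexExample` and the helper `parse` are invented for this example, the backslashes are doubled as a Java string literal requires, and the URL character class is kept as narrow as above, so URLs containing characters such as `-` or `?` would not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogRegexExample {
    // 14 capture groups; the final (.*) takes everything after the URL,
    // so the user agent stays in one piece.
    private static final Pattern LOG_LINE = Pattern.compile(
            "^([0-9-]*)\\s([0-9:]*)\\s([0-9.]*)\\s([0-9]*|-)\\s([0-9.]*)\\s([0-9]*)"
            + "\\s(GET|POST)\\s([0-9]*)\\s([0-9]*)\\s([0-9]*)\\s(HTTP/1\\.[01])"
            + "\\s([A-Z0-9./]*)\\s([A-Z0-9:./]*)\\s(.*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the 14 captured fields, or an empty list if the line does not match.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.add(m.group(i));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String values = "29-11-2013 19:18:53 192.2.2.22 66 192.2.2.22 8080 GET 402 103 103 "
                + "HTTP/1.1 192.2.2.22 http://in.sample.com/parties/ "
                + "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 "
                + "(KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13";

        List<String> fields = parse(values);
        System.out.println(fields.size());   // prints 14
        System.out.println(fields.get(12));  // prints http://in.sample.com/parties/
    }
}
```

Unlike a plain split, this rejects lines that do not fit the expected shape, at the cost of a pattern that has to be kept in sync with the log format.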

As an alternative, you can look for and use a dedicated log parser.

Upvotes: 0
