Need to create regex for analyzing rails server log

Question

I have a Rails server log file whose format is as follows.

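Started <REQUEST_TYPE> <URL> for <IP> at <TIMESTAMP>
  Processing by <...> as <REQUEST_FORMAT>
  Parameters: <...>

Rendered <...> (<...>)
Completed <RESPONSE_CODE> in <...>


Started <REQUEST_TYPE> <URL> for <IP> at <TIMESTAMP>
  Processing by <...> as <REQUEST_FORMAT>
  Parameters: <...>

Completed <RESPONSE_CODE> in <...>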

I need to parse this log and extract every REQUEST_TYPE, URL, IP, TIMESTAMP, REQUEST_FORMAT and RESPONSE_CODE from it, and I'm struggling to write a good regex for this in Java or Ruby. The <> markers are not present in the actual input; I added them for readability and to mask the actual data.

Example request:

Started GET "/google.com/2" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015
  Processing by MyController#method as JS
  Parameters: {"abc" => "xyz"}
[LOG] 3 : User text log
Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)


Started POST "/google.com/543" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015
  Processing by MyController#method_2 as JSON
  Parameters: {"efg" => "uvw"}
Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)

Expected Output:

request_types = ['GET', 'POST']
urls = ['/google.com/2','/google.com/543']
ips = ['127.0.0.1','127.0.1.1']
timestamps = ['Tue Dec 01 12:01:13 +0530 2015','Tue Dec 01 13:13:16 +0530 2015']
request_formats = ['JS','JSON']
response_codes = ['200 OK','404 Not Authorized']

I was able to write the following regexes, but they don't work as expected.

request_types = /Started \w+/  //Expected array of all request types
urls = /"\/.*\/"/ //Expected array of all urls types
ips = /"d{1,3}.d{1,3}.d{1,3}.d{1,3}"/ //Expected array of all ips types
timestamps =  /at \w+/
request_formats =/as \w+/
response_codes = /Completed \w+/

I would appreciate help creating regexes that extract these parameters from the given input in Java or Ruby. I would prefer Java, if possible.

Wiktor Stribiżew · Accepted Answer

Here is a Java snippet showing how to collect the details from the log into separate lists:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RailsLogParser {
    public static void main(String[] args) {
        // (?s) lets "." match newlines; (?m) lets "^" match at the start of every line.
        // The pattern is split across several string literals for readability only.
        String re = "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)"
                + "\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})"
                + "\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)"
                + "(?:\\s+Parameters:\\s+\\S+)?"
                + "(?:(?:(?:(?!\\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?";
        String str = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
                + "  Processing by MyController#method as JS\n"
                + "  Parameters: {\"abc\" => \"xyz\"}\n"
                + "[LOG] 3 : User text log\n"
                + "Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n"
                + "\n\n"
                + "Started POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n"
                + "  Processing by MyController#method_2 as JSON\n"
                + "  Parameters: {\"efg\" => \"uvw\"}\n"
                + "Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";
        Pattern pattern = Pattern.compile(re);
        Matcher matcher = pattern.matcher(str);
        List<String> requesttypes = new ArrayList<>();
        List<String> urls = new ArrayList<>();
        List<String> ips = new ArrayList<>();
        List<String> timestamps = new ArrayList<>();
        List<String> requestformats = new ArrayList<>();
        List<String> responsecodes = new ArrayList<>();
        // Each find() call consumes one "Started ... Completed ..." request block.
        while (matcher.find()) {
            requesttypes.add(matcher.group("requesttype"));
            urls.add(matcher.group("url"));
            ips.add(matcher.group("ip"));
            timestamps.add(matcher.group("tsp"));
            requestformats.add(matcher.group("requestformat"));
            responsecodes.add(matcher.group("responsecode"));
            System.out.println("-----------------------");
            System.out.println(matcher.group("requesttype"));
            System.out.println(matcher.group("url"));
            System.out.println(matcher.group("ip"));
            System.out.println(matcher.group("tsp"));
            System.out.println(matcher.group("requestformat"));
            System.out.println(matcher.group("responsecode"));
        }
    }
}

See the IDEONE demo. Once the matching is done, you can also print the lists themselves, e.g. System.out.println(urls):

System.out.println(requesttypes);
System.out.println(urls);
System.out.println(ips);
System.out.println(timestamps);
System.out.println(requestformats);
System.out.println(responsecodes);

See this demo. The output is:

[GET, POST]
[/google.com/2, /google.com/543]
[127.0.0.1, 127.0.1.1]
[Tue Dec 01 12:01:13 +0530 2015, Tue Dec 01 13:13:16 +0530 2015]
[JS, JSON]
[200 OK, 404 Not Authorized]
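
The six lists line up by index. If keeping the fields of each request together is more convenient, the same named groups could be collected into one map per match instead. The following is only a sketch under that assumption: the class name RequestRecords, the parse method, and the SAMPLE and GROUPS constants are illustrative and not part of the original snippet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one map per request instead of six parallel lists.
class RequestRecords {
    // Same pattern as in the answer, split across literals for readability.
    static final Pattern REQUEST = Pattern.compile(
            "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)"
            + "\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})"
            + "\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)"
            + "(?:\\s+Parameters:\\s+\\S+)?"
            + "(?:(?:(?:(?!\\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?");

    // Group names used by the pattern, in the order they should appear in each map.
    static final String[] GROUPS =
            {"requesttype", "url", "ip", "tsp", "requestformat", "responsecode"};

    // The example log from the question.
    static final String SAMPLE =
            "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
            + "  Processing by MyController#method as JS\n"
            + "  Parameters: {\"abc\" => \"xyz\"}\n"
            + "[LOG] 3 : User text log\n"
            + "Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n"
            + "\n\n"
            + "Started POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n"
            + "  Processing by MyController#method_2 as JSON\n"
            + "  Parameters: {\"efg\" => \"uvw\"}\n"
            + "Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";

    static List<Map<String, String>> parse(String log) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher m = REQUEST.matcher(log);
        while (m.find()) {
            // LinkedHashMap keeps the fields in GROUPS order when printed.
            Map<String, String> record = new LinkedHashMap<>();
            for (String name : GROUPS) {
                // group(name) is null if the optional response-code part did not match.
                record.put(name, m.group(name));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map<String, String> record : parse(SAMPLE)) {
            System.out.println(record);
        }
    }
}
```

Each map then carries everything known about one request, so a lookup such as record.get("url") does not depend on list indices staying in sync.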

The regex matches:

(?sm)^ - start of a line (^ matches at every line start due to the m option; the s option lets . match newlines as well)
Started\s+ - the literal string Started and 1+ whitespace characters
(?<requesttype>\S+) - Group "requesttype" holding 1+ non-whitespace characters
\s+" - 1+ whitespace characters followed by "
(?<url>\S+) - Group "url" holding 1+ non-whitespace characters
"\s+for\s+ - " followed by 1+ whitespace characters, the word for, and 1+ whitespace characters
(?<ip>\d+(?:\.\d+)+) - Group "ip" containing digits followed by 1+ repetitions of . and digits
\s+at\s+ - the word at surrounded by whitespace
(?<tsp>[a-zA-Z]+\s+[a-zA-Z]+\s+\d+\s+\d+:\d+:\d+\s+\+\d+\s\d{4}) - Group "tsp" (timestamp) holding letters and digits separated by whitespace, in the order seen in the input examples
\s+ - 1+ whitespace characters
(?:Processing\s+by\s+\S+)\s+as\s+ - Processing by followed by some word (1+ non-whitespace characters), followed by the word as surrounded by whitespace
(?<requestformat>\S+) - Group "requestformat" consisting of 1+ non-whitespace characters
(?:\s+Parameters:\s+\S+)? - optional group: Parameters: followed by whitespace and some word
(?:(?:(?:(?!\nStarted ).)*Completed\s)(?<responsecode>\d+(?:(?!\sin\s).)*))? - an optional group (since it is enclosed in (?:...)?) that matches any characters up to Completed as long as they do not run into a line starting with Started (due to the tempered greedy token (?:(?!\nStarted ).)*), then matches Completed followed by a whitespace character, and then (?<responsecode>\d+(?:(?!\sin\s).)*) captures into Group "responsecode" the digits followed by any characters up to the whole word in surrounded by whitespace.
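
To see what the tempered greedy token buys, it may help to compare it with a plain lazy .*? on a log in which the first request has no Completed line. The sketch below uses two deliberately reduced patterns, not the full regex from this answer; the class name TemperedTokenDemo, the group names verb and code, and the extract helper are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing a tempered greedy token with a lazy dot.
class TemperedTokenDemo {
    // Reduced pattern: the scan for "Completed" may not cross a line starting with "Started ".
    static final Pattern TEMPERED = Pattern.compile(
            "(?sm)^Started\\s+(?<verb>\\S+)"
            + "(?:(?:(?!\\nStarted ).)*Completed\\s(?<code>\\d+(?:(?!\\sin\\s).)*))?");

    // Same idea with a lazy dot: nothing stops it from running into the next request.
    static final Pattern NAIVE = Pattern.compile(
            "(?sm)^Started\\s+(?<verb>\\S+)"
            + "(?:.*?Completed\\s(?<code>\\d+(?:(?!\\sin\\s).)*))?");

    // First request deliberately lacks a "Completed" line.
    static final String LOG =
            "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
            + "  Processing by MyController#method as JS\n"
            + "Started POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n"
            + "  Processing by MyController#method_2 as JSON\n"
            + "Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";

    // Returns one "verb -> code" string per match; code is null when the optional group is skipped.
    static List<String> extract(Pattern pattern, String log) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(log);
        while (m.find()) {
            result.add(m.group("verb") + " -> " + m.group("code"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("tempered: " + extract(TEMPERED, LOG));
        System.out.println("naive:    " + extract(NAIVE, LOG));
    }
}
```

With the tempered token, the GET request is reported with a null code and the POST request keeps its own 404 Not Authorized. With the lazy dot, the GET match runs on into the next request, takes its response code, and swallows the POST request entirely. Callers of the full regex should likewise be prepared for matcher.group("responsecode") to return null.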
