Abhishek

Reputation: 7045

Need to create regex for analyzing rails server log

I have a Rails server log file whose format is as follows.

Started <REQUEST_TYPE_1> <URL_1> for <IP_1> at <TIMESTAMP_1>
  Processing by <controller#action_1> as <REQUEST_FORMAT_1>
  Parameters: <parameters_1>
<Some logs from code>
Rendered <some_template_1> (<timetaken_1>)
Completed <RESPONSE_CODE_1> in <TIME_1>


Started <REQUEST_TYPE_2> <URL_2> for <IP_2> at <TIMESTAMP_2>
  Processing by <controller#action_2> as <REQUEST_FORMAT_2>
  Parameters: <parameters_2>
<Some logs from code>
Completed <RESPONSE_CODE_2> in <TIME_2>

Now I need to parse this log and extract every REQUEST_TYPE, URL, IP, TIMESTAMP, REQUEST_FORMAT, and RESPONSE_CODE from it. I'm struggling to write a good regex for this in Java or Ruby. The <> characters are not present in the actual input; I added them for readability and to mask the actual data.

Example request:

Started GET "/google.com/2" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015
  Processing by MyController#method as JS
  Parameters: {"abc" => "xyz"}
[LOG] 3 : User text log
Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)


Started POST "/google.com/543" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015
  Processing by MyController#method_2 as JSON
  Parameters: {"efg" => "uvw"}
Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)

Expected Output:

request_types = ['GET', 'POST']
urls = ['/google.com/2','/google.com/543']
ips = ['127.0.0.1','127.0.1.1']
timestamps = ['Tue Dec 01 12:01:13 +0530 2015','Tue Dec 01 13:13:16 +0530 2015']
request_formats = ['JS','JSON']
response_codes = ['200 OK','404 Not Authorized']

I was able to write the following regexes, but they don't work as expected.

request_types = /Started \w+/  //Expected array of all request types
urls = /"\/.*\/"/ //Expected array of all urls types
ips = /"d{1,3}.d{1,3}.d{1,3}.d{1,3}"/ //Expected array of all ips types
timestamps =  /at \w+/
request_formats =/as \w+/
response_codes = /Completed \w+/

I'm hoping for help creating regexes that extract these fields from the given input in Java or Ruby. I would prefer Java, if possible.

Upvotes: 1

Views: 216

Answers (1)

Wiktor Stribiżew

Reputation: 627468

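Here is a Java snippet showing how to get the details from the log into separate array lists:

// Imports needed by the snippet; the statements below are meant to sit inside a method such as main.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;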

String re = "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)(?:\\s+Parameters:\\s+\\S+)?(?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?";
String str = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n  Processing by MyController#method as JS\n  Parameters: {\"abc\" => \"xyz\"}\n[LOG] 3 : User text log\nCompleted 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n\n\nStarted POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n  Processing by MyController#method_2 as JSON\n  Parameters: {\"efg\" => \"uvw\"}\nCompleted 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";
Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(str);
List<String> requesttypes = new ArrayList<String>();
List<String> urls = new ArrayList<String>();
List<String> ips = new ArrayList<String>();
List<String> timestamps = new ArrayList<String>(); 
List<String> requestformats = new ArrayList<String>(); 
List<String> responsecodes = new ArrayList<String>();
while (matcher.find()){
    requesttypes.add(matcher.group("requesttype"));
    urls.add(matcher.group("url"));
    ips.add(matcher.group("ip"));
    timestamps.add(matcher.group("tsp"));
    requestformats.add(matcher.group("requestformat"));
    responsecodes.add(matcher.group("responsecode"));
    System.out.println("-----------------------");
    System.out.println(matcher.group("requesttype"));
    System.out.println(matcher.group("url")); 
    System.out.println(matcher.group("ip")); 
    System.out.println(matcher.group("tsp")); 
    System.out.println(matcher.group("requestformat")); 
    System.out.println(matcher.group("responsecode")); 
} 

See the IDEONE demo. Once the matching is done, you can also print the lists directly, e.g. with System.out.println(urls):

System.out.println(requesttypes);
System.out.println(urls);
System.out.println(ips);
System.out.println(timestamps);
System.out.println(requestformats);
System.out.println(responsecodes);

See this demo. The output is:

[GET, POST]
[/google.com/2, /google.com/543]
[127.0.0.1, 127.0.1.1]
[Tue Dec 01 12:01:13 +0530 2015, Tue Dec 01 13:13:16 +0530 2015]
[JS, JSON]
[200 OK, 404 Not Authorized]
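
If one regex per field is preferred, as in the question's attempt, the following is a minimal sketch of that approach. The class name FieldExtractor, the findAll helper, and the pattern constants are hypothetical names made up for this example. Each pattern wraps the wanted text in capture group 1 and anchors on the line's fixed prefix. The attempts in the question appear to fail mainly because they lack a capture group, write d instead of \d, and expect quotes or a trailing slash that the log does not contain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names below are illustrative only.
class FieldExtractor {
    // One pattern per field; the wanted text is always capture group 1.
    static final String REQUEST_TYPE   = "^Started (\\w+)";
    static final String URL            = "^Started \\w+ \"([^\"]+)\"";
    static final String IP             = "^Started .*? for (\\d{1,3}(?:\\.\\d{1,3}){3}) at ";
    static final String TIMESTAMP      = "^Started .*? at (.+)$";
    static final String REQUEST_FORMAT = "^[ \\t]*Processing by \\S+ as (\\S+)";
    static final String RESPONSE_CODE  = "^Completed (\\d+.*?) in \\d";

    // Collects group 1 of every match; MULTILINE lets ^ and $ work per line.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input copied from the question.
        String log = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
                   + "  Processing by MyController#method as JS\n"
                   + "  Parameters: {\"abc\" => \"xyz\"}\n"
                   + "[LOG] 3 : User text log\n"
                   + "Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n\n\n"
                   + "Started POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n"
                   + "  Processing by MyController#method_2 as JSON\n"
                   + "  Parameters: {\"efg\" => \"uvw\"}\n"
                   + "Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";
        System.out.println(findAll(REQUEST_TYPE, log));
        System.out.println(findAll(URL, log));
        System.out.println(findAll(IP, log));
        System.out.println(findAll(TIMESTAMP, log));
        System.out.println(findAll(REQUEST_FORMAT, log));
        System.out.println(findAll(RESPONSE_CODE, log));
    }
}
```

For the sample log this should yield the same values as the lists shown above. One trade-off to keep in mind: because every field is matched independently, the lists can fall out of step when a request block lacks a line (for example, no Completed line), whereas the single combined regex keeps the fields of one request together.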

The regex (the re string above) matches, piece by piece:

  • (?sm)^ - inline flags (s lets . match newlines, m lets ^ match at the start of every line), then the start of a line
  • Started\\s+ - literal Started string and 1+ whitespaces
  • (?<requesttype>\\S+) - Group "request type" holding 1+ non-whitespace chars
  • \\s+\" - 1+ whitespace followed with "
  • (?<url>\\S+) - Group "url" holding 1+ non-whitespace
  • \"\\s+for\\s+ - " followed with 1+ whitespace + for + 1+ whitespace
  • (?<ip>\\d+(?:\\.\\d+)+) - IP group containing digits + . + digits (.+digits 1+ times)
  • \\s+at\\s+ - the word at surrounded with whitespace
  • (?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}) - timestamp group holding letters and digits separated with whitespace, in the order seen in the input examples
  • \\s+ - 1+ whitespace
  • (?:Processing\\s+by\\s+\\S+)\\s+as\\s+ - Processing by followed with some word (1+ non-whitespaces) followed with the word as surrounded with whitespace
  • (?<requestformat>\\S+) - Group "request format" that consists of non-whitespace symbols
  • (?:\\s+Parameters:\\s+\\S+)? - optional group: Parameters: followed with whitespace(s) and 1+ non-whitespace chars
  • (?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))? - an optional group (since it is enclosed in (?:...)?) that matches any characters up to Completed without running past the next Started line (due to the tempered greedy token (?:(?!\nStarted ).)*), and then matches Completed followed with a whitespace; after that, (?<responsecode>\\d+(?:(?!\\sin\\s).)*) matches and captures into Group "response code" the digits plus any following characters up to the whole word in surrounded with whitespace.
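
The tempered greedy token from the last bullet is easier to see in isolation. The sketch below is only an illustration: the class name TemperedTokenDemo, its constants, and the shortened log (whose first request has no Completed line) are assumptions made up for this example. It compares the tempered token with a plain .* under the s flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample data are illustrative only.
class TemperedTokenDemo {
    // Tempered: a character is consumed only if "\nStarted " does not begin there,
    // so the match cannot cross into the next request block.
    static final String TEMPERED =
        "(?s)^Started (?:(?!\nStarted ).)*Completed\\s(\\d+(?:(?!\\sin\\s).)*)";
    // Naive: with the s flag, .* is free to run across later Started lines.
    static final String NAIVE =
        "(?s)^Started .*Completed\\s(\\d+(?:(?!\\sin\\s).)*)";

    // Returns the response text captured for the block at the very start
    // of the input, or null if that block yields no match.
    static String responseOfFirstBlock(String regex, String log) {
        Matcher m = Pattern.compile(regex).matcher(log);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The first request never completes; only the second one does.
        String log = "Started GET \"/a\" for 127.0.0.1\n"
                   + "[LOG] request crashed\n"
                   + "\n"
                   + "Started GET \"/b\" for 127.0.0.1\n"
                   + "Completed 200 OK in 5ms";
        System.out.println(responseOfFirstBlock(TEMPERED, log)); // prints null
        System.out.println(responseOfFirstBlock(NAIVE, log));    // prints 200 OK
    }
}
```

The naive pattern wrongly credits the first request with the second request's response, while the tempered version reports nothing for it. This is why the full regex can leave the response code group empty (null) for a request that has no Completed line instead of borrowing one from a later request.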

Upvotes: 2
