I have a Rails server log file whose format is as follows.
Started <REQUEST_TYPE_1> <URL_1> for <IP_1> at <TIMESTAMP_1>
Processing by <controller#action_1> as <REQUEST_FORMAT_1>
Parameters: <parameters_1>
<Some logs from code>
Rendered <some_template_1> (<timetaken_1>)
Completed <RESPONSE_CODE_1> in <TIME_1>
Started <REQUEST_TYPE_2> <URL_2> for <IP_2> at <TIMESTAMP_2>
Processing by <controller#action_2> as <REQUEST_FORMAT_2>
Parameters: <parameters_2>
<Some logs from code>
Completed <RESPONSE_CODE_2> in <TIME_2>
Now I need to parse this log and extract every REQUEST_TYPE, URL, IP, TIMESTAMP, REQUEST_FORMAT, and RESPONSE_CODE from it. I'm struggling to create a good regex for this in Java/Ruby. The <> markers are not present in the actual input; I added them for readability and to mask the actual data.
Example request:
Started GET "/google.com/2" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015
Processing by MyController#method as JS
Parameters: {"abc" => "xyz"}
[LOG] 3 : User text log
Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)
Started POST "/google.com/543" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015
Processing by MyController#method_2 as JSON
Parameters: {"efg" => "uvw"}
Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)
Expected Output:
request_types = ['GET', 'POST']
urls = ['/google.com/2','/google.com/543']
ips = ['127.0.0.1','127.0.1.1']
timestamps = ['Tue Dec 01 12:01:13 +0530 2015','Tue Dec 01 13:13:16 +0530 2015']
request_formats = ['JS','JSON']
response_codes = ['200 OK','404 Not Authorized']
I was able to write the following regexes, but they don't work as expected.
request_types = /Started \w+/ //Expected array of all request types
urls = /"\/.*\/"/ //Expected array of all urls types
ips = /"d{1,3}.d{1,3}.d{1,3}.d{1,3}"/ //Expected array of all ips types
timestamps = /at \w+/
request_formats =/as \w+/
response_codes = /Completed \w+/
I hope to get some help creating regexes that extract these fields from the given input in Java/Ruby. I would prefer Java, if possible.
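Here is a Java snippet showing how to collect the details from the log into separate array lists:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
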
String re = "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)(?:\\s+Parameters:\\s+\\S+)?(?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?";
String str = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n Processing by MyController#method as JS\n Parameters: {\"abc\" => \"xyz\"}\n[LOG] 3 : User text log\nCompleted 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n\n\nStarted POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n Processing by MyController#method_2 as JSON\n Parameters: {\"efg\" => \"uvw\"}\nCompleted 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)";
Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(str);
List<String> requesttypes = new ArrayList<String>();
List<String> urls = new ArrayList<String>();
List<String> ips = new ArrayList<String>();
List<String> timestamps = new ArrayList<String>();
List<String> requestformats = new ArrayList<String>();
List<String> responsecodes = new ArrayList<String>();
while (matcher.find()) {
    // Each successful find() is one "Started ... Completed" request block.
    requesttypes.add(matcher.group("requesttype"));
    urls.add(matcher.group("url"));
    ips.add(matcher.group("ip"));
    timestamps.add(matcher.group("tsp"));
    requestformats.add(matcher.group("requestformat"));
    // "responsecode" is null when the optional Completed part did not match.
    responsecodes.add(matcher.group("responsecode"));
    System.out.println("-----------------------");
    System.out.println(matcher.group("requesttype"));
    System.out.println(matcher.group("url"));
    System.out.println(matcher.group("ip"));
    System.out.println(matcher.group("tsp"));
    System.out.println(matcher.group("requestformat"));
    System.out.println(matcher.group("responsecode"));
}
See the IDEONE demo. You can also print the lists once the matching is done, e.g. with System.out.println(urls):
System.out.println(requesttypes);
System.out.println(urls);
System.out.println(ips);
System.out.println(timestamps);
System.out.println(requestformats);
System.out.println(responsecodes);
See this demo. The output is:
[GET, POST]
[/google.com/2, /google.com/543]
[127.0.0.1, 127.0.1.1]
[Tue Dec 01 12:01:13 +0530 2015, Tue Dec 01 13:13:16 +0530 2015]
[JS, JSON]
[200 OK, 404 Not Authorized]
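If a single combined pattern feels hard to maintain, the same lists can also be built with one small pattern per field, which is closer to what the question attempted. The following is only a sketch under assumptions: the class name PerFieldExtractor, the helper all, and the pattern constants are hypothetical names, and the patterns are tuned to the sample log only. Note that independent per-field lists can drift out of step if a request has no Completed line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerFieldExtractor {
    // Hypothetical per-field patterns; capture group 1 is the wanted value.
    static final String REQUEST_TYPE = "^Started (\\w+)";
    static final String URL = "^Started \\w+ \"([^\"]+)\"";
    static final String IP = " for (\\d{1,3}(?:\\.\\d{1,3}){3}) at ";
    static final String TIMESTAMP = "^Started .* at (.+)$";
    static final String REQUEST_FORMAT = "^\\s*Processing by \\S+ as (\\S+)";
    static final String RESPONSE_CODE = "^Completed (\\d+.*?) in \\d";

    // Collects group 1 of every match; MULTILINE makes ^ and $ work per line.
    static List<String> all(String regex, String log) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(log);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String log =
            "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
            + "Processing by MyController#method as JS\n"
            + "Parameters: {\"abc\" => \"xyz\"}\n"
            + "[LOG] 3 : User text log\n"
            + "Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n"
            + "Started POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n"
            + "Processing by MyController#method_2 as JSON\n"
            + "Parameters: {\"efg\" => \"uvw\"}\n"
            + "Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)\n";
        System.out.println(all(REQUEST_TYPE, log));
        System.out.println(all(URL, log));
        System.out.println(all(IP, log));
        System.out.println(all(TIMESTAMP, log));
        System.out.println(all(REQUEST_FORMAT, log));
        System.out.println(all(RESPONSE_CODE, log));
    }
}
```

The combined pattern keeps the fields of one request together in a single match, so it remains the safer choice when log entries can be incomplete.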
The regex matches:
(?sm)^ - start of a line (due to ^ and the ?m option; ?s additionally lets . match line breaks)
Started\\s+ - the literal string Started and 1+ whitespaces
(?<requesttype>\\S+) - Group "requesttype" holding 1+ non-whitespace chars
\\s+\" - 1+ whitespace followed by "
(?<url>\\S+) - Group "url" holding 1+ non-whitespace chars
\"\\s+for\\s+ - " followed by 1+ whitespace + for + 1+ whitespace
(?<ip>\\d+(?:\\.\\d+)+) - Group "ip" containing digits followed by . + digits, 1+ times
\\s+at\\s+ - the word at surrounded by whitespace
(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}) - Group "tsp" (timestamp) holding letters and digits separated by whitespace, in the order seen in the input examples
\\s+ - 1+ whitespace
(?:Processing\\s+by\\s+\\S+)\\s+as\\s+ - Processing by followed by some word (1+ non-whitespace chars), followed by the word as surrounded by whitespace
(?<requestformat>\\S+) - Group "requestformat" that consists of non-whitespace symbols
(?:\\s+Parameters:\\s+\\S+)? - optional group: Parameters: followed by whitespace(s) and some word
(?:(?:(?:(?!\nStarted ).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))? - an optional group (since it is enclosed in (?:...)?) that matches any characters up to Completed without crossing into a line that begins with Started (due to the tempered greedy token (?:(?!\nStarted ).)*), then matches Completed followed by a whitespace; (?<responsecode>\\d+(?:(?!\\sin\\s).)*) then captures into Group "responsecode" the digits followed by any characters up to the whole word in surrounded by whitespace.
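To see what the tempered greedy token (?:(?!\nStarted ).)* buys, here is a minimal sketch that compares it with a plain lazy .*? on a log where the first request never reaches a Completed line. The class name TemperedTokenDemo, the method pairs, and the cut-down patterns are hypothetical and exist only for this illustration; they capture just the request type and the response code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedTokenDemo {
    // Cut-down version of the answer's pattern: the tempered token refuses to
    // step over a line break that is followed by "Started ".
    static final Pattern TEMPERED = Pattern.compile(
        "(?sm)^Started\\s+(?<requesttype>\\S+)"
        + "(?:(?:(?!\\nStarted ).)*Completed\\s(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?");

    // Same shape, but with an unguarded lazy dot that may run into the next request.
    static final Pattern NAIVE = Pattern.compile(
        "(?sm)^Started\\s+(?<requesttype>\\S+)"
        + "(?:.*?Completed\\s(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?");

    // Returns "TYPE -> CODE" for every match; CODE is null if the optional part did not match.
    static List<String> pairs(Pattern pattern, String log) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(log);
        while (m.find()) {
            out.add(m.group("requesttype") + " -> " + m.group("responsecode"));
        }
        return out;
    }

    public static void main(String[] args) {
        // The GET request has no Completed line; only the POST request does.
        String log = "Started GET \"/a\" for 127.0.0.1\n"
                   + "[LOG] request aborted\n"
                   + "Started POST \"/b\" for 127.0.0.1\n"
                   + "Completed 404 Not Authorized in 65ms\n";
        System.out.println(pairs(TEMPERED, log)); // [GET -> null, POST -> 404 Not Authorized]
        System.out.println(pairs(NAIVE, log));    // [GET -> 404 Not Authorized]
    }
}
```

With the unguarded variant the GET request borrows the response code of the POST request, and the POST request is never reported at all, which is the failure the tempered token is there to prevent.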