Reputation: 1387
I have to find the most commonly occurring IP addresses in Apache logs.
12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
How do I extract the IP addresses (i.e. the 3rd word in each line) using regular expressions in Java? I also have to find the most common IP addresses among them, in order to detect robotic access. The log contains millions of lines, so a regexp may be suitable for this.
Upvotes: 1
Views: 4448
Reputation: 2551
As others have pointed out, you don't need regexes. You shouldn't use String.split either, since it uses regexes as well. You could use StringTokenizer instead. Assuming you use BufferedReader br to read in each line:
String line = br.readLine();
StringTokenizer st = new StringTokenizer(line, " ");
st.nextToken();
st.nextToken();
String ip = st.nextToken();
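To get from single extracted addresses to the most common ones, the tokens can be fed into a frequency map. The following is only a minimal sketch under assumptions: the class name IpCounter, the method countIps, the abbreviated sample lines and the cutoff of 10 are made up for illustration, and the IP is assumed to always be the third space-separated token, as in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class IpCounter {
    // Counts how often each third token (assumed to be the client IP) occurs.
    public static Map<String, Integer> countIps(BufferedReader br) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, " ");
            if (st.countTokens() < 3) {
                continue; // skip blank or malformed lines
            }
            st.nextToken();
            st.nextToken();
            counts.merge(st.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Abbreviated sample lines (assumption); a real run would wrap a FileReader instead.
        String sample = "12.1.12.1 9000 127.0.0.1 - frank\n"
                + "12.1.12.1 9000 192.145.1.23 - frank\n"
                + "12.1.12.1 9000 127.0.0.1 - frank\n";
        Map<String, Integer> counts = countIps(new BufferedReader(new StringReader(sample)));
        // Print the most frequent addresses first, limited to the top 10.
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

The map holds one entry per distinct address, so memory depends on the number of distinct IPs rather than on the number of log lines.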
Upvotes: 3
Reputation: 13728
The format of the access log file always depends on the configuration file settings. Instead of assuming that the IP address is the third 'word', it would probably be better to read the current configuration file and parse the access log file according to the LogFormat entry.
Apache httpd operates according to httpd.conf, and Tomcat according to server.xml. server.xml is an XML file, which makes parsing the AccessLogValve a standard procedure.
This is a little more work, but it will make your application more flexible, in case it needs to persist. For this approach, I think string methods will be easier to use than regular expressions.
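As a rough illustration of the idea, the position of the remote-host directive can be looked up in the LogFormat string and then used to pick the matching word from each log line. This is only a sketch under assumptions: the format string shown is a guess at what could produce the sample lines (the real one has to come from the configuration file), the class and method names are hypothetical, and a position-based lookup only holds as long as no field before the target contains a space (a %t timestamp, for example, does).

```java
import java.util.Arrays;
import java.util.List;

public class LogFormatField {
    // Returns the zero-based position of a directive (e.g. "%h") in a LogFormat string, or -1.
    public static int fieldIndex(String logFormat, String directive) {
        List<String> directives = Arrays.asList(logFormat.trim().split("\\s+"));
        return directives.indexOf(directive);
    }

    // Returns the space-separated word at the given position, using plain string methods.
    public static String extract(String line, int index) {
        if (index < 0) {
            return null;
        }
        int start = 0;
        for (int i = 0; i < index; i++) {
            int space = line.indexOf(' ', start);
            if (space < 0) {
                return null; // line has fewer words than expected
            }
            start = space + 1;
        }
        int end = line.indexOf(' ', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        // Assumed format; read the actual LogFormat entry from httpd.conf instead.
        String format = "%A %p %h %l %u %t \"%r\" %>s %b";
        String line = "12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(extract(line, fieldIndex(format, "%h")));
    }
}
```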
Upvotes: 0
Reputation: 421280
Here is one solution:
String str1 = "12.1.12.1 9000 127.0.0.1 - frank [10/Oct/2000:13:55:36"
+ " -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
+ "\"http://www.example.com/start.html\" \"Mozilla/4.08 "
+ "[en] (Win98; I ;Nav)\"";
String str2 = "12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55"
+ ":36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
+ "\"http://www.example.com/start.html\" \"Mozilla/4.08 "
+ "[en] (Win98; I ;Nav)\"";
Pattern p = Pattern.compile("\\S+\\s+\\S+\\s+(\\S+).*");
Matcher m = p.matcher(str1);
if (m.matches()) {
    System.out.println(m.group(1));
}
m = p.matcher(str2);
if (m.matches()) {
    System.out.println(m.group(1));
}
Regex breakdown:
\S+, one or more non-whitespace characters.
\s+, one or more whitespace characters.
(\S+), one or more non-whitespace characters, captured in group 1.
Upvotes: 1
Reputation: 2645
If you are certain that it is always the 3rd word (as you said), maybe you don't need regular expressions at all. You could just take the third word via a simple split.
However, a related question has already been asked: Regular expression to match DNS hostname or IP Address?...
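The simple split mentioned above might look like the following sketch. The class and method names are made up for illustration, and the code assumes single spaces between words. Passing a limit of 4 stops the splitting after the third word, so the long remainder of each line is left in one piece.

```java
public class ThirdWord {
    // Returns the third space-separated word of the line, or null if there is none.
    public static String thirdWord(String line) {
        String[] parts = line.split(" ", 4); // at most 4 pieces: three words plus the rest
        return parts.length >= 3 ? parts[2] : null;
    }

    public static void main(String[] args) {
        String line = "12.1.12.1 9000 192.145.1.23 - frank [10/Oct/2000:13:55:36 -0700]";
        System.out.println(thirdWord(line));
    }
}
```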
Upvotes: 3