Reputation: 456
which is the best way to parse a log file like this using R?
- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
I must consider border cases like having 2 IP's in a single line (internal and external).
Thanks!
Upvotes: 1
Views: 522
Reputation: 263451
For this example it suffices to replace the leading dashes with two NA's and the commas with spaces. You can then parse with read.table()
datlog <- readLines(textConnection('- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769'))
datlog <- gsub("^-", "NA NA", datlog)
datlog <- sub("\\,", " ", datlog)
datlog<-read.table(text=datlog, fill=TRUE)
datlog
Spacedman asked about datetime parsing:
datlog[['dtime']] <- as.POSIXct( paste( sub("\\[", "", datlog[[5]]),
sub("\\]", "", datlog[[6]]) ),
format="%d/%b/%Y:%H:%M:%S %z")
Upvotes: 3