Reputation: 4647
I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple. I have no control over the log file structures, and there are varying types.
The delimiter is a space ( ), the problem is that some log file formats place a space in the timestamp, as Logfile 2 below. The CSV reader then reads the date/time stamp as two fields.
Logfile 1
73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
Logfile 2
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"
The log files typically have the timestamp within square quotes, but I cannot find a way of handling them as "quotes". On top of that, square brackets are not always used as quotes within the logs either (see the [2] later in the logs).
I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.
How can I handle this situation automatically?
Upvotes: 2
Views: 812
Reputation: 759
Bruteforce! I concatenate the 2 time fields so the timestamp can be interpreted if needed.
data.csv =
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
import csv
with open("data.csv") as f:
c = csv.reader(f, delimiter=" ")
for row in c:
if len(row) == 7:
r = row[:3] + [row[3] + row[4]] + row[5:]
print(r)
else:
print(row)
['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']
Upvotes: 0
Reputation: 2847
This will do, you need to use a regex in place of sep.
This for example will parse NGinx log files into a pandas.Dataframe
:
import pandas as pd
df = pd.read_csv(log_file,
sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
engine='python',
usecols=[0, 3, 4, 5, 6, 7, 8],
names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
na_values='-',
header=None
)
Edit :
line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
import re
print re.match(regex, line).groups()
The output would be a tuple with 6 pieces of information
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
Upvotes: 1