GIS-Jonathan

Reputation: 4647

Handling web-log files as a CSV with Python

I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple. I have no control over the log file structures, and the formats vary.

The delimiter is a space (` `). The problem is that some log formats place a space inside the timestamp, as in Logfile 2 below, so the CSV reader reads the date/time stamp as two fields.

Logfile 1

73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"

Logfile 2

13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"

The log files typically have the timestamp within square brackets, but I cannot find a way of handling them as "quotes". On top of that, square brackets are not always used as quotes within the logs either (see the [2] later in the logs).

I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.
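To illustrate, a minimal reproduction (assuming a plain space delimiter and the default `"` quote character) shows the Logfile 2 timestamp being split into two fields while the quoted request stays whole:

```python
import csv
import io

# One line in the Logfile 2 format from above.
sample = '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"\n'

# Space-delimited; csv.reader honours the double quotes but not the brackets.
row = next(csv.reader(io.StringIO(sample), delimiter=' '))
print(row)
# ['13', '58', '224', '[22/Jul/2016:06:51:06', '+0000]', '2[2]', 'GET /test HTTP/1.1']
```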

How can I handle this situation automatically?

Upvotes: 2

Views: 812

Answers (2)

knowingpark

Reputation: 759

Brute force! When a row has seven fields, I concatenate the two time fields so the timestamp can be parsed later if needed.

data.csv:

13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"


import csv

with open("data.csv") as f:
    c = csv.reader(f, delimiter=" ")
    for row in c:
        if len(row) == 7:
            # Timestamp was split in two: rejoin fields 3 and 4.
            r = row[:3] + [row[3] + row[4]] + row[5:]
            print(r)
        else:
            print(row)

['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']
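Once the two fields are rejoined, the timestamp really can be interpreted; a minimal sketch, assuming a `strptime` format string deduced from the sample logs:

```python
from datetime import datetime

# Rejoined timestamp field as produced by the loop above.
ts = '[22/Jul/2016:06:51:06+0000]'

# The brackets are matched literally; %z consumes the +0000 offset.
dt = datetime.strptime(ts, '[%d/%b/%Y:%H:%M:%S%z]')
print(dt.isoformat())
# '2016-07-22T06:51:06+00:00'
```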

Upvotes: 0

SerialDev

Reputation: 2847

This will do; you need to use a regex in place of `sep`. For example, this will parse Nginx log files into a `pandas.DataFrame`:

import pandas as pd

df = pd.read_csv(log_file,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python',
                 usecols=[0, 3, 4, 5, 6, 7, 8],
                 names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
                 na_values='-',
                 header=None)
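That separator regex splits only on spaces that are outside double quotes and outside square brackets, which is exactly what the question needs. A minimal sketch applying it to one of the question's log lines (pandas not required to see the effect):

```python
import re

# Split on whitespace not inside "..." (even number of quotes ahead)
# and not inside [...] (a ] is not reachable without crossing a [).
sep = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

line = '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"'
fields = re.split(sep, line)
print(fields)
# ['13', '58', '224', '[22/Jul/2016:06:51:06 +0000]', '2[2]', '"GET /test HTTP/1.1"']
```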

Edit:

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'

import re
print(re.match(regex, line).groups())

The output would be a tuple with 6 pieces of information

('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
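The same idea can be adapted to the question's own log format and fed into a namedtuple, which is what the question set out to do. A hypothetical sketch: the pattern and the field names below are assumptions based on the sample lines, not anything defined in the logs themselves:

```python
import re
from collections import namedtuple

# Field names are guesses for illustration; rename to suit the real format.
LogEntry = namedtuple('LogEntry', 'a b c timestamp d request')

# Three numbers, a bracketed timestamp (with or without a space),
# one token such as 2[2], then the quoted request.
pattern = re.compile(r'(\d+) (\d+) (\d+) \[(.*?)\] (\S+) "(.*?)"')

line = '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"'
entry = LogEntry(*pattern.match(line).groups())
print(entry.timestamp)
# '22/Jul/2016:06:51:06 +0000'
```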

Upvotes: 1
