Reputation: 11659
Say I have a log file that looks like this:
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: BEGIN;
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: SET datestyle TO ISO;
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: SET TRANSACTION READ ONLY;
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: SET STATEMENT_TIMEOUT TO 300000;
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: /* hash: f71f47211eca32d63469fba576bbbb19 */
SELECT TRIM(application_name) AS application_name
, MAX(recordtime) AS last_used
FROM stl_connection_log
WHERE dbname <> 'dev'
AND username <> 'rdsdb'
AND ( application_name LIKE 'RedshiftUserLastLogin-v%'
OR application_name LIKE 'RedshiftSystemTablePersistence-v%'
OR application_name LIKE 'AnalyzeVacuumUtility-v%'
OR application_name LIKE 'ColumnEncodingUtility-v%' )
GROUP BY application_name
LIMIT 50
;
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: SELECT btrim( pg_catalog.stll_connection_log.application_name ) AS application_name, MAX(pg_catalog.stll_connection_log.recordtime) AS last_used FROM pg_catalog.stll_connection_log WHERE pg_catalog.stll_connection_log.dbname <> 'dev'::Char(3) AND pg_catalog.stll_connection_log.username <> 'rdsdb'::Char(5) AND (pg_catalog.stll_connection_log.application_name LIKE 'AnalyzeVacuumUtility-v%' OR pg_catalog.stll_connection_log.application_name LIKE 'ColumnEncodingUtility-v%' OR pg_catalog.stll_connection_log.application_name LIKE 'RedshiftSystemTablePersistence-v%' OR pg_cata
I want to read in each "row", where a row begins with the quote + timestamp. Each row begins with this: '2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]'
(let's call this a row delimiter) and should then be broken apart into its appropriate columns (query, pid, user, db, etc.). What is the easiest way to do this?
The issue is that the row delimiter does not appear at the start of every line. As you can see, there is a "row" whose query spans multiple lines, so when reading the text file in Python I'm worried there will be several lines without a delimiter. Does this mean that, as I read lines from the file, I need to check whether each line begins with a row delimiter and, if not, keep appending lines in memory until the next delimiter is reached? Something like the sketch below is what I have in mind.
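(a rough sketch of that buffer-until-delimiter idea; the file name and the exact delimiter pattern are just placeholders)
import re

# buffer lines until the next row delimiter is seen, then flush
delim = re.compile(r"^'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC \[ .*? \]'")

rows = []
with open('logfile.txt') as f:          # file name is just an example
    current = []
    for line in f:
        if delim.match(line) and current:
            rows.append(' '.join(current))   # flush the previous row
            current = []
        current.append(line.strip())
    if current:
        rows.append(' '.join(current))       # flush the final row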
Ideally, the row:
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: /* hash: f71f47211eca32d63469fba576bbbb19 */
SELECT TRIM(application_name) AS application_name
, MAX(recordtime) AS last_used
FROM stl_connection_log
WHERE dbname <> 'dev'
AND username <> 'rdsdb'
AND ( application_name LIKE 'RedshiftUserLastLogin-v%'
OR application_name LIKE 'RedshiftSystemTablePersistence-v%'
OR application_name LIKE 'AnalyzeVacuumUtility-v%'
OR application_name LIKE 'ColumnEncodingUtility-v%' )
GROUP BY application_name
LIMIT 50
;
is broken up into a CSV row where:
timestamp = 2021-05-18T14:01:13Z UTC
db = dev
user = rdsdb
pid = 11593
userid = 1
xid = 19771457
query = `SELECT TRIM(application_name) AS application_name, MAX(recordtime) AS last_used FROM stl_connection_log WHERE dbname <> 'dev' AND username <> 'rdsdb' AND (application_name LIKE 'RedshiftUserLastLogin-v%' OR application_name LIKE 'RedshiftSystemTablePersistence-v%' OR application_name LIKE 'AnalyzeVacuumUtility-v%' OR application_name LIKE 'ColumnEncodingUtility-v%' ) GROUP BY application_name LIMIT 50;`
and the row:
'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: BEGIN;
is broken up into:
timestamp = 2021-05-18T14:01:13Z UTC
db = dev
user = rdsdb
pid = 11593
userid = 1
xid = 19771457
query = `LOG: BEGIN;`
Upvotes: 0
Views: 179
Reputation: 46779
You could use a regular expression to spot lines which are row delimiters, and then use groupby() to read groups of either delimiter lines or query lines:
from itertools import groupby
import re
import csv

row_delim = re.compile(r"'(.*? UTC) \[ db=(.*?) user=(.*?) pid=(.*?) userid=(.*?) xid=(.*?) \]'")

with open('logfile.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["timestamp", "db", "user", "pid", "userid", "xid", "query"])
    for re_row_delim, lines in groupby(f_input, lambda x: row_delim.match(x)):
        if re_row_delim:
            # a delimiter line: remember its captured fields for the query that follows
            last_delim = re_row_delim.groups()
        else:
            # continuation lines: join them into a single query string
            query = [line.strip() for line in lines if line.strip()]
            row = [*last_delim, ' '.join(query)]
            csv_output.writerow(row)
This assumes that the ordering of the parameters is constant.
For your given test file, groupby() first returns a truthy re_row_delim holding the match object produced by the lambda function. Note that match objects only compare equal to themselves, so each delimiter line actually forms its own one-line group; the code simply overwrites last_delim each time and ignores lines.
When it reaches lines which do not match, i.e. the query text, re_row_delim is None because the match failed, and since consecutive None keys compare equal, lines holds all of the query lines as a single group.
So the loop effectively alternates between delimiter lines and blocks of query lines.
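To see that grouping behaviour in isolation, here is a small stand-alone demonstration (the sample lines below are shortened, made-up examples):
import re
from itertools import groupby

row_delim = re.compile(r"'(.*? UTC) \[ .*? \]'")

sample = [
    "'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: /* hash */\n",
    "SELECT 1\n",
    ";\n",
]

for key, group in groupby(sample, lambda x: row_delim.match(x)):
    print(bool(key), [line.strip() for line in group])
# prints one True group for the delimiter line, then a single False
# group containing both of the query lines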
Giving you a CSV file as follows:
timestamp,db,user,pid,userid,xid,query
2021-05-18T14:01:13Z UTC,dev,rdsdb,11593,1,19771457,"SELECT TRIM(application_name) AS application_name , MAX(recordtime) AS last_used FROM stl_connection_log WHERE dbname <> 'dev' AND username <> 'rdsdb' AND ( application_name LIKE 'RedshiftUserLastLogin-v%' OR application_name LIKE 'RedshiftSystemTablePersistence-v%' OR application_name LIKE 'AnalyzeVacuumUtility-v%' OR application_name LIKE 'ColumnEncodingUtility-v%' ) GROUP BY application_name LIMIT 50 ;"
If you need to also keep the text after LOG: on the timestamp lines, then a different approach is needed. The regular expression can capture that text, and a query_start list can hold the text from each such line:
from itertools import groupby
import re
import csv

row_delim = re.compile(r"'(.*? UTC) \[ db=(.*?) user=(.*?) pid=(.*?) userid=(.*?) xid=(.*?) \]' LOG: (.*)")
query_start = []    # holds the text after LOG: on delimiter lines

with open('logfile.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["timestamp", "db", "user", "pid", "userid", "xid", "query"])
    for re_row_delim, lines in groupby(f_input, lambda x: row_delim.match(x)):
        if re_row_delim:
            last_delim = re_row_delim.groups()[:6]
            # group 7 is everything after 'LOG: '
            query_start.extend([row_delim.match(line).group(7) for line in lines])
        else:
            query = [line.strip() for line in [*query_start, *lines] if line.strip()]
            row = [*last_delim, ' '.join(query)]
            csv_output.writerow(row)
            query_start = []
    if query_start:
        # flush a final row if the file ends with delimiter lines
        csv_output.writerow([*last_delim, ' '.join(q.strip() for q in query_start if q.strip())])
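As a quick stand-alone check of what the extended regular expression captures (this snippet is illustrative, not part of the script above):
import re

row_delim = re.compile(r"'(.*? UTC) \[ db=(.*?) user=(.*?) pid=(.*?) userid=(.*?) xid=(.*?) \]' LOG: (.*)")
line = "'2021-05-18T14:01:13Z UTC [ db=dev user=rdsdb pid=11593 userid=1 xid=19771457 ]' LOG: BEGIN;"
m = row_delim.match(line)
print(m.groups()[:6])  # ('2021-05-18T14:01:13Z UTC', 'dev', 'rdsdb', '11593', '1', '19771457')
print(m.group(7))      # BEGIN;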
Upvotes: 1