Aaron

Reputation: 165

Parse a custom log file into a dictionary using regex

Sample log file

Jun 15 02:04:59 combo sshd(pam_unix)[20897]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root\n'
Jun 15 02:04:59 combo sshd(pam_unix)[20898]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root\n'
Jun 15 04:06:18 combo su(pam_unix)[21416]: session opened for user cyrus by (uid=0)\n'
Jun 15 04:06:19 combo su(pam_unix)[21416]: session closed for user cyrus\n'
Jun 15 04:06:20 combo logrotate: ALERT exited abnormally with [1]\n'
Jun 15 04:12:42 combo su(pam_unix)[22644]: session opened for user news by (uid=0)\n'
Jun 15 04:12:43 combo su(pam_unix)[22644]: session closed for user news\n'

I want to split each line into four fields: Date, Time, PID and Message.

Sample output would be

Dict = {"Date": "Jun 15", "Time": "02:04:59", "PID": "20897", "Message": "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root\n'"}

After which I intend to save this info into a CSV file based on the columns

I've tried looking at other examples such as:

Parse a custom log file in python

How to parse this custom log file in Python

but I do not know how to create capture groups to help me achieve this.

The regexes I currently have are:

"(\w{3} \d{2})" for the Date

"(\d{2}:\d{2}:\d{2})" for time

"(?<=\[).+?(?=\]:)" for PID

"((?<=:).*)" for Message

but nothing happens when I combine them.

Upvotes: 2

Views: 436

Answers (3)

Elias Dabbas

Reputation: 46

You might want to check out the logs_to_df function from advertools. It parses any log format and compresses the resulting file using the parquet format.

There are default formats supported, but if you have a custom format, you only need to provide a regex, and field names:


import advertools as adv
import pandas as pd

adv.logs_to_df(log_file='log_file.log',
               output_file='log_file.parquet',
               errors_file='log_file.txt',
               log_format=r'([A-Z][a-z]{2} \d\d \d\d:\d\d:\d\d) combo ([a-z]+\([a-z_]+?\))\[(\d+)\]: (.*)',
               fields=['datetime', 'program', 'pid', 'message'])

log_df = pd.read_parquet('log_file.parquet')
log_df
   datetime         program         pid    message
0  Jun 15 02:04:59  sshd(pam_unix)  20897  authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n'
1  Jun 15 02:04:59  sshd(pam_unix)  20898  authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n'
2  Jun 15 04:06:18  su(pam_unix)    21416  session opened for user cyrus by (uid=0)\n'
3  Jun 15 04:06:19  su(pam_unix)    21416  session closed for user cyrus\n'
4  Jun 15 04:12:42  su(pam_unix)    22644  session opened for user news by (uid=0)\n'
5  Jun 15 04:12:43  su(pam_unix)    22644  session closed for user news\n'

Note that the line containing "logrotate" wasn't included in the output file, but would be included in the errors_file, so you can further parse those, or check if there are actually issues with them.
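To see why the "logrotate" line is rejected, the same pattern can be tried with Python's re module directly. This is only a minimal sketch that checks the regex on two of the sample lines (with the trailing \n' debris left out); the names LOG_FORMAT, matched and unmatched are made up for illustration, and it makes no claim about how advertools applies the pattern internally.

```python
import re

# Same pattern as passed to log_format above
LOG_FORMAT = r'([A-Z][a-z]{2} \d\d \d\d:\d\d:\d\d) combo ([a-z]+\([a-z_]+?\))\[(\d+)\]: (.*)'

lines = [
    "Jun 15 04:06:19 combo su(pam_unix)[21416]: session closed for user cyrus",
    "Jun 15 04:06:20 combo logrotate: ALERT exited abnormally with [1]",
]

matched, unmatched = [], []
for line in lines:
    m = re.fullmatch(LOG_FORMAT, line)
    if m:
        # One tuple per line: (datetime, program, pid, message)
        matched.append(m.groups())
    else:
        # No "(...)" after the program name and no "[pid]", so the pattern fails
        unmatched.append(line)

print(matched)
print(unmatched)
```

The logrotate line fails because the pattern requires both a parenthesised part after the program name and a bracketed PID.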

Upvotes: 1

Alexandre B.

Reputation: 5502

A solution is to iterate over each row. For each row, select the Date, Time, PID and Message using a specific regex.

If they are found, return the value. Else, return None.

Here is the code:

# Import module
import re

# Output list
out = []
# Read file
with open("data.txt", "r") as f:
    # Iterate over the file line by line
    for line in f:
        # Select the different fields
        date = re.search(r'^(\w{3}\s\d{2})', line)
        time = re.search(r'(\d{2}:\d{2}:\d{2})', line)
        PID = re.search(r'\[([0-9]+)\]:', line)
        message = re.search(r":\s(.*?)$", line)
        # Append them to the output using a dict
        # If a field isn't found, None is stored
        out.append({
            "Date": date.group(1) if date else None,
            "Time": time.group(1) if time else None,
            "PID": PID.group(1) if PID else None,
            "Message": message.group(1) if message else None
        })

output:

# [
#     {'Date': 'Jun 15', 'Time': '02:04:59', 'PID': '20897', 'Message': "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root\\n'"},
#     {'Date': 'Jun 15', 'Time': '02:04:59', 'PID': '20898', 'Message': "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root\\n'"},
#     {'Date': 'Jun 15', 'Time': '04:06:18', 'PID': '21416', 'Message': "session opened for user cyrus by (uid=0)\\n'"},
#     {'Date': 'Jun 15', 'Time': '04:06:19', 'PID': '21416', 'Message': "session closed for user cyrus\\n'"},
#     {'Date': 'Jun 15', 'Time': '04:06:20', 'PID': None, 'Message': "ALERT exited abnormally with [1]\\n'"},
#     {'Date': 'Jun 15', 'Time': '04:12:42', 'PID': '22644', 'Message': "session opened for user news by (uid=0)\\n'"},
#     {'Date': 'Jun 15', 'Time': '04:12:43', 'PID': '22644', 'Message': 'session closed for user news\\n'}
# ]
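As an alternative to four separate searches, the fields can probably also be captured with one pattern that uses named groups, and the resulting dicts written straight to CSV, which is what the question intends as the next step. The following is a rough sketch, not a tested solution for every syslog variant: the names LINE_RE, parse_line and write_csv are invented here, the sample lines leave out the trailing \n' debris, and an in-memory buffer stands in for the real CSV file.

```python
import csv
import io
import re

# One pattern with named groups. The "[PID]" part is optional so that
# lines such as the logrotate entry still match (PID is then None).
LINE_RE = re.compile(
    r"^(?P<Date>\w{3}\s+\d{1,2})\s+"
    r"(?P<Time>\d{2}:\d{2}:\d{2})\s+"
    r"\S+\s+"                  # host name, e.g. "combo"
    r"[^\[:]+"                 # program name, e.g. "sshd(pam_unix)"
    r"(?:\[(?P<PID>\d+)\])?"   # optional "[20897]"
    r":\s(?P<Message>.*)$"
)

def parse_line(line):
    """Return a dict with Date, Time, PID and Message, or None if no match."""
    m = LINE_RE.match(line.rstrip("\n"))
    return m.groupdict() if m else None

def write_csv(rows, handle):
    """Write the parsed rows to an open text handle, one column per field."""
    writer = csv.DictWriter(handle, fieldnames=["Date", "Time", "PID", "Message"])
    writer.writeheader()
    writer.writerows(rows)

sample = [
    "Jun 15 04:06:18 combo su(pam_unix)[21416]: session opened for user cyrus by (uid=0)",
    "Jun 15 04:06:20 combo logrotate: ALERT exited abnormally with [1]",
]
rows = [row for row in map(parse_line, sample) if row is not None]

# io.StringIO stands in for open("out.csv", "w", newline="")
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

csv.DictWriter writes a None value as an empty cell, so lines without a PID still produce a complete row.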

Hope that helps!

Upvotes: 1

justahuman

Reputation: 637

What do you mean by "combine them together"? Have you tried doing it in a for loop? That's probably the way I would go about it. It sounds like you are trying to capture all the groups by passing them to re.findall (I'm guessing). But findall returns every match of a single pattern, so it is not suited to running several independent patterns at once. Instead, put your regexes in a list, iterate over it, and apply each one to the line with re.search. The regexes you have are correct (though for the date, I would capture the first two words of each line).
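The loop idea could look roughly like the sketch below. The patterns are illustrative and slightly adapted from the ones in the question (the Message pattern is anchored on the first colon followed by whitespace, so it does not start inside the time); PATTERNS and extract_fields are names made up for this example.

```python
import re

# One pattern per field, each with a single capture group
PATTERNS = {
    "Date": r"^(\w{3} \d{2})",
    "Time": r"(\d{2}:\d{2}:\d{2})",
    "PID": r"\[(\d+)\]:",
    "Message": r":\s(.*)$",
}

def extract_fields(line):
    """Apply each pattern in turn; a field that is not found becomes None."""
    record = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, line)
        record[field] = m.group(1) if m else None
    return record

print(extract_fields(
    "Jun 15 04:12:42 combo su(pam_unix)[22644]: session opened for user news by (uid=0)"
))
```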

Upvotes: 0
