Reputation: 165
Sample log file
Jun 15 02:04:59 combo sshd(pam_unix)[20897]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n'
Jun 15 02:04:59 combo sshd(pam_unix)[20898]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n'
Jun 15 04:06:18 combo su(pam_unix)[21416]: session opened for user cyrus by (uid=0)\n'
Jun 15 04:06:19 combo su(pam_unix)[21416]: session closed for user cyrus\n'
Jun 15 04:06:20 combo logrotate: ALERT exited abnormally with [1]\n'
Jun 15 04:12:42 combo su(pam_unix)[22644]: session opened for user news by (uid=0)\n'
Jun 15 04:12:43 combo su(pam_unix)[22644]: session closed for user news\n'
I want to split the data into 4 columns: Date, Time, PID, and Message.
Sample output would be:
Dict = {"Date": "Jun 15", "Time": "02:04:59", "PID": "20897", "Message": "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n'"}
After that, I intend to save this info into a CSV file based on those columns.
I've tried looking at other examples such as:
Parse a custom log file in python
How to parse this custom log file in Python
but I do not know how to create capture groups to help me achieve this.
The current regexes I have are:
"(\w{3} \d{2})" for the Date
"(\d{2}:\d{2}:\d{2})" for time
"(?<=[).+?(?=]:)" for PID
"((?<=:).*)" for Message
but nothing happens when I combine them together.
Upvotes: 2
Views: 436
Reputation: 46
You might want to check out the logs_to_df function from advertools. It parses any log format and compresses the resulting file using the Parquet format. Several default formats are supported, but if you have a custom format, you only need to provide a regex and field names:
import advertools as adv
import pandas as pd

# Parse the raw log file into a compressed parquet file;
# lines that don't match log_format are sent to errors_file
adv.logs_to_df(log_file='log_file.log',
               output_file='log_file.parquet',
               errors_file='log_file.txt',
               log_format=r'([A-Z][a-z]{2} \d\d \d\d:\d\d:\d\d) combo ([a-z]+\([a-z_]+?\))\[(\d+)\]: (.*)',
               fields=['datetime', 'program', 'pid', 'message'])

# Load the parsed result into a DataFrame
log_df = pd.read_parquet('log_file.parquet')
log_df
|   | datetime | program | pid | message |
|---|---|---|---|---|
| 0 | Jun 15 02:04:59 | sshd(pam_unix) | 20897 | authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n' |
| 1 | Jun 15 02:04:59 | sshd(pam_unix) | 20898 | authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\n' |
| 2 | Jun 15 04:06:18 | su(pam_unix) | 21416 | session opened for user cyrus by (uid=0)\n' |
| 3 | Jun 15 04:06:19 | su(pam_unix) | 21416 | session closed for user cyrus\n' |
| 4 | Jun 15 04:12:42 | su(pam_unix) | 22644 | session opened for user news by (uid=0)\n' |
| 5 | Jun 15 04:12:43 | su(pam_unix) | 22644 | session closed for user news\n' |
Note that the line containing "logrotate" wasn't included in the output file, but would be included in the errors_file, so you can further parse those lines, or check if there are actually issues with them.
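If you still want to end up with a CSV split into the question's Date/Time columns, one possible follow-up with pandas (the column split via str.rsplit and the output file name are just an example, not part of advertools):

# Split the combined datetime back into separate Date and Time columns (example only)
log_df[['Date', 'Time']] = log_df['datetime'].str.rsplit(' ', n=1, expand=True)

# Write the selected columns to a CSV file
log_df[['Date', 'Time', 'pid', 'message']].to_csv('log_file.csv', index=False)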
Upvotes: 1
Reputation: 5502
A solution is to iterate over each line of the file. For each line, extract the Date, Time, PID and Message with a specific regex; if a field is found, keep its value, otherwise store None.
Here is the code:
# Import module
import re

# Output list
out = []

# Read file
with open("data.txt", "r") as f:
    # Iterate over all lines
    for line in f.readlines():
        # Select the different fields
        date = re.search(r'^(\w{3}\s\d{2})', line)
        time = re.search(r'(\d{2}:\d{2}:\d{2})', line)
        PID = re.search(r'\[([0-9]+)\]:', line)
        message = re.search(r":\s(.*?)$", line)
        # Append them to the output using a dict
        # If a field isn't found, None is returned
        out.append({
            "Date": date.group(1) if date else None,
            "Time": time.group(1) if time else None,
            "PID": PID.group(1) if PID else None,
            "Message": message.group(1) if message else None
        })
Output:
# [
# {'Date': 'Jun 15', 'Time': '02:04:59', 'PID': '20897', 'Message': "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\\n'"},
# {'Date': 'Jun 15', 'Time': '02:04:59', 'PID': '20898', 'Message': "authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net user=root\\n'"},
# {'Date': 'Jun 15', 'Time': '04:06:18', 'PID': '21416', 'Message': "session opened for user cyrus by (uid=0)\\n'"},
# {'Date': 'Jun 15', 'Time': '04:06:19', 'PID': '21416', 'Message': "session closed for user cyrus\\n'"},
# {'Date': 'Jun 15', 'Time': '04:06:20', 'PID': None, 'Message': "ALERT exited abnormally with [1]\\n'"},
# {'Date': 'Jun 15', 'Time': '04:12:42', 'PID': '22644', 'Message': "session opened for user news by (uid=0)\\n'"},
# {'Date': 'Jun 15', 'Time': '04:12:43', 'PID': '22644', 'Message': 'session closed for user news\\n'}
# ]
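Since the question mentions writing these columns to a CSV file afterwards, here is a minimal sketch using csv.DictWriter on the out list built above (the file name "parsed.csv" is just an example):

import csv

# Write the list of dicts to a CSV file, with the four columns as the header
with open("parsed.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["Date", "Time", "PID", "Message"])
    writer.writeheader()
    writer.writerows(out)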
Hope that helps!
Upvotes: 1
Reputation: 637
What do you mean by "combine them together"? Have you tried doing it in a for loop? That's probably the way I would go about it. It sounds like you are trying to capture all the groups by passing them to re.findall (I'm guessing), but findall is used to capture multiple instances of a single capture group. Instead, put your regexes in a list, iterate over them, and match each one against the line using re.search, reading the value from the match's group(). The regexes you have are correct (though for the date, I would just capture the first two words of each line).
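For example, a rough sketch of that idea, reusing the patterns from the question on a single (shortened) sample line:

import re

# One (field, pattern) pair per column; patterns adapted from the question
patterns = [
    ("Date", r"^(\w{3} \d{2})"),
    ("Time", r"(\d{2}:\d{2}:\d{2})"),
    ("PID", r"\[(\d+)\]:"),
    ("Message", r"\]: (.*)$"),
]

# Shortened sample line for illustration
line = "Jun 15 02:04:59 combo sshd(pam_unix)[20897]: authentication failure; user=root"

row = {}
for field, pattern in patterns:
    # Keep the first capture group if the pattern matches, otherwise None
    match = re.search(pattern, line)
    row[field] = match.group(1) if match else None

print(row)
# {'Date': 'Jun 15', 'Time': '02:04:59', 'PID': '20897', 'Message': 'authentication failure; user=root'}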
Upvotes: 0