StackGuru
StackGuru

Reputation: 503

How to parse log-file, containing uncertain pattern of data?

For a successful request, log file has been written as below

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

Line beginning with o and has some data (as shown above) and ends with another line beginning with o. This is the pattern.

If there are multiple such requests, log file keeps on appending as shown below

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

There might be some wrong generated data in the file, that can be for any earlier requests or latest requests.

If wrong-data generated is not for latest request, can be ignored.

Example :

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
# Should be indication of request i.e., line beginning with o, followed some data
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

If wrong-data generated is for latest request, it should be caught and highlighted.

Example :

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
# No line present i,e., (o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd)

Missing line can be first line beginning with o or last line beginning with o

I need to check, for every request logs are written in this format and how many successful and un-successful requests are captured in a file ?

Approach 1: can read the file contents and then parse as if line beigns with o etc, which i don't feel feasible

Approach 2 : I feel, reg-ex is optimum and best solution.

Which would be best ? and could you please help me to achieve it ?

Tried so far:

reg_ex1 = "o\s+\d+(\.\d+)?\d+\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+\d+\s+\d+\s+\d+\s+-"
reg_ex2 = "o\s+\d+(\.\d+)?\d+\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+\d+\s+\d+\s+\d+\s+[a-zA-Z0-9_]+"
with open(""some_file.log, 'r') as content_file:
content = content_file.read()
pattern1 = re.compile(reg_ex1)
begin_lines = len(pattern1.findall(content))
pattern2 = re.compile(reg_ex2)
end_lines = len(pattern2.findall(content))
if begin_lines == end_lines:
   print "File has successful requests captured"
else:
   print "File has un-successful requests captured"
   # If wrong-data generated is not for latest request, can be ignored.
   # If wrong-data generated is for latest request, it should be caught and highlighted.

May be not a good idea though, please let me know.

UPD:

o 123456789.000 10.10.10.10 3 30 10 001-
n A-123456===123 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573==123 1452830400 1 926731
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 002-
n A-123456===456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573===456 1452830400 1 926732
o 123456789.000 10.10.10.10 3 30 10 003-
n A-123456===789 1452830400 1 14523
n C-73652 1452830400 1 231543
n B-967845 1452830400 1  374513
n G-809573===789 1452830400 1 926733
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

For the above text, I would want to extract Packet 1 and 3.

Upvotes: 1

Views: 179

Answers (3)

DirtyBit
DirtyBit

Reputation: 16772

To check if the file is good or bad, We'd play with the first and last line of the file, Considering;

  1. The file is bad if the first line does not start with o
  2. The file is bad if the last line does not end with o
  3. The file is good if the first and the last line starts with o

list.txt:

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
# Should be indication of request i.e., line beginning with o, followed some data
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfd

Hence:

logFile = "list.txt"    
with open(logFile) as f:
    content = f.readlines()

# you may also want to remove empty lines
content = [l.strip() for l in content if l.strip()]

for line in content:
    if line.startswith("o"):  # check if the first line starts with o
        if str(content[-1]).strip("[']").split()[0] == 'o': # check if last line starts with o
            print("File is good.")
        else:
            print("File is bad.")
        break
    else:                    # end if the first line does not start with o
        print("File is bad.")
        break

EDIT:

To get all the responses between valid pair of o's:

list.txt:

o 123456789.000 10.10.10.10 3 30 10 001-
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 002-
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 003-
n A-123456 1452830400 1 14523
n C-73652 1452830400 1 231543
n B-967845 1452830400 1  374513
n G-809573 1452830400 1 926733

Hence:

import re
def GetTheResponses(infile):
     with open(infile) as fp:
         red = fp.read()
         for result in re.findall('o (.*?)o ', red, re.S):
             print(result)

GetTheResponses('list.txt')

OUTPUT:

123456789.000 10.10.10.10 3 30 10 001-
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731

123456789.000 10.10.10.10 3 30 10 002-
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732

EDIT 2: (for better readability):

count = 1
for result in re.findall('o (.*?)o ', red, re.S):
    print("Response Packet: {}".format(count))
    print("\n".join(result.split("\n")[1:]))
    count +=1

OUTPUT:

Response Packet: 1
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731

Response Packet: 2
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732

Upvotes: 2

Andrew Kang
Andrew Kang

Reputation: 328

^n\s.+[\n\r]+o\s.+[\n\r]+n\s.+|^n\s.+[\n\r]+n\s.+[\n\r]+n\s.+[\n\r]+n\s.+[\n\r]+(?!o)|^o\s.+[\n\r]+o\s.+[\n\r]+o\s.+

enter image description here

Upvotes: 0

Deepak Sharma
Deepak Sharma

Reputation: 1739

I would definetely recomment Approach1, Why..? in this way, we have flexibilty to read/iterate each line specifically..

with open('file.txt', r) as fp:
  line = fp.readline()
  print(type(line))  #string

  #do anything with line(string)
  #1. split_list= fp.split() -- list of values separated by space
  #2. Check type of each element: 
   # split_list[0].isalpha(), 
   # split_list[0].isalpha(),
   # split_list[0].isdigit(),
   # split_list[0].isspace() like so, and then do required adding to final dict/list..

Make sure to try: catch, every step.. as log files are unpredictable.

Upvotes: 0

Related Questions