Kade Williams

Reputation: 1181

Get fourth to last line where a string occurs in a file

I am currently searching through a log file that contains IP addresses.
Log example:

10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

I can currently grab the IP address from the last line of the log. I can also search for all line numbers that have the same IP address.

If the last IP address in the log is listed 3 or more times in the log, how can I get the line number for the 3rd to last occurrence of that IP address?

For example, I want to get the line number for this line:

10.1.177.198 Tue Jun 19 09:26:38 CDT 2018

Or better yet, just print the entire line.

Here is an example of my code:

import re

def run():

    try:
        logfile = open('read.log', 'r')

        for line in logfile:  
            x1 = line.split()[0]
            for num, line in enumerate(logfile, 0):
                if x1 in line:
                    print("Found " + x1 + " at line:", num)

        print ('Last Line: ' + x1)

        logfile.close
    except OSError as e:
        print (e)

run()

I am listing all the line numbers where the specific IP address occurs.

print("Found " + x1 + " at line:", num)

I want to print the line whose number is the 3rd to last entry in that list of line numbers.

My overall goal is to grab the IP address from the last line in the log file, then check whether it has previously been listed more than 3 times. If it has, I want to find the 3rd to last listing of the address and get the line number (or just print the address and date listed on that line).

Upvotes: 0

Views: 90

Answers (3)

Venkata Gogu

Reputation: 1051

Track all the occurrences of each IP address and print the third one from the last. This can be optimized by using heapq.

from collections import defaultdict

def run():
    try:
        # Map each IP address to its (line number, time) pairs, in file order.
        ip_address_line_number = defaultdict(list)
        x1 = None
        with open('log.txt', 'r') as logfile:  # the file is closed automatically
            for index, line in enumerate(logfile, 1):
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                x1 = fields[0]
                log_time = fields[4]
                ip_address_line_number[x1].append((index, log_time))

        # After the loop, x1 holds the IP address from the last line.
        if x1 is None:
            print('log file is empty')
        elif len(ip_address_line_number[x1]) > 2:
            print('3rd to last occurrence: ' + str(ip_address_line_number[x1][-3]))
        else:
            print(x1 + ' has 0-2 occurrences')
    except OSError as e:
        print(e)

run()
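
The memory optimization hinted at above could look something like the following. This is only a sketch: it uses collections.deque with maxlen rather than heapq, which is a swapped-in choice and not something the answer specifies, and the function name nth_last_occurrence and its parameters are made up for illustration. It keeps at most n entries per IP address instead of every occurrence, and it treats "at least n occurrences" as enough.

```python
from collections import defaultdict, deque


def nth_last_occurrence(lines, n=3):
    """Return (line_number, line) for the n-th to last occurrence of the
    IP address found on the last non-blank line, or None if that address
    occurs fewer than n times.

    `lines` is any iterable of log lines, e.g. an open file object.
    The name and signature are illustrative assumptions.
    """
    # For every IP address keep only its n most recent occurrences.
    recent = defaultdict(lambda: deque(maxlen=n))
    last_ip = None
    for number, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        last_ip = fields[0]
        recent[last_ip].append((number, line.rstrip("\n")))

    if last_ip is None or len(recent[last_ip]) < n:
        return None
    # The oldest entry still in the deque is the n-th to last occurrence.
    return recent[last_ip][0]
```

Passing an open file object (for example the result of open('log.txt')) as lines reads the log one line at a time, so memory use stays bounded by n entries per distinct IP address.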

Upvotes: 1

SpghttCd

Reputation: 10860

Using pandas this would be quite short:

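import pandas as pd

# Fixed-width parse: the first 12 characters are the IP, the rest is the timestamp.
# This assumes every IP address in the log is exactly 12 characters wide.
df = pd.read_fwf('read.log', colspecs=[(None, 12), (13, None)], header=None, names=['IP', 'time'])

lastIP = df.IP.iloc[-1]                        # IP address on the last line
lastIP_idx = df.groupby('IP').groups[lastIP]   # row labels of every line with that IP

n = 3
if len(lastIP_idx) >= n:
    # n-th to last occurrence, printed as "IP<TAB>time"
    print('\t'.join(list( df.loc[lastIP_idx[-n]] )))
else:
    print('occurrence count of ' + lastIP + ' < ' + str(n))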

Upvotes: 0

pylang

Reputation: 44525

Another way to look at this, if the file were read in reverse:

  • What is the line data for the third observation of the first ip?
  • In the file, there must be at least 3+1 observations of the first ip.

There are many tools that can offer even simpler code, but here is one flexible, general approach geared for memory efficiency. Roughly, let's:

  1. read the file backwards
  2. count up to 3+1 observations
  3. return the last observation

Given

A file test.log

# test.log 
10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

and code for a reverse_readline() generator, we can write the following:

Code

def run(filename, target=3, min_=3):
    """Return the line number and data of the `target`-last observation.

    Parameters
    ----------
    filename : str or Path
        Filepath or name to file.
    target : int
        Position of the wanted observation counted from the bottom,
        e.g. 3 for the "third to last observation."
    min_ : int
        Total observations must exceed this number.

    """
    idx, prior, data = 0, (), []
    total_obs, i = 0, -1
    for i, line in enumerate(reverse_readline(filename)):
        ip, text = line.strip().split(maxsplit=1)
        if i == 0:
            target_ip = ip                                   # ip of the file's last line
        if ip == target_ip:
            total_obs += 1                                   # count every observation
            target -= 1
            prior = i, ip, text
            if target == 0:
                idx, *data = prior                           # keep the target observation

    # Edge case: too few observations, or the target was never reached
    if total_obs <= min_ or target > 0:
        print(f"Minimum observations was not met.  Got {total_obs} observations.")
        return None

    # Compute line number
    line_num = (i - idx) + 1                               # add 1 (zero-indexed)
    return [line_num] + data

Demo

run("test.log")
# [5, '10.1.177.198', 'Tue Jun 19 09:26:38 CDT 2018']

Second to last observation:

run("test.log", 2)
# [6, '10.1.177.198', 'Tue Jun 19 09:27:16 CDT 2018']

Minimum required observations:

run("test.log", 2, 7)
# Minimum observations was not met.  Got 5 observations.

Add error handling as needed.


Details

Note: an "observation" is a line containing the targeted ip.

  • We iterate the memory efficient reverse_readline() generator.
  • The target_ip is determined from the "first" line of the reversed file.
  • We are only interested in the third observation, so we need not save all information. Thus as we iterate, we only temporarily save one observation at a time to prior (reducing memory consumption).
  • target is a counter that is decremented after each observation. When target reaches 0, that observation is saved (as idx and data) until the generator is exhausted.
  • prior is a tuple containing line data for the most recently seen observation of the target ip address, i.e. index, address and text.
  • The generator is exhausted to determine total_obs and the length of the file; the latter is used to compute line_num.
  • The computed line number and line data are returned.
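
The run() function above depends on a reverse_readline() generator that is not shown here. Below is a minimal sketch of what such a generator could look like, assuming a UTF-8 encoded log with "\n" (or "\r\n") line endings. The buf_size and encoding parameters are assumptions for illustration, not the signature of the referenced implementation.

```python
import os


def reverse_readline(filename, buf_size=8192, encoding="utf-8"):
    """Yield the lines of a file from last to first, without line endings.

    The file is read in fixed-size chunks starting at the end, so only one
    chunk plus one partial line is held in memory at a time.
    """
    with open(filename, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        tail = b""           # start of a line whose beginning is in an earlier chunk
        at_file_end = True
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size) + tail
            if at_file_end:
                # Drop the single newline that terminates the last line.
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]
                at_file_end = False
            lines = chunk.split(b"\n")
            tail = lines[0]  # possibly incomplete; completed by the next chunk
            for line in reversed(lines[1:]):
                yield line.decode(encoding).rstrip("\r")
        if not at_file_end:
            # Whatever is left is the first line of the file.
            yield tail.decode(encoding).rstrip("\r")
```

Splitting on the raw newline byte before decoding keeps multi-byte UTF-8 characters intact even when a chunk boundary falls in the middle of one, because only complete lines are ever decoded.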

Upvotes: 0
