How to search a file for the last block of consecutive lines that contain a keyword in Perl

Imagine a text file like the one below, where <some random text> could be anything or nothing; that is, KEYWORD can appear anywhere in a line, alone or along with other text:

 1 <some random text>
 2 <some random text>KEYWORD<some random text>
 3 <some random text>KEYWORD<some random text>
 4 <some random text>
 5 <some random text>
 6 <some random text>KEYWORD<some random text>
 7 <some random text>
 8 <some random text>KEYWORD<some random text>
 9 <some random text>KEYWORD<some random text>
10 <some random text>KEYWORD<some random text>
11 <some random text>
12 <some random text>KEYWORD<some random text>
13 <some random text>KEYWORD<some random text>
14 <some random text>
15 <some random text>KEYWORD<some random text>
16 <some random text>

How can I get the last occurrence of 2 or more consecutive lines that contain the keyword (lines 12 and 13 in the example)? To be clear, I am not interested in lines 8, 9, and 10: although they contain the keyword and are consecutive, they are not the last such group. Nor am I interested in line 15: although it is the last line with the keyword, it is not part of a run of 2 or more consecutive lines.

Upvotes: 3

Views: 161

Answers (3)

zdim

Reputation: 66899

Record such sequences of lines with the pattern as they come, always keeping only the most recent set; once the file has been read through, what you have is the very last set. (Or read backwards if the file is large, per info added in a comment; see the second section below.)

A straightforward way

use warnings;
use strict;
use feature 'say';

die "Usage: $0 file(s)\n"  if not @ARGV;

my $threshold = 2;

my (@buf, $cnt, @res);   # current run, its length, and the last saved run

while (<>) {
    if (not /KEYWORD/) {        # a non-matching line ends the run
        $cnt = 0  if $cnt;
        @buf = () if @buf;
        next;
    }

    ++$cnt;
    push @buf, $_;

    if ($cnt >= $threshold) {   # long enough to qualify: save as the latest
        @res = @buf;  # excessive copying; refine if a problem
    }
}
print for @res;

Remove the @ARGV check to allow input on STDIN, which <> reads when no file names are given.
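
For example, assuming the script is saved as last_block.pl (the name is illustrative), either invocation works once the check is removed:

perl last_block.pl input.txt
cat input.txt | perl last_block.pl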

Notes

  • Matching lines go into a buffer and a counter is incremented; once the counter reaches the threshold (the required number of repeated lines) the buffer holds a qualifying run. On a line without the pattern both are reset

  • Here only two repeated lines are needed, so it would be easier for later handling to save the line in a scalar, but using an array works for any threshold

  • Once the condition is met the buffer is copied. While that needs to be done for the first line that reaches the threshold, in order to overwrite the @res from before, copying the whole array isn't needed for the following repeated lines; those can instead be appended once the threshold has been passed.

    This requires an additional little tap dance; here's one way (minimally tested)

    while (<>) {
        if (not /KEYWORD/) {
            $cnt = 0  if $cnt;
            @buf = () if @buf;
            next;
        }
        ++$cnt;

        if ($cnt < $threshold) {
            push @buf, $_;           # below threshold: keep buffering
        }
        elsif ($cnt == $threshold) {
            @res = (@buf, $_);       # threshold just reached: one buffer copy
        }
        else {
            push @res, $_;           # past threshold: append, no copying
        }
    }
    

    Now the buffer is copied only the first time a line with the pattern brings the count up to the threshold; the following lines are appended without another buffer copy. (If such sequences of lines are very infrequent, or the file is rather small, this will not have a noticeable effect.)

If you need to know where in the file these lines are, save the line number $. along with each line.
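
For instance, here is a minimal sketch of the straightforward version above that keeps [ $., $line ] pairs (the output format is illustrative):

use warnings;
use strict;

my $threshold = 2;
my (@buf, $cnt, @res);

while (<>) {
    if (not /KEYWORD/) { $cnt = 0; @buf = (); next }
    ++$cnt;
    push @buf, [ $., $_ ];            # pair each line with its number
    @res = @buf if $cnt >= $threshold;
}
printf "%4d: %s", @$_ for @res;       # e.g. "  12: <line content>"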

If the file can be large, and this is the only processing to be done on it, we can use the same logic but go backwards, from the end of the file. A module for that is File::ReadBackwards.


To illustrate the performance gain, here is a program that does the same by reading the file backwards

use warnings;
use strict;
use feature 'say';

use File::ReadBackwards;

my (@buf, $cnt, @res);
my $threshold = 2;

my $bw = File::ReadBackwards->new(shift) or die $!;
#print $bw->readline until $bw->eof; exit;  # test

while ( my $line = $bw->readline ) {
    if (not $line =~ /KEYWORD/) {
        # reading backwards, the first completed run is the last in the file
        last if @res >= $threshold;
        $cnt = 0  if $cnt;
        @buf = () if @buf;
        next;
    }
    ++$cnt;

    if ($cnt < $threshold) {
        push @buf, $line;
    }
    elsif ($cnt == $threshold) {
        @res = (@buf, $line);
    }
    else {
        push @res, $line;
    }
}
print for reverse @res;   # lines were collected back to front

This produces the same output as the program that reads from the beginning.

I append the test file 200,000 times, for a file 111 MB in size. The first program (adjusted for performance as in the notes) takes ~1.85 sec on it (averaged over a few runs) while the one above finishes in 0.02 sec.

So the saving is sweet for large enough files, and the small overhead of reading from the back is entirely unseen. However, no other processing can be done along the way, since the front of the file is never seen at all. Also, the target must be seekable (a file), and very few operations are supported; for one, we don't get line numbers.


This is for the whole program, startup and all, measured with the time command on the command line as the program is invoked, and averaged over a few runs.

When I time just the code itself, using Time::HiRes (see the sketch after this list), the runtimes to process the file are

  • in the second program at the fourth (4th) decimal place, for example 0.0003 sec

  • in the first program it's of course still 1.8881 sec or some such
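
For reference, a minimal sketch of that kind of in-program timing with Time::HiRes; the comment stands in for either program's main loop:

use warnings;
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [ gettimeofday ];

# ... the processing loop being timed goes here ...

printf "Processing took %.4f sec\n", tv_interval($t0);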

Upvotes: 5

Polar Bear

Reputation: 6808

If you do not mind reading the file line by line from top to bottom, then the following approach should work.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my @found_array;
my @lookup_array;

while(<DATA>) {
    chomp;
    if ( not /KEYWORD/ ) {
        # a run just ended; keep it if it has 2 or more lines
        @found_array = @lookup_array if scalar @lookup_array >= 2;
        @lookup_array = ();
        next;
    }
    push @lookup_array,$_;
}

# a run that reaches the end of the file must be committed as well
@found_array = @lookup_array if scalar @lookup_array >= 2;

if( scalar @found_array >= 2 ) {
    say Dumper(\@found_array);
} else {
    say "No 2 sequential matching lines with KEYWORD were found";
}

__DATA__
 1 <some random text>
 2 <some random text>KEYWORD<some random text>
 3 <some random text>KEYWORD<some random text>
 4 <some random text>
 5 <some random text>
 6 <some random text>KEYWORD<some random text>
 7 <some random text>
 8 <some random text>KEYWORD<some random text>
 9 <some random text>KEYWORD<some random text>
10 <some random text>KEYWORD<some random text>
11 <some random text>
12 <some random text>KEYWORD<some random text>
13 <some random text>KEYWORD<some random text>
14 <some random text>
15 <some random text>KEYWORD<some random text>
16 <some random text>

Output

$VAR1 = [
          '12 <some random text>KEYWORD<some random text>',
          '13 <some random text>KEYWORD<some random text>'
        ];

Upvotes: 2

ceving

Reputation: 23876

Not the most efficient solution but easy to read:

#! /usr/bin/perl
use strict;
use warnings;

# Slurp the whole input
$_ = do { local $/; <DATA> };

# Split the string on lines that do not contain the keyword.
# (The capture group makes split also return stray fragments of the
# separator lines; the grep below discards them.)
my @blocks = split /^((?!KEYWORD).)*$/m;

# Remove leading newlines
@blocks = map { s/^\n//s; $_ } @blocks;

# Keep only those blocks that contain KEYWORD at least twice.
@blocks = grep (/KEYWORD.*KEYWORD/s, @blocks);

# Take the last.
print pop (@blocks);

__DATA__
 1 <some random text>
 2 <some random text>KEYWORD<some random text>
 3 <some random text>KEYWORD<some random text>
 4 <some random text>
 5 <some random text>
 6 <some random text>KEYWORD<some random text>
 7 <some random text>
 8 <some random text>KEYWORD<some random text>
 9 <some random text>KEYWORD<some random text>
10 <some random text>KEYWORD<some random text>
11 <some random text>
12 <some random text>KEYWORD<some random text>
13 <some random text>KEYWORD<some random text>
14 <some random text>
15 <some random text>KEYWORD<some random text>
16 <some random text>
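
With the data above, the last surviving block is the one holding lines 12 and 13, so the program prints those two lines, matching the other answers.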

Upvotes: 1
