aerion

Reputation: 722

Lines from first match (pattern 1) to last match (pattern 2)

I'd like to grep/sed a file to get all the lines from the first match of pattern 1 to the last match of pattern 2. Example:

[aaa] text1
[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc
[ddd] text8
[ddd] text9

Pattern 1: bbb
Pattern 2: ccc

Output:

[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc

I was able to retrieve output from the first match of pattern 1 to the first match of pattern 2 using sed -n -e '/bbb/,/ccc/{ p; }' (that range ends at the first ccc, though, so it misses text6.5 and text7).

Edit: I need this solution to be as fast as possible, because it has to work with huge (multi-GB) files.

Upvotes: 1

Views: 112

Answers (6)

ctac_

Reputation: 2471

You can use this sed too, with the same potential memory issue as Sundeep's answer.

sed -n '/bbb/,/ccc/p;/ccc/!b;:A;N;/\n.*ccc/!bA;s/[^\n]*\n//;p;s/.*//;bA' infile
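
Broken out for readability. One assumption worth flagging: s/.*// is used here to clear the whole multi-line pattern space, which works with GNU sed (where . also matches the newlines that N embeds) but may not with other seds; lines buffered after the last ccc are silently discarded when N hits end of input under -n.

/bbb/,/ccc/p    # print from the first bbb through the first ccc
/ccc/!b         # any other line without ccc: start the next cycle
:A              # from a ccc line onward, accumulate lines...
N
/\n.*ccc/!bA    # ...until another ccc arrives
s/[^\n]*\n//    # drop the first buffered line (already printed, or empty)
p               # print up to and including the new ccc
s/.*//          # clear the pattern space (GNU sed)
bA              # go back to accumulating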

Upvotes: 0

Ed Morton

Reputation: 203712

You said you want the fastest solution because your file is huge, but you probably also need the most memory-efficient solution for the same reason: in a tradeoff between a script running slowly and a script running out of memory, speed of execution takes second place. You might also find that a script that seems fast at first starts slowing down as it eats up memory.

So, IMHO, the simplest and most robust approach (it holds nothing but two line numbers in memory) is two passes: one to identify the beginning and ending line numbers, and a second to print all lines between those points:

$ awk -v beg='[bbb]' -v end='[ccc]' '
    NR==FNR { if (($1 == beg) && !begFnr) begFnr=FNR; if ($1 == end) endFnr=FNR; next }
    FNR>=begFnr && FNR<=endFnr
' file file
[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc
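
Since both line numbers are known once the first pass finishes, the second pass can also stop reading as soon as it moves past the end line. A small variant of the above (the FNR>endFnr { exit } rule is the only addition, not part of the original script):

$ awk -v beg='[bbb]' -v end='[ccc]' '
    NR==FNR { if (($1 == beg) && !begFnr) begFnr=FNR; if ($1 == end) endFnr=FNR; next }
    FNR>endFnr { exit }
    FNR>=begFnr && FNR<=endFnr
' file file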

Upvotes: 1

Nic3500

Reputation: 8611

The OP asked me to post my Perl solution, in case it might help someone else.

It scans the input file only once. It requires, at most, double the disk space the input file already takes (the input file plus the result, if the entire input file falls between the start and end patterns). I decided to buffer on disk, since memory might not be large enough if the file is very big.

Here is the code:

#!/usr/bin/perl -w
#
################################################################################

use strict;

my($inputfile);
my($outputfile);
my($bufferfile) = "/tmp/bufferfile.tmp";
my($startpattern);
my($endpattern);

#################################################
# Subroutines
#################################################
sub show_usage
{
    print("Takes 4 arguments:\n");
    print("   1) the name of the file to process.\n");
    print("   2) the name of the output file.\n");
    print("   3) the start pattern.\n");
    print("   4) the end pattern.\n");
    exit;
}

sub close_outfiles
{
    close(OUTPUTFILE);
    close(BUFFERFILE);
}

sub cat_buffer_to_output
{
    # Open outputfile in append mode
    open(OUTPUTFILE,">>","$outputfile") or die "ERROR: could not open outputfile $outputfile (append mode)!";
    # Open bufferfile in read mode
    open(BUFFERFILE,"$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (read mode)!";
    # Dump the content of the buffer to the output
    print OUTPUTFILE while <BUFFERFILE>;
    close_outfiles();
    # Reopen the bufferfile, with > to truncate it
    open(BUFFERFILE,">","$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (write mode)!";
}

#################################################
# Main
#################################################

# Manage arguments
if (@ARGV != 4)
{
    show_usage();
}
else
{
    $inputfile = $ARGV[0];
    $outputfile = $ARGV[1];
    $startpattern = $ARGV[2];
    $endpattern = $ARGV[3];
}

# Open the files, the first time
open(INPUTFILE,"$inputfile") or die "ERROR: could not open inputfile $inputfile (read mode)!";
open(OUTPUTFILE,">","$outputfile") or die "ERROR: could not open outputfile $outputfile (write mode)!";
open(BUFFERFILE,">","$bufferfile") or die "ERROR: could not open bufferfile $bufferfile (write mode)!";

my($sendtobuffer) = 0;

while (<INPUTFILE>)
{
    # If I see the endpattern (and the startpattern has already been
    # matched at least once), empty the buffer file into the output file
    if ($sendtobuffer && $_ =~ /$endpattern/)
    {
        print BUFFERFILE;
        cat_buffer_to_output();
    }
    else
    {
        # if sendtobuffer is set, the start pattern was seen at least once: print to BUFFERFILE
        if ($sendtobuffer)
        {
            print BUFFERFILE;
        }
        else
        {
            # if I see the start pattern, print to buffer and print future lines to buffer as well
            if ($_ =~ /$startpattern/)
            {
                print BUFFERFILE;
                $sendtobuffer = 1;
            }
        }
    }
}

# Close files
close(INPUTFILE);
close_outfiles();

# cleanup
unlink($bufferfile);

Basically, it reads through the input file. When it sees the start pattern for the first time, it starts writing lines to a buffer file. Every time the end pattern is seen after that, it dumps the contents of the buffer file into the output file and truncates the buffer file. It keeps doing this until end of file, so whatever sits in the buffer after the last end match is never copied, which is exactly what trims the output at the last ccc.
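
Usage, with the question's sample data saved in input.txt (the script name between.pl is just a placeholder):

./between.pl input.txt output.txt bbb ccc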

Upvotes: 0

Sundeep

Reputation: 23677

Using awk with a buffer to save the lines between occurrences of ccc; this might run into memory issues if there is a huge gap between two occurrences of ccc.

$ awk 's{buf=buf?buf RS $0:$0; if(/ccc/){print buf; buf=""} next}
       /bbb/{f=1} f; /ccc/{s=1}' ip.txt
[bbb] text1.5 <- first bbb
[aaa] text2
[bbb] text3
[bbb] text4
[bbb] text5
[zzz] text5.5
[ccc] text6
[ddd] text6.5
[ccc] text7 <- last ccc
  • /bbb/{f=1} f; /ccc/{s=1} prints the lines between the first occurrence of bbb and the first occurrence of ccc, and sets the s flag once the first ccc is seen
  • after s is set:
    • buf=buf?buf RS $0:$0; accumulates lines in a buffer
    • if(/ccc/){print buf; buf=""} prints the buffer contents and clears the buffer whenever a line contains ccc
    • next because the rest of the code doesn't need to run for these lines


An equivalent single-block version can also be used; here every line from the first bbb onward goes through the buffer, and whatever is still buffered after the last ccc is simply discarded at end of input:

awk 'f || /bbb/{buf=buf?buf RS $0:$0; if(/ccc/){print buf; buf=""} f=1}' ip.txt

Upvotes: 1

ghoti

Reputation: 46856

You've already got a sed solution that partially works. A more "efficient" sed solution would require an unknown amount of memory to be used as a buffer, which might be problematic depending on your data and your system.

Another possibility might be to use awk. The following should work with most versions of awk...

awk 'NR==FNR && $1~/bbb/ && !a { a=NR } NR==FNR && $1~/ccc/ { b=NR } NR==FNR {next} FNR >= a && FNR <= b' file.txt file.txt

Broken out for easier reading and commenting

# If we're reading first file, and we see our start pattern,
# and we haven't seen it before, set "a" as our start record.
NR==FNR && $1~/bbb/ && !a { a=NR }

# If we're reading the first file, and we see our end pattern,
# set "b" as our end record.
NR==FNR && $1~/ccc/ { b=NR }

# If we're in the first file, move on to the next line.
NR==FNR {next}

# Now that we're in the second file...  If the current line is
# between (or inclusive of) our start/end records, print the line.
FNR >= a && FNR <= b

While this does read the file twice, it doesn't store any large quantities of data in memory.

Upvotes: 2

Nic3500

Reputation: 8611

Someone will probably come up with a one-liner, but I got this:

#!/bin/bash
#
start=$(grep -n bbb data | head -1 | cut -d':' -f1)
end=$(grep -n ccc data | tail -1 | cut -d':' -f1)

sed -n "${start},${end}p" data

Get the start line, get the end line, print between these numbers.
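
For multi-GB files, two small tweaks reduce the scanning (assuming your grep supports -m, as GNU and BSD grep do): stop the first grep at its first match, and make sed quit once the end line has been printed so it doesn't read the rest of the file:

start=$(grep -m 1 -n bbb data | cut -d':' -f1)
end=$(grep -n ccc data | tail -1 | cut -d':' -f1)

sed -n "${start},${end}p;${end}q" data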

Upvotes: 2
