Reputation: 797
This question is very much the same as this one, except that I am looking to do it as fast as possible, making only a single pass over the (unfortunately gzip-compressed) file.
Given the pattern CAPTURE and input:
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, since regular expressions these days provide lookaheads, etc.
Upvotes: 3
Views: 546
Reputation: 797
When I posted this question, the problem at hand involved several huge gzip-compressed log files generated by a Java application. The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to that event from these files. The problem could not be solved with a trivial grep for EventId simply because the exception lines can be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception, in which case the answers posted here would not print the stacktrace lines. However, it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    # a line belongs to the event if it mentions the EventId, or if it is a
    # continuation (stacktrace) line, i.e. one that does not start a new
    # (AppName) log record
    if ( $line=~/EventId/ or ($first==0 and $line!~/\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
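Like the original, the script reads from standard input, so the compressed logs can be fed to it directly (the file and script names below are only placeholders):
zcat application.log.gz | perl extract_event.pl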
I am still wondering whether the faster solutions (mainly Walter's sed solution and haukex's in-memory Perl solution) could be modified to do the same.
Upvotes: 0
Reputation: 58430
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the previous two instructions. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the previous three instructions.
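The same script spread over multiple lines with my own comments added (GNU sed accepts comment lines); the behavior is unchanged:
sed '
  # delete every line before the first one containing CAPTURE
  /CAPTURE/!d
  :a
  # print the pattern space (auto-print is on), then load the next line
  n
  :b
  # the empty regex // reuses /CAPTURE/: on a match, go back and print
  //ba
  # on the last line with no match pending, discard the buffered tail
  $d
  # otherwise append the next input line and test again
  N
  bb
' file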
Having studied the test files used for haukex's benchmark, it would seem that sed alone is not the right tool for extracting from such a file. Using a mixture of csplit, grep and sed presents a reasonably fast solution, as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
        sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all lines that match CAPTURE, plus the following context line. sed can manipulate the results of grep into the two line numbers which are supplied to the csplit command.
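To make the mechanics concrete: the second split point has to be the line after the last match, which is exactly what the -A1 context line provides. On the sample input above (matches on lines 100 and 200; all numbers here are illustrative), the pipeline reduces to:
# grep -nTA1 prints every matching line number plus the line after it:
#   100:   CAPTURE
#   101-   ...........
#   ...
#   200:   CAPTURE
#   201-   ...........
# sed keeps only the first and last of those numbers, so lines="100 201"
csplit -s oldFile 100 201         # xx00 = lines 1-99, xx01 = lines 100-200,
rm xx00 xx02 && mv xx01 newFile   # xx02 = lines 201 onwards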
When run against the test files (as above) I get timings around 10 seconds.
Upvotes: 0
Reputation: 20002
Find the first CAPTURE and look back for the last one. (ed starts with the current address at the last line of the buffer, so the forward search /CAPTURE/ wraps around to the first match, while the backward search ?CAPTURE? finds the last one.)
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know what memory complications this would have for a large file, and you would probably prefer a streaming solution anyway.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time, only remember the last line read (by putting it in the hold space).
Now change tactics.
Append each line to the hold space. You do not know when to flush it until the next match.
When you find the next match, recall the hold space and print it.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the hold space with that line.
The complete solution is:
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
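The same awk program expanded with my own comments; the behavior is unchanged:
gunzip -c inputfile.gz | awk '
    /CAPTURE/ { capt = 1 }       # from the first match onwards, start collecting
    capt == 1 { a[i++] = $0 }    # buffer every line, including this one
    /CAPTURE/ {                  # on each match, flush the buffer; anything
        for (j = 0; j < i; j++)  # buffered after the last match never prints
            print a[j]
        i = 0
    }'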
Upvotes: 2
Reputation: 3013
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;  # buffer lines once the first match has been seen
    if ( $line=~/CAPTURE/ ) {
        if ($first) {                # first match: start the buffer with just this line
            @buf = ($line);
            $first = 0;
        }
        print @buf;                  # flush everything up to and including this match
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
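The same regex written out with the /x modifier and my own comments; the behavior should be identical:
zcat file.gz | perl -0777ne '
    /(                  # capture group 1: the whole block to print
      (^.*CAPTURE.*$)   # group 2: the first line containing CAPTURE
      (?s:.*)           # greedily skip anything, newlines included, then
      (?2)              # backtrack to the last line that matches group 2
      (?:\z|\n)         # and take the trailing newline, if any
    )/mx and print $1'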
Upvotes: 2
Reputation: 302
Here is one more example with a regex (the drawback is that if the file is large, it will consume a lot of memory):
#!/usr/bin/perl
{
    local $/ = undef;    # slurp mode: read the whole file in one go
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
perl -0777ne 'print $1 if /([^\n]+(CAPTURE).*\2.*?)\n/s' file.tmp
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Upvotes: 0
Reputation: 246847
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
The first sed keeps everything from the first CAPTURE through the end of the file, tac reverses that, the second sed keeps everything from what is now the first CAPTURE (originally the last one), and the final tac restores the original order.
Upvotes: 2
Reputation: 67507
I don't think a regex will be faster than a double scan...
Here is an awk solution (double scan):
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
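The same program expanded with my own comments (file{,} is shell brace expansion for "file file", which is what makes awk read the input twice); the behavior is unchanged:
awk '
    # pass 1 (NR==FNR): record the line number of every matching line
    /pattern/ && NR == FNR { a[++f] = NR; next }
    # pass 2: print only the lines between the first and the last recorded match
    a[1] <= FNR && FNR <= a[f]
' file file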
Alternatively, if you have any a priori information about where the patterns appear in the file, you can use heuristic approaches that will be faster in those special cases.
Upvotes: 0