Reputation: 797
This question is very much the same as this one, except that I am looking to do it as fast as possible, making only a single pass over the (unfortunately gzip-compressed) file.
Given the pattern CAPTURE and input:
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, since regular expressions these days provide lookaheads, etc.
Upvotes: 3
Views: 546
Reputation: 797
When I posted this question, the problem at hand involved several huge gzip-compressed log files generated by a Java application. The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to that event from these files. The problem could not be solved with a trivial grep for EventId simply because the exception lines can be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception, in which case the answers posted here would not print the stacktrace lines. However, it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    # a line belongs to the event if it mentions the EventId, or if it is a
    # continuation (stacktrace) line, i.e. one that does not start a new
    # (AppName) log record
    if ( $line=~/EventId/ or ($first==0 and $line!~/\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
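Like the original, the script reads from standard input, so the compressed logs can be fed to it directly (the file and script names below are only placeholders):
zcat application.log.gz | perl extract_event.pl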
I am still wondering whether the faster solutions (mainly Walter's sed solution and haukex's in-memory Perl solution) could be modified to do the same.
Upvotes: 0
Reputation: 58430
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the previous two instructions. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the previous three instructions.
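The same script spread over multiple lines with my own comments added (GNU sed accepts comment lines); the behavior is unchanged:
sed '
  # delete every line before the first one containing CAPTURE
  /CAPTURE/!d
  :a
  # print the pattern space (auto-print is on), then load the next line
  n
  :b
  # the empty regex // reuses /CAPTURE/: on a match, go back and print
  //ba
  # on the last line with no match pending, discard the buffered tail
  $d
  # otherwise append the next input line and test again
  N
  bb
' file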
Having studied the test files used for haukex's benchmark, it would seem that sed alone is not the right tool for extracting from such a file. Using a mixture of csplit, grep and sed presents a reasonably fast solution, as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
        sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all lines that match CAPTURE, plus the following context line. sed can manipulate the results of grep into the two line numbers which are supplied to the csplit command.
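To make the mechanics concrete: the second split point has to be the line after the last match, which is exactly what the -A1 context line provides. On the sample input above (matches on lines 100 and 200; all numbers here are illustrative), the pipeline reduces to:
# grep -nTA1 prints every matching line number plus the line after it:
#   100:   CAPTURE
#   101-   ...........
#   ...
#   200:   CAPTURE
#   201-   ...........
# sed keeps only the first and last of those numbers, so lines="100 201"
csplit -s oldFile 100 201         # xx00 = lines 1-99, xx01 = lines 100-200,
rm xx00 xx02 && mv xx01 newFile   # xx02 = lines 201 onwards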
When run against the test files (as above) I get timings around 10 seconds.
Upvotes: 0
Reputation: 20002
Find the first CAPTURE and look back for the last one. (ed starts with the current address at the last line of the buffer, so the forward search /CAPTURE/ wraps around to the first match, while the backward search ?CAPTURE? finds the last one.)
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know what memory complications this would have for a large file, and you would probably prefer a streaming solution anyway.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time, only remember the last line read (by putting it in the hold space).
Now change tactics.
Append each line to the hold space. You do not know when to flush it until the next match.
When you find the next match, recall the hold space and print it.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the hold space with that line.
The complete solution is:
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
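The same awk program expanded with my own comments; the behavior is unchanged:
gunzip -c inputfile.gz | awk '
    /CAPTURE/ { capt = 1 }       # from the first match onwards, start collecting
    capt == 1 { a[i++] = $0 }    # buffer every line, including this one
    /CAPTURE/ {                  # on each match, flush the buffer; anything
        for (j = 0; j < i; j++)  # buffered after the last match never prints
            print a[j]
        i = 0
    }'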
Upvotes: 2
Reputation: 3013
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;  # buffer lines once the first match has been seen
    if ( $line=~/CAPTURE/ ) {
        if ($first) {                # first match: start the buffer with just this line
            @buf = ($line);
            $first = 0;
        }
        print @buf;                  # flush everything up to and including this match
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
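The same regex written out with the /x modifier and my own comments; the behavior should be identical:
zcat file.gz | perl -0777ne '
    /(                  # capture group 1: the whole block to print
      (^.*CAPTURE.*$)   # group 2: the first line containing CAPTURE
      (?s:.*)           # greedily skip anything, newlines included, then
      (?2)              # backtrack to the last line that matches group 2
      (?:\z|\n)         # and take the trailing newline, if any
    )/mx and print $1'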
Upvotes: 2
Reputation: 302
Here is one more example with a regex (the drawback is that if the file is large, it will consume a lot of memory):
#!/usr/bin/perl
{
    local $/ = undef;    # slurp mode: read the whole file in one go
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
perl -0777ne 'print $1 if /([^\n]+(CAPTURE).*\2.*?)\n/s' file.tmp
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Upvotes: 0
Reputation: 246847
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
The first sed keeps everything from the first CAPTURE through the end of the file, tac reverses that, the second sed keeps everything from what is now the first CAPTURE (originally the last one), and the final tac restores the original order.
Upvotes: 2
Reputation: 67507
I don't think a regex will be faster than a double scan...
Here is an awk solution (double scan):
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
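The same program expanded with my own comments (file{,} is shell brace expansion for "file file", which is what makes awk read the input twice); the behavior is unchanged:
awk '
    # pass 1 (NR==FNR): record the line number of every matching line
    /pattern/ && NR == FNR { a[++f] = NR; next }
    # pass 2: print only the lines between the first and the last recorded match
    a[1] <= FNR && FNR <= a[f]
' file file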
Alternatively, if you have any a priori information about where the patterns appear in the file, you can use heuristic approaches that will be faster in those special cases.
Upvotes: 0