Reputation: 8711
I'm running several Spark jobs, which produce log output like the lines below while a job is waiting for resources.
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
After the job completes, I want to reduce the log file size by removing the redundant messages with a Perl command. I want output like the following, keeping only the first and last lines.
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
I tried something like the command below, but since the timestamp changes on every line, I'm not able to use a greedy quantifier.
perl -0777 -ne ' { s/(^\d\d\/\d\d\/\d\d \d\d:.+? Could not get any http protocol, using HTTP and will try to get protocol again.)+/$1/mg;print } ' log-file.
In the actual scenario, the log section can repeat multiple times, something like this:
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
Can this be solved using regex?
Required output:
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
Upvotes: 0
Views: 122
Reputation: 66883
A regex solution
use warnings;
use strict;
use feature 'say';
die "Usage: $0 file\n" if not @ARGV;
my $fc = do { local $/; <> }; # file content
my $d2 = qr{[0-9]{2}};
my $ts = qr{$d2/$d2/$d2 $d2:$d2:$d2};
$fc =~ s{ ($ts\s*(.*?)\n) (?: $ts\s*\g{-1}\n )+ ( $ts\s*\g{-2}\n ) }{$1$3}gx;
say $fc;
It matches a timestamp, then the rest of the line. Then a non-capturing group, timestamp + what-was-last-captured†, is matched as many times as possible, up to one more such line that is captured, since we need the first and the last lines while the timestamps differ. It stops at any other text. Then it repeats, by /g.
Another way to retain the first and last lines in each group is to capture inside the non-capturing group and use that capture in the replacement, since it will hold the last such line in the group of lines:
$fc =~ s{ ($ts\s*(.*?)\n) (?: ($ts\s*\g{2}\n) )+ }{$1$3}gx;
Now we need to count groups for our backreference, with \g{2}, rather than use a relative one.
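The second variant relies on a general regex behavior: a capture group placed inside a quantified group keeps only the text from its final iteration. A minimal sketch on a toy string (the string and the variable names are made up for illustration and are not part of the log):

```perl
use strict;
use warnings;

# Toy input, not from the log: three tokens that share a prefix.
my $toy = "a1 a2 a3 ";

# The capture group sits inside a repeated non-capturing group,
# so after a successful match it holds the last iteration only.
my ($last_seen) = $toy =~ /^(?: (a\d) \s )+/x;

print "last captured: $last_seen\n";
```

This is why $3 in the substitution ends up holding the last line of each run rather than the second one.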
Both variants above produce the desired output with the given input
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
This wasn't tested for possible weird input or edge cases, other than for the one provided in TLP's answer (thanks TLP), but I don't expect surprises. Please report if there's unexpected behavior.
This can be squeezed into a one-liner, if for some reason that's required.
† Via a relative backreference, \g{-1}
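As a small illustration of the footnote, here is a sketch comparing a relative and an absolute backreference on a toy string (the string and variable names are assumptions for the example). \g{-1} refers to the capture group most recently opened before it, which is group 2 here:

```perl
use strict;
use warnings;

# Toy input: the second and third words are identical.
my $line = "foo bar bar";

# Relative: \g{-1} is the capture group opened just before this point (group 2).
my $relative = $line =~ /(\w+) \s (\w+) \s \g{-1}/x ? 1 : 0;

# Absolute: the same group, addressed by its number.
my $absolute = $line =~ /(\w+) \s (\w+) \s \g{2}/x ? 1 : 0;

print "relative matched: $relative, absolute matched: $absolute\n";
```

The relative form keeps working if more capture groups are later added at the start of the pattern, while the absolute form has to be renumbered.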
Upvotes: 2
Reputation: 67900
While using a regex may be possible, this can easily be solved with normal Perl code. I think the code is clearer and easier to maintain. I added 3 lines to your sample input to test for the edge case that we end on a line which matches our search.
use strict;
use warnings;

# This string can be replaced as needed
my $str = "INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again";
my ($first, $last);
while (<DATA>) {
    if (/\Q$str/) {                    # if pattern matches current line
        if ($first) {                  # if this is an "in between" line
            $last = $_;                # save line and go next
        } else {                       # if this is the first line
            print if not eof;          # print it..
            $first = $_;               # ...save line and go next
        }
        print if eof;                  # print last line to avoid edge cases
    } elsif ($first) {                 # $str didn't match: finished a range of lines
        print $last if defined $last;  # a one-line range has no separate last line
        print;                         # print the current line and reset
        $first = undef;
        $last  = undef;
    } else {
        print;                         # print everything else
    }
}
__DATA__
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
Output:
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
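If the repeated message is not known in advance, the same line-by-line idea can compare whatever follows the timestamp instead of a fixed string. The following is only a sketch under assumptions: the helper name collapse_runs is hypothetical, and the timestamp format is taken from the sample log.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of consecutive lines that differ
# only in their leading timestamp down to the first and last line of the run.
sub collapse_runs {
    my @lines = @_;
    my $ts = qr{^\d\d/\d\d/\d\d \d\d:\d\d:\d\d\s*};   # format assumed from the sample
    my (@out, $prev_msg, $pending);
    for my $line (@lines) {
        my ($msg) = $line =~ /$ts(.*)/;    # undef when the line has no timestamp
        if (defined $msg and defined $prev_msg and $msg eq $prev_msg) {
            $pending = $line;              # candidate for the last line of the run
            next;
        }
        push @out, $pending if defined $pending;   # flush the finished run's last line
        undef $pending;
        push @out, $line;                  # first line of a run, or ordinary text
        $prev_msg = $msg;
    }
    push @out, $pending if defined $pending;       # a run that ends the input
    return @out;
}

# Small made-up sample in the same shape as the log
my @sample = map { "$_\n" } (
    "sometext1",
    "22/01/03 14:42:25 INFO waiting",
    "22/01/03 14:42:27 INFO waiting",
    "22/01/03 14:42:29 INFO waiting",
    "sometext2",
);
print collapse_runs(@sample);
```

To run it over a real file, the sample array could be replaced with the lines read from that file.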
Upvotes: 2
Reputation: 10360
You can try this regex:
^(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+
Explanation:
^ - matches the start of a line
(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*) - First log line is stored as group 1
    \d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2} - matches the pattern of format XX/XX/XX XX:XX:XX where X is a digit
    (.*$) - matches everything until the end of the line. Whatever is matched is stored in Group 2. The actual log (without the timestamp) is stored in this group.
    \s* - matches 0 or more whitespaces
(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+ - matches all the remaining continuous log lines starting with the format XX/XX/XX XX:XX:XX followed by contents of group 2, but only the last such log line will be stored in group 3
Now, replace each match with contents of group 1 followed by group 3: $1$3
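Here is a sketch of how this pattern might be applied in Perl. The /m flag is needed so that ^ and $ match at line boundaries, and /g repeats the substitution for every run. The helper name keep_first_and_last and the sample text are assumptions for the example.

```perl
use strict;
use warnings;

# The pattern from above; /m lets ^ and $ match at line boundaries.
my $re = qr/^(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+/m;

# Hypothetical helper: returns the text with each run of repeated log
# lines reduced to its first (group 1) and last (group 3) line.
sub keep_first_and_last {
    my ($text) = @_;
    $text =~ s/$re/$1$3/g;
    return $text;
}

# Small made-up sample in the same shape as the log
my $sample = join '', map { "$_\n" } (
    "sometext1",
    "22/01/03 14:42:25 INFO waiting",
    "22/01/03 14:42:27 INFO waiting",
    "22/01/03 14:42:29 INFO waiting",
    "sometext2",
);
print keep_first_and_last($sample);
```

Note that \s* also consumes blank lines between two matching log lines, so a run separated only by empty lines is treated as one run.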
Upvotes: 2