Reputation: 28695
This question has a great answer showing that you can use awk '!seen[$0]++' file.txt
to delete non-consecutive duplicate lines from a file. How can I delete non-consecutive duplicate lines from a file only if they match a pattern, e.g. only if they contain the string "#####"?
Example input
deleteme.txt ##########
1219: 'PCM BE PTP'
deleteme.txt ##########
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225: , 'PCM FE/MID PTP'
Desired output
deleteme.txt ##########
1219: 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225: , 'PCM FE/MID PTP'
Upvotes: 5
Views: 2176
Reputation: 58453
This might work for you (GNU sed):
sed '/#$/{G;/^\(\S*\s\).*\1/!P;h;d}' file
All lines other than those of interest are printed as normal.
For a line of interest, append the previously seen lines (kept in the hold space) to the current line; if pattern matching shows that such a line has not been encountered before, print it. Then store the pattern space back in the hold space, ready for the next match, and delete the pattern space.
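As a quick check, here is the command run against the question's sample input (a sketch; it assumes GNU sed, since \S and \s are GNU extensions, and input.txt is a hypothetical file name):

```shell
# Recreate the question's sample input (hypothetical file name input.txt).
cat > input.txt <<'EOF'
deleteme.txt ##########
1219: 'PCM BE PTP'
deleteme.txt ##########
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225: , 'PCM FE/MID PTP'
EOF

# GNU sed: dedupe lines ending in '#' by their first field, keep everything else.
sed '/#$/{G;/^\(\S*\s\).*\1/!P;h;d}' input.txt
```

Note that the duplicate test keys on the first whitespace-delimited field of each matching line (the file name), not on the whole line, which is sufficient for this input.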
Upvotes: 0
Reputation: 8711
Try this Perl command-line regex solution, which uses file slurp mode (-0777) and repeats the substitution until no more duplicates are removed.
perl -0777 -ne '$z = $y = $_;
  while ( $y ne $x ) {
    $z =~ s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse;
    $x = $y;
    $y = $z;
  }
  print "$z"'
with the given inputs
$ cat toucan.txt
deleteme.txt ##########
1219: 'PCM BE PTP'
deleteme.txt ##########
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225: , 'PCM FE/MID PTP'
$ perl -0777 -ne ' $z=$y=$_; while( $y ne $x) { $z=~s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse ; $x=$y ;$y=$z } ; print "$z" ' toucan.txt
deleteme.txt ##########
1219: 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225: , 'PCM FE/MID PTP'
$
Upvotes: 2
Reputation: 626950
You may use
awk '!/#####/ || !seen[$0]++'
Or, as Ed Morton suggests, the equivalent
awk '!(/#####/ && seen[$0]++)'
Here, !seen[$0]++ does the same thing as usual: it removes any duplicated line. The !/#####/ part is true for lines that do not contain the ##### pattern. Combined with ||, the two conditions print every line that lacks ##### unconditionally, and print a line containing ##### only on its first occurrence; in other words, all duplicate lines having ##### inside them are removed.
See an online awk
demo:
s="deleteme.txt ##########
1219: 'PCM BE PTP'
deleteme.txt ##########
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223 #####: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225: , 'PCM FE/MID PTP'"
awk '!/#####/ || !seen[$0]++' <<< "$s"
Output:
deleteme.txt ##########
1219: 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222: , 'PCM BE PTP UT'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223 #####: , 'PCM BE PTP'
1221: , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225: , 'PCM FE/MID PTP'
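If the marker string varies, the same logic can take the pattern as a variable via -v; $0 !~ pat is the dynamic-regexp form of !/#####/. A hedged sketch (pat and the sample lines here are illustrative, not from the question):

```shell
# Dedupe only lines matching a pattern passed in with -v (illustrative sketch).
printf '%s\n' 'a #####' 'b' 'a #####' 'b' |
awk -v pat='#####' '$0 !~ pat || !seen[$0]++'
# the repeated 'a #####' is dropped; both plain 'b' lines are kept
```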
Upvotes: 8
Reputation: 376
Whenever I think about matching patterns and selective printing, I think of the Practical Extraction and Report Language: Perl! Here's a Perl one-liner that does what you're asking. You should be able to copy-paste this into a shell and have it work:
perl -wnle 'BEGIN { $rows_with_five_hashes = {}; } $thisrow = $_; if ($thisrow =~ /[#]{5}/) { if (!exists $rows_with_five_hashes->{$thisrow}) { print; } $rows_with_five_hashes->{$thisrow}++; } else { print; }' input.txt
Here's the same Perl with line breaks and comments for clarity (note: this isn't executable as-is):
BEGIN {
    # create a counter for rows that match the pattern
    $rows_with_five_hashes = {};
}
# capture the row from the input file
$thisrow = $_;
if ($thisrow =~ /[#]{5}/) {
    if (!exists $rows_with_five_hashes->{$thisrow}) {
        # this row matches the pattern and we haven't seen it before
        print;
    }
    # Increment the counter for rows that match the pattern.
    # Do this AFTER we print, or else our "exists" print logic fails.
    $rows_with_five_hashes->{$thisrow}++;
}
else {
    # print all rows that don't match the pattern
    print;
}
Ruby has similar "one-liner" functionality for running code directly on the command line (much of which it borrowed from Perl).
For more info on the -w, -n, -l, and -e command-line switches, check out the perlrun section of the Perl docs. If you have many files you want to modify in place, keeping backup copies of the originals, with a single Perl command, check out the -i
switch in those docs.
If you found yourself running this all the time and wanted to keep a handy executable script, you could adapt this pretty easily to run on just about any system that has a Perl interpreter.
Upvotes: 0