Reputation: 28695

Delete duplicate lines only if they match a pattern

This question has a great answer saying you can use awk '!seen[$0]++' file.txt to delete non-consecutive duplicate lines from a file. How can I delete non-consecutive duplicate lines from a file only if they match a pattern? e.g. only if they contain the string "#####"

Example input

deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'

Desired output

deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'

Upvotes: 5

Answers (4)

potong

Reputation: 58453

This might work for you (GNU sed):

sed '/#$/{G;/^\(\S*\s\).*\1/!P;h;d}' file

All lines other than those of interest are printed as normal.

Append previous lines of interest to the current line and using pattern matching, if such a line has not been encounter before, print it. Then store the pattern space back in the hold space, ready for the next match and delete the pattern space.

Upvotes: 0

stack0114106

Reputation: 8711

Try this Perl command line regex solution using file slurp mode.

perl -0777 -ne ' $z=$y=$_; 
                 while( $y ne $x) 
                 { $z=~s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse ; $x=$y ;$y=$z } ; 
                 print "$z" '

with the given inputs

$ cat toucan.txt
deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'

$ perl -0777 -ne ' $z=$y=$_; while( $y ne $x) { $z=~s/(^[^\n]+?\s+##########.*?$)(.+?)\K(\1\n)//gmse ; $x=$y ;$y=$z } ; print "$z" ' toucan.txt
deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'

$

Upvotes: 2

Wiktor Stribiżew

Reputation: 626950

You may use

awk '!/#####/ || !seen[$0]++'

Or, as Ed Morton suggests, a synonymical

awk '!(/#####/ && seen[$0]++)'

Here, !seen[$0]++ does the same thing as usual, it will remove any duplicated line. The !/#####/ part matches lines that contain a ##### pattern and negates the match. The two patterns combined with || will remove all duplicate lines having ##### pattern inside them.

See an online awk demo:

s="deleteme.txt ##########
1219:                            'PCM BE PTP'
deleteme.txt ##########
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1223  #####:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1225:                          , 'PCM FE/MID PTP'"
awk '!/#####/ || !seen[$0]++' <<< "$s"

Output:

deleteme.txt ##########
1219:                            'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
deleteme2.txt ##########
1222:                          , 'PCM BE PTP UT'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1223  #####:                          , 'PCM BE PTP'
1221:                          , 'PCM FE/MID PTP UT','PCM IA 1 PTP'
1225:                          , 'PCM FE/MID PTP'

Upvotes: 8

Jamin Kortegard

Reputation: 376

Whenever I think about matching patterns and selective printing, I think of the the Practical Extraction and Report Language: Perl! Here's a Perl one-liner that does what you're asking. You should be able to copy-paste this into a shell and have it work:

perl -wnle 'BEGIN { $rows_with_five_hashes = {}; } $thisrow = $_; if ($thisrow =~ /[#]{5}/) { if (!exists $rows_with_five_hashes->{$thisrow}) { print; } $rows_with_five_hashes->{$thisrow}++; } else { print; }' input.txt

Here's the same Perl with line breaks and comments for clarity (note: this isn't executable as-is):

BEGIN {
  # create a counter for rows that match the pattern
  $rows_with_five_hashes = {}; 
} 
# capture the row from the input file
$thisrow = $_;
if ($thisrow =~ /[#]{5}/) { 
  if (!exists $rows_with_five_hashes->{$thisrow}) { 
    # this row matches the pattern and we haven't seen it before
    print; 
  } 
  # Increment the counter for rows that match the pattern.
  # Do this AFTER we print, or else our "exists" print logic fails.
  $rows_with_five_hashes->{$thisrow}++;
} 
else { 
  # print all rows that don't match the pattern
  print;
}

Ruby has similar "one-liner" functionality for running code directly on the command line (much of which it borrowed from Perl).

For more info on the wnle command line switches, check out the Perl docs about that. If you had many files you wanted to modify in place and keep backup copies of the originals with a single Perl command, check out the -i switch in those docs.

If you found yourself running this all the time and wanted to keep a handy executable script, you could adapt this pretty easily to run on just about any system that has a Perl interpreter.

Upvotes: 0

Delete duplicate lines only if they match a pattern

Answers (4)

Related Questions