Remove duplicate lines in each paragraph of a file using sed or awk

Question

I want to remove duplicate lines from the paragraphs of a file that begin with "SET CURRENT". A line counts as a duplicate only if it has already appeared in a paragraph with the same first line. Identical lines that belong to paragraphs with a different first line must be kept. For example:

If I have the following file:

SET CURRENT = 'aaa' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file2 FOR 1000.file2 ;
CREATE SYN file3 FOR 1001.file3 ;
CREATE SYN file3 FOR 1001.file3 ;

SET CURRENT = 'aaa' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file2 FOR 1000.file2 ;
CREATE SYN file7 FOR 1000.file7 ;

SET CURRENT = 'bbb' ;
CREATE SYN file5 FOR 1002.file5 ;
CREATE SYN file6 FOR 1003.file6 ;

SET CURRENT = 'bbb' ;  
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file8 FOR 1002.file8 ;
CREATE SYN file6 FOR 1003.file6 ;

the result should be:

SET CURRENT = 'aaa' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file2 FOR 1000.file2 ;
CREATE SYN file3 FOR 1001.file3 ;

SET CURRENT = 'aaa' ;
CREATE SYN file7 FOR 1000.file7 ;

SET CURRENT = 'bbb' ;
CREATE SYN file5 FOR 1002.file5 ;
CREATE SYN file6 FOR 1003.file6 ;

SET CURRENT = 'bbb' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file8 FOR 1002.file8 ;

user000001 · Accepted Answer

With awk you could do something like this:

awk 'NF==0{print;next};/^SET CURRENT/{c=$4;print;next}!seen[c,$0]++' file

With some comments to make it more readable:

awk ' NF == 0 {       # If we find an empty line
          print       # print the line
          next        # and skip to the next record
      }
      /^SET CURRENT/{ # If we find a line beginning with "SET CURRENT"
          c = $4      # Store the value in the 4th field
          print       # Print the current line
          next        # and skip to the next record  
      }
      !seen[c,$0]++   # Print if the combination of the "c" value
                      # and the current line has not been stored
                      # in array "seen", and then store the
                      # combination in the array
                      # (so that later duplicates are not printed)
      ' file
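
As a quick sanity check, here is a minimal sketch that runs the one-liner on a shortened version of the sample from the question. The variable names sample and output are only for this demo, and the input is fed through a pipe instead of a file:

```shell
#!/bin/sh
# Shortened copy of the question's sample (an assumption made for brevity).
sample="SET CURRENT = 'aaa' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file3 FOR 1001.file3 ;
CREATE SYN file3 FOR 1001.file3 ;

SET CURRENT = 'aaa' ;
CREATE SYN file1 FOR 1000.file1 ;
CREATE SYN file7 FOR 1000.file7 ;

SET CURRENT = 'bbb' ;
CREATE SYN file1 FOR 1000.file1 ;"

# Same program as above; awk reads standard input when no file is named.
output=$(printf '%s\n' "$sample" |
    awk 'NF==0{print;next};/^SET CURRENT/{c=$4;print;next}!seen[c,$0]++')

printf '%s\n' "$output"
```

The repeated file3 line and the second file1 line under 'aaa' should disappear, while the file1 line under 'bbb' stays because it belongs to a different SET CURRENT value.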

The !seen[c,$0]++ works like this: when we use a comma in an array index, the two tokens are combined into a single string joined by the SUBSEP character. In this case the index is the combination of the value of c and the current line ($0), since that is what needs to be unique after the filtering.

With !seen[c,$0] we check whether the combination already exists as an index in the array. If the index is not present, the expression evaluates to true, which results in the line being printed. If the index is present, the expression evaluates to false and the line is not printed. The postfix increment operator counts the occurrences of the index, so the line is printed only at its first occurrence and not for subsequent matches.
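
To see both pieces in isolation, the following sketch may help. It is only an illustration: the fixed key "k", the toy input x y x, and the variable names are assumptions and not part of the original command.

```shell
#!/bin/sh
# Part 1: a comma-separated subscript is stored as one string joined by SUBSEP.
subsep_demo=$(awk 'BEGIN {
    seen["aaa", "line"]++            # stored under "aaa" SUBSEP "line"
    key = "aaa" SUBSEP "line"        # build the same key by hand
    print (key in seen)              # prints 1: both spellings name one entry
}')

# Part 2: trace the test on a toy stream. For every input line, print the
# line, the value of the "first time seen" test, and the counter afterwards.
trace=$(printf '%s\n' x y x | awk '{
    is_first = !seen["k", $0]++      # old count is negated, then incremented
    print $0, is_first, seen["k", $0]
}')

printf '%s\n%s\n' "$subsep_demo" "$trace"
```

The trace should show 1 in the middle column for the first x and for y, and 0 for the second x, whose counter has by then reached 2. Because c is part of the key in the real command, the same CREATE SYN line under a different SET CURRENT value gets its own counter.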
