tyo

Reputation: 127

bash - delete only non-consecutive duplicate lines without changing file order

I have a file that is the output of a compute-intensive process that has experienced some kind of error that creates a large number of duplicate lines. However, some of the duplicates are correct and required to correctly parse the output. I can tell the two apart because these duplicate lines are correct only if they are consecutive and begin with a left curly brace. The order of the file is also important to maintain in order for it to be parsed correctly. Using bash, awk, or other command-line scripting, how do I delete only non-consecutive duplicate lines while retaining the original line order?

Sample "good" duplicates:

...
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
...

Sample "bad" duplicates:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
 [0, ):  {[879]}
 [0, ):  {[879]}

I have tried this solution and seen this one, but these of course delete all duplicate lines and do not preserve duplications of the type that I need to retain. I have also seen a solution that deletes only consecutive duplicate lines, but none that does the inverse.


Sample input:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
 [0, ):  {[879]}
 [0, ):  {[879]}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}

Sample output:

value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
 [0, ):  {[879]}
persistent homology intervals in dim 1:
 [0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
 [0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}

If a line is encountered both in isolation and as part of a consecutive pair of the correct form, then I wish to keep the one that shows up as a pair. However, I don't think it's known which one will come up first in every case.

Thank you so much for your help!

Upvotes: 1

Views: 199

Answers (3)

Renaud Pacalet

Reputation: 29335

The following assumes that what you want is:

  • If we encounter a line for the first time we always print it.
  • If we encounter a line that was already encountered we print it only if it begins with a left curly brace and it is the same as the previous line and the previous line has been printed.
  • We drop any other line.

With any awk, you can try:

awk '/^[{]/ && $0==prev && printed {print; next} {printed=0} !s[$0]++ {print; printed=1} {prev=$0}' file
  • If the current line begins with a left curly brace (/^[{]/), is identical to the previous line ($0==prev), and that previous line was printed (printed is non-zero), print it and move to the next line (print; next).
  • Otherwise, if the current line has not already been seen (!s[$0]++), print it and record that it was printed (printed=1).
  • In every case, remember the current line for the next comparison (prev=$0).
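As a quick sanity check of this logic, here is a made-up miniature input (the names {dup} and {pair} are invented for illustration): {dup} repeats non-adjacently, {pair} repeats adjacently, and only the adjacent pair should survive:

```shell
# made-up mini input: "{dup}" repeats non-adjacently (drop the repeat),
# "{pair}" repeats adjacently (keep both copies)
printf '%s\n' 'x' '{dup}' 'x' '{dup}' '{pair}' '{pair}' |
awk '/^[{]/ && $0==prev && printed {print; next}
     {printed=0}
     !s[$0]++ {print; printed=1}
     {prev=$0}'
# prints: x, {dup}, {pair}, {pair}
```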

Upvotes: 1

tshiono

Reputation: 22062

If I'm understanding the requirement correctly, the logic will be:

  • The lines starting with { should be treated specially:
    • If two (or more?) consecutive lines start with {, those lines are treated as good, overriding the other conditions.
    • If a single (non-consecutive) line starting with { duplicates another line, earlier or later in the file, the line should be dropped.
  • The lines not starting with { can be handled with the usual logic for dropping duplicate lines.

Then two-pass processing will work:

awk '
    NR==FNR {                                                           # pass-1
        if (/^\{/) {                                                    # starting with "{"
            seen1[$0]++                                                 # mark it
            if (prev ~ /^\{/) {good[FNR - 1]++; good[FNR]++}            # treat consecutive lines as "good"
        }
        prev = $0                                                       # remember current line
        next                                                            # skip pass-2
    }
    (/^\{/ && (good[FNR] || !seen1[$0])) || (!/^\{/ && !seen2[$0]++)    # pass-2: print the lines which meet the condition
' file file
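This also covers the edge case raised in the question, where a line occurs both in isolation and as part of a consecutive pair: pass 1 marks only the paired occurrences as good, so the isolated copy is dropped whichever order they appear in. A quick check with a made-up input:

```shell
# made-up input: "{a}" occurs once in isolation, then again as a consecutive pair
printf '%s\n' '{a}' 'x' '{a}' '{a}' > edge.txt
awk '
    NR==FNR {                                                           # pass-1
        if (/^\{/) {
            seen1[$0]++
            if (prev ~ /^\{/) {good[FNR - 1]++; good[FNR]++}
        }
        prev = $0
        next
    }
    (/^\{/ && (good[FNR] || !seen1[$0])) || (!/^\{/ && !seen2[$0]++)    # pass-2
' edge.txt edge.txt
# prints: x, then the "{a}" pair; the isolated "{a}" is dropped
rm edge.txt
```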

Upvotes: 2

Daweo

Reputation: 36735

Using (...) awk (...) how do I delete only non-consecutive duplicate lines while retaining the original line order?

I would harness GNU AWK for this task in the following way. Let file.txt content be

dropthis
keepthis
keepthis
dropthis
single

then

awk 'FNR==NR&&($0 in arr)&&arr[$0]+1!=FNR{exc[arr[$0]];exc[FNR]}FNR==NR{arr[$0]=FNR;next}!(FNR in exc)' file.txt file.txt

gives output

keepthis
keepthis
single

Explanation: this is a so-called 2-pass approach (observe that the file name is given twice). During the 1st pass I identify lines to be excluded; to do so I use 2 arrays named arr and exc. The 1st is used for storing the number of the row where a given line was last seen. If the current line was already seen ($0 in arr) and its position is other than directly after the previous occurrence, I add to exc the number of the row where the line was last seen, and the current row number. After doing that I store the current line in array arr with its row number and instruct GNU AWK to go to the next line, as the remaining pattern is for the 2nd pass. In the 2nd pass I filter lines, printing (the default action) only lines whose number is not present among the exc array keys.

(tested in GNU Awk 5.3.1)


EDIT: legible version of the above script courtesy of gawk -o-:

awk '

    FNR == NR && ($0 in arr) && arr[$0] + 1 != FNR {
        exc[arr[$0]]
        exc[FNR]
    }
    
    FNR == NR {
        arr[$0] = FNR
        next
    }
    
    ! (FNR in exc) {
        print
    }

' file.txt file.txt
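For completeness, the whole example can be reproduced end to end (file name and contents exactly as in the sample above):

```shell
# recreate the sample file.txt and run the one-liner, reading the file twice
printf '%s\n' 'dropthis' 'keepthis' 'keepthis' 'dropthis' 'single' > file.txt
awk 'FNR==NR&&($0 in arr)&&arr[$0]+1!=FNR{exc[arr[$0]];exc[FNR]}FNR==NR{arr[$0]=FNR;next}!(FNR in exc)' file.txt file.txt
# prints: keepthis, keepthis, single
rm file.txt
```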

Upvotes: -1
