Reputation: 127
I have a file that is the output of a compute-intensive process which experienced some kind of error that creates a large number of duplicate lines. However, some of the duplicates are correct and required to correctly parse the output. I can tell the two apart because the duplicate lines are correct only if they are consecutive and begin with a left curly brace. The original line order must also be maintained for the file to be parsed correctly. Using bash, awk, or other command-line scripting, how do I delete only non-consecutive duplicate lines while retaining the original line order?
Sample "good" duplicates:
...
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
...
Sample "bad" duplicates:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
[0, ): {[879]}
[0, ): {[879]}
I have tried this solution and seen this one, but those of course delete all duplicate lines and do not preserve the duplications of the type that I need to retain. I have also seen a solution that deletes only consecutive duplicates, but none that does the inverse. For reference, the standard one-liners along those lines look like this (sketches assuming any awk and GNU coreutils); each fails in the way just described:
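# removes ALL duplicate lines, keeping only the first occurrence
# (this also loses the "good" consecutive pairs I need to retain):
awk '!seen[$0]++' file
# removes only CONSECUTIVE duplicate lines -- the inverse of what I need:
uniq file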
Sample input:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
persistent homology intervals in dim 0:
[0, ): {[879]}
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
Sample output:
value range: [0.665199,0.934318]
distance matrix with 1003 points, using threshold at enclosing radius 0.882441
persistent homology intervals in dim 0:
[0, ): {[879]}
persistent homology intervals in dim 1:
[0.69551,0.75602), indices: 114992-47779123
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381)}
{[680,805] (0.69551), [65,329] (0.702312), [688,711] (0.713381), [31,220] (0.729510)}
[0.799609,0.8016), indices: 254317-53689123
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
{[489,713] (0.799609), [67,489] (0.778011), [686,713] (0.762261), [67,686] (0.735254)}
If a line is encountered both in isolation and as part of a consecutive pair of the correct form, then I wish to keep the one that shows up as a pair. However, I don't think it's known which one will come up first in every case.
Thank you so much for your help!
Upvotes: 1
Views: 199
Reputation: 29335
The following assumes that what you want is to keep consecutive duplicates when they begin with a left curly brace, and to keep only the first occurrence of every other line.
With any awk, you can try:
awk 'NR>1 && $0==p && $0~/^[{]/ {print; next} !s[$0]++ {print} {p=$0}' file
- If a line is not the first one (NR>1), is identical to the previous line ($0==p) and begins with a left curly brace ($0~/^[{]/), print it and move on to the next line (print; next).
- Otherwise, if the line has not been seen before (!s[$0]++), print it (print).
- In every case, remember the current line as the previous one for the next record ({p=$0}).
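A quick sanity check, assuming the sample input is saved as file and the expected output as expected (both names are just placeholders):
awk 'NR>1 && $0==p && $0~/^[{]/ {print; next} !s[$0]++ {print} {p=$0}' file | diff - expected && echo OK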
Upvotes: 1
Reputation: 22062
If I'm understanding the requirement correctly, the logic will be:
- Lines starting with { should be specially treated:
  - If consecutive lines start with {, the lines are treated as "good", overriding other conditions.
  - Otherwise, if a line starting with { duplicates another either backward or forward, the line should be dropped.
- Lines not starting with { can be handled with the common logic to drop duplicated lines.
Then two-pass processing will work:
awk '
NR==FNR { # pass-1
if (/^\{/) { # starting with "{"
seen1[$0]++ # mark it
if (prev ~ /^\{/) {good[FNR - 1]++; good[FNR]++} # treat consecutive lines as "good"
}
prev = $0 # remember current line
next # skip pass-2
}
(/^\{/ && (good[FNR] || !seen1[$0])) || (!/^\{/ && !seen2[$0]++) # pass-2: print the lines which meet the condition
' file file
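If the NR==FNR idiom is unfamiliar: because the file name is given twice, NR (the global record counter) equals FNR (the per-file record counter) only while the first copy is being read, which is what routes every line through the pass-1 block first. A minimal illustration of the mechanism, with f standing in for any file:
awk 'NR==FNR {print "pass-1:", FNR, $0; next} {print "pass-2:", FNR, $0}' f f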
Upvotes: 2
Reputation: 36735
Using (...) awk (...) how do I delete only non-consecutive duplicate lines while retaining the original line order?
I would harness GNU AWK for this task in the following way. Let file.txt content be
dropthis
keepthis
keepthis
dropthis
single
then
awk 'FNR==NR&&($0 in arr)&&arr[$0]+1!=FNR{exc[arr[$0]];exc[FNR]}FNR==NR{arr[$0]=FNR;next}!(FNR in exc)' file.txt file.txt
gives output
keepthis
keepthis
single
Explanation: this is a so-called 2-pass approach (observe that the name of the file is given twice). During the 1st pass I identify the lines to be excluded; in order to do so I use 2 arrays named arr and exc. The 1st is used for storing the number of the row where a given line was last seen. If the current line was already seen ($0 in arr) and its position is other than directly after the previous occurrence (arr[$0]+1!=FNR), I add to exc both the number of the row where the line was last seen and the current row number. After doing that I store the current line in array arr with its corresponding row number and instruct GNU AWK to go to the next line, as the further pattern is for the 2nd pass. In the 2nd pass I filter the lines, printing (the default action) only those whose row number is not present among the exc array keys.
(tested in GNU Awk 5.3.1)
EDIT: a legible version of the above script, courtesy of gawk -o-:
awk '
FNR == NR && ($0 in arr) && arr[$0] + 1 != FNR {
exc[arr[$0]]
exc[FNR]
}
FNR == NR {
arr[$0] = FNR
next
}
! (FNR in exc) {
print
}
' file.txt file.txt
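If you want to see which row numbers end up excluded, a throwaway single-pass variant can dump the exc keys (the END block is my addition for debugging only; for-in key order in awk is unspecified):
awk '($0 in arr) && arr[$0]+1 != FNR {exc[arr[$0]]; exc[FNR]} {arr[$0] = FNR} END {for (k in exc) print "excluded row:", k}' file.txt
For the file.txt sample above this reports rows 1 and 4, i.e. both dropthis lines.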
Upvotes: -1