Reputation: 177
I want to delete the duplicate lines of a text file only if the duplicate items are within 5 lines of each other.
For example :
Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (ie. within 5 lines of the previous instance)
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Overview <- This should not be deleted (ie. not within 5 lines of the previous instance)
I tried to use awk '!a[$0]++', but this deletes all the duplicate lines in the entire file. I also tried a loop with sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++', but this actually creates new duplicates...
What other approaches can I try to remove the duplicate lines that are within 5 lines of each other?
Upvotes: 11
Views: 1264
Reputation:
awk '{f=0; for (i in a) if (a[i]==$0) {f=1; break} a[NR%5]=$0} !f' file
This keeps an array containing the previous 5 lines. The current line is only printed if it is not found in the array.
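To see the rolling window in action, here is a hedged sketch: it builds a hypothetical file sample.txt modeled on the question's example (the "Filler" lines are stand-ins for the "many lines in between", not from the original post) and runs the command on it.

```shell
# Build a sample input modeled on the question; "Filler" lines stand in
# for the "many lines in between".
cat > sample.txt <<'EOF'
Chapter 1.1
Overview
Figure 1
Figure 2
Overview
Figure 3
Figure 4
Filler 1
Filler 2
Chapter 1.2
Overview
EOF

# Keep the 5 most recent lines in a (indexed NR%5) and print the current
# line only if it is not among them.
awk '{f=0; for (i in a) if (a[i]==$0) {f=1; break} a[NR%5]=$0} !f' sample.txt
```

The second Overview (3 lines after the first) is dropped; the final one, more than 5 lines after the previous occurrence, survives.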
Upvotes: 2
Reputation: 785481
You may use this shorter awk command:
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file
Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Figure 1
Figure 2
Overview
Algorithm Details:
!NF || NR > rec[$0]; : Print the current line if it has no fields (i.e. is empty) OR if the current record number is greater than the value stored for it in the array rec. When $0 doesn't exist in rec, rec[$0] evaluates to 0, so the line is printed as well. A line is suppressed only when we are within 5 lines of the value stored in rec.
{rec[$0] = NR+5} : Save each record in the array rec with the current line number + 5 as its value.
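As a hedged sanity check, the command can be run on a hypothetical file modeled on the question (the "Filler" lines are stand-ins for the "many lines in between"). Note that rec[$0] is updated on every line, including suppressed duplicates, so the 5-line window restarts from the most recent occurrence, printed or not.

```shell
# Build a sample input modeled on the question
printf '%s\n' 'Chapter 1.1' Overview 'Figure 1' 'Figure 2' Overview \
  'Figure 3' 'Figure 4' 'Filler 1' 'Filler 2' 'Chapter 1.2' Overview > sample.txt

# Print a line if it is empty, or if its last occurrence is more than
# 5 lines back; record NR+5 as the suppression horizon for every line.
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' sample.txt
```

Only the second Overview (3 lines after the first) is suppressed; the final Overview, 6 lines after the previous occurrence, is printed.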
Upvotes: 14
Reputation:
One approach you could try is keeping the preceding four lines in a buffer variable and searching for the current line in it. Something like:
awk '{
idx = index(buf, $0 "\n")
if (!idx)
print
else if (idx != 1 && substr(buf, idx - 1, 1) != "\n")
print
if (NR > 4)
sub(/[^\n]*\n/, "", buf)
buf = buf $0 "\n"
}' file
For something less wobbly and cumbersome, see this answer.
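A hedged example run, on a hypothetical input modeled on the question (the "Filler" lines are stand-ins for the "many lines in between"):

```shell
# Pipe a sample modeled on the question through the buffer-based filter:
# buf holds the 4 preceding lines; a line is suppressed only when it
# occurs in buf as a complete line (preceded by start-of-buffer or "\n").
printf '%s\n' 'Chapter 1.1' Overview 'Figure 1' 'Figure 2' Overview \
  'Figure 3' 'Figure 4' 'Filler 1' 'Filler 2' 'Chapter 1.2' Overview |
awk '{
    idx = index(buf, $0 "\n")
    if (!idx)
        print
    else if (idx != 1 && substr(buf, idx - 1, 1) != "\n")
        print
    if (NR > 4)
        sub(/[^\n]*\n/, "", buf)
    buf = buf $0 "\n"
}'
```

The second Overview is found in the buffer and suppressed; the final Overview is not, and is printed.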
Upvotes: 2
Reputation: 133600
1st solution: A single pass over Input_file.
awk '
{
arr[FNR]=$0
}
END{
for(i=1;i<=FNR;i++){
count=0
for(j=i;j>=(i-5);j--){
if(arr[i]!=arr[j]){ count++ }
}
if(count==5) { print arr[i] }
}
}
' Input_file
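A hedged example run of the 1st solution on a hypothetical Input_file modeled on the question (the "Filler" lines are stand-ins for the "many lines in between"):

```shell
# Build a sample Input_file modeled on the question
printf '%s\n' 'Chapter 1.1' Overview 'Figure 1' 'Figure 2' Overview \
  'Figure 3' 'Figure 4' 'Filler 1' 'Filler 2' 'Chapter 1.2' Overview > Input_file

# Buffer the whole file, then print each line only if it differs from
# all 5 preceding lines (count==5; the j==i self-comparison never counts).
awk '
{
  arr[FNR]=$0
}
END{
  for(i=1;i<=FNR;i++){
    count=0
    for(j=i;j>=(i-5);j--){
      if(arr[i]!=arr[j]){ count++ }
    }
    if(count==5) { print arr[i] }
  }
}
' Input_file
```

Only the second Overview, which matches a line within the previous 5, is suppressed.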
2nd solution: With your shown samples and 2 passes of the Input_file, one could also try the following. Fair warning: it could be slow if the dataset is huge.
awk '
FNR==NR{
arr[FNR]=$0
next
}
{
count=0
for(i=FNR;i>=(FNR-5);i--){
if($0!=arr[i]){ count++ }
}
if(count==5) { print }
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be true 1st time Input_file is being read.
arr[FNR]=$0 ##Creating arr with index of current line number and value is current line.
next ##next will skip all further statements from here.
}
{
count=0 ##Resetting count here.
for(i=FNR;i>=(FNR-5);i--){ ##Looping from the current line back over the previous 5 lines.
if($0!=arr[i]){ count++ } ##If the current line differs from the stored line, increase count by 1.
}
if(count==5) { print } ##If all 5 previous lines differ (count is 5), print the line.
}
' Input_file Input_file ##Mentioning Input_file names here.
3rd solution:
awk '!arr[$0]++;++count==5{delete arr;count=0}' Input_file
NOTE: The 1st and 2nd solutions assume one wants to compare each line with its next 5 lines (e.g. 1-6, 2-7, and so on), while the 3rd solution assumes one wants to remove duplicates within each block of 5 lines (e.g. 1-5, 6-10, and so on).
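The difference is easy to miss, so here is a hedged illustration on a made-up input where two identical adjacent lines straddle a block boundary:

```shell
# With the 3rd solution, arr is cleared every 5 lines, so duplicates
# that fall in different 5-line blocks are NOT removed, even if adjacent.
printf '%s\n' a b c d x x |
awk '!arr[$0]++;++count==5{delete arr;count=0}'
```

Both x lines are printed: the first x ends the 1st block (arr is cleared), so the second x starts the next block with an empty array.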
Upvotes: 4
Reputation: 4667
I recommend you use an array of 5 flags, initialized to (0, 0, 0, 0, 0).
In a loop, you read the file line by line, and:
- shift the flags: a[0] = a[1], a[1] = a[2], ..., a[3] = a[4]
- set the last flag (a[4]) to 1 if the line matches the string (e.g. with grep), or to 0 otherwise
- if the line matches and one of the earlier flags is also 1, you skip it; otherwise you print it
Upvotes: 2
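A minimal awk sketch of this flag-array idea, assuming a single target string "Overview" and a window of the 4 preceding lines (both are illustrative choices, not from the original answer):

```shell
printf '%s\n' Overview 'Figure 1' Overview A B C D Overview |
awk -v target='Overview' '{
    for (i = 0; i < 4; i++) a[i] = a[i + 1]     # shift: drop the oldest flag
    a[4] = ($0 == target ? 1 : 0)               # flag whether this line matches
    hit = 0
    for (i = 0; i < 4; i++) if (a[i]) hit = 1   # target seen in the last 4 lines?
    if (!(a[4] && hit)) print                   # skip only a repeated match
}'
```

The second Overview (2 lines after the first) is skipped; the last one, 5 lines after the previous occurrence, falls outside the 4-line window and is printed.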