Kspael

Reputation: 177

Remove duplicate lines only if the duplicate items are within 5 lines of each other

I want to delete the duplicate lines of a text file only if the duplicate items are within 5 lines of each other.

For example:

Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (i.e. within 5 lines of the previous instance)
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Overview <- This should not be deleted (i.e. not within 5 lines of the previous instance)

I tried awk '!a[$0]++', but this deletes every duplicate line across the entire file. I also tried a loop with sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++', but this actually creates new duplicates...

What other approaches can I try to remove duplicate lines that are within 5 lines of each other?

Upvotes: 11

Views: 1264

Answers (5)

user14473238

Reputation:

awk '{f=0; for (i in a) if (a[i]==$0) {f=1; break} a[NR%5]=$0} !f' file

This keeps an array containing the previous 5 lines. The current line is only printed if it is not found in the array.
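
As a minimal sketch of the same ring-buffer idea with the window size passed in (the function name dedupe_window and the variable n are hypothetical, not part of the answer above):

```shell
# dedupe_window N: read stdin and drop a line when an identical line
# appears among the previous N input lines. N must be at least 1.
dedupe_window() {
  awk -v n="$1" '
  {
    found = 0
    for (i in recent)
      if (recent[i] == $0) { found = 1; break }
    recent[NR % n] = $0   # overwrite the oldest of the n slots
  }
  !found'
}

# The second "Overview" is 3 lines after the first, so it is dropped.
printf '%s\n' Overview 'Figure 1' 'Figure 2' Overview 'Figure 3' | dedupe_window 5
```

Dropped duplicates are still stored in the buffer, so a repeated line keeps being suppressed as long as each repeat falls within n lines of the previous one.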

Upvotes: 2

anubhava

Reputation: 785481

You may use this shorter awk command:

awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file

Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Figure 1
Figure 2
Overview

Algorithm Details:

  • !NF || NR > rec[$0];: Print the record if the current line is empty OR if the current record number is greater than the value stored in array rec for that line. A line that is not yet in rec is also printed, because the missing entry compares as 0. A line is suppressed only when it falls within 5 lines of its previous occurrence.
  • {rec[$0] = NR+5}: Store each line in array rec, with the current line number + 5 as its value.
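
To see how the stored value behaves on a chain of repeats, here is a small sketch that wraps the same one-liner in a shell function with the distance passed in (dedupe_expiry and n are hypothetical names chosen for this illustration):

```shell
# dedupe_expiry N: same logic as the one-liner, with 5 replaced by N.
dedupe_expiry() {
  awk -v n="$1" '!NF || NR > rec[$0]; { rec[$0] = NR + n }'
}

# "A" appears on lines 1, 5 and 9. Line 5 is within 5 lines of line 1,
# and line 9 is within 5 lines of line 5, so both repeats are dropped.
printf '%s\n' A x y z A p q r A | dedupe_expiry 5
```

Because rec[$0] is refreshed on every occurrence, including suppressed ones, each repeat is measured against the most recent occurrence rather than the last printed one.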

Upvotes: 14

anon

Reputation:

One approach you could try is keeping the preceding four lines in a variable and searching for the current line in it. Something like:

awk '{
  idx = index(buf, $0 "\n")
  if (!idx)
    print
  else if (idx != 1 && substr(buf, idx - 1, 1) != "\n")
    print
  if (NR > 4)
    sub(/[^\n]*\n/, "", buf)
  buf = buf $0 "\n"
}' file

For something less wobbly and cumbersome, see this answer.

Upvotes: 2

RavinderSingh13

Reputation: 133600

1st solution: A single Input_file pass solution.

awk '
{
  arr[FNR]=$0
}
END{
  for(i=1;i<=FNR;i++){
    count=0
    for(j=i;j>=(i-5);j--){
      if(arr[i]!=arr[j]){ count++      }
    }
    if(count==5)        { print arr[i] }
  }
}
'  Input_file


2nd solution: With your shown samples, one could also try the following, which makes 2 passes over Input_file. Fair warning: it could be slow if the dataset is huge.

awk '
FNR==NR{
  arr[FNR]=$0
  next
}
{
  count=0
  for(i=FNR;i>=(FNR-5);i--){
    if($0!=arr[i]){ count++ }
  }
  if(count==5)    { print   }
}
' Input_file Input_file

Explanation: A detailed explanation of the 2nd solution above.

awk '                            ##Starting awk program from here.
FNR==NR{                         ##Checking condition which will be true 1st time Input_file is being read.
  arr[FNR]=$0                    ##Creating arr with index of current line number and value is current line.
  next                           ##next will skip all further statements from here.
}
{
  count=0                        ##Nullifying count here.
  for(i=FNR;i>=(FNR-5);i--){     ##Looping over the current line and the 5 lines before it.
    if($0!=arr[i]){ count++ }    ##Checking condition if current line is not equal to array value then increase count with 1 here.
  }
  if(count==5)    { print   }    ##Checking condition if count is 5 then print line.
}
' Input_file Input_file          ##Mentioning Input_file names here.


3rd solution:

awk '!arr[$0]++;++count==5{delete arr;count=0}' Input_file

NOTE: The 1st and 2nd solutions compare each line with the 5 lines before it, i.e. a sliding window (lines 1-6, 2-7 and so on). The 3rd solution instead removes duplicates within each fixed set of 5 lines (lines 1-5, 6-10 and so on).
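
The difference shows up when a duplicate straddles a block boundary. The sketch below puts the same line at positions 5 and 6. The function names are hypothetical, and dedupe_sliding is a single-pass sliding check written only for this comparison, not code taken from the solutions above:

```shell
# Fixed blocks of 5 lines, as in the 3rd solution.
dedupe_blocks() {
  awk '!arr[$0]++; ++count==5 { delete arr; count=0 }'
}

# Sliding window: drop a line seen within the previous 5 lines.
dedupe_sliding() {
  awk '!($0 in seen) || NR - seen[$0] > 5; { seen[$0] = NR }'
}

# Lines 5 and 6 are both "X": they fall into different blocks,
# but they are only 1 line apart.
printf '%s\n' 1 2 3 4 X X 7 | dedupe_blocks    # keeps both X lines
printf '%s\n' 1 2 3 4 X X 7 | dedupe_sliding   # drops the second X
```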

Upvotes: 4

jurez

Reputation: 4667

I recommend you use an array of 5 variables, initialized to (0, 0, 0, 0, 0).

In a loop, you read the file line by line, and:

  • if the line matches the string (e.g. with grep) and the array already contains a 1, you skip the line, otherwise you print it
  • you shift the array by one position, e.g. a[0] = a[1], a[1] = a[2], ..., a[3] = a[4]
  • you set the last element (a[4]) to 1 if the line matches the string, or to 0 otherwise
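
A rough sketch of this flag-array idea in awk, for one fixed target string. The function name drop_near_repeats and the variable target are hypothetical, and the sketch assumes the flags are checked before shifting, so that the current line is compared only against the previous five:

```shell
# drop_near_repeats STRING: drop a line equal to STRING when STRING
# also appeared on one of the previous 5 lines.
drop_near_repeats() {
  awk -v target="$1" '
  BEGIN { for (i = 0; i < 5; i++) a[i] = 0 }   # (0, 0, 0, 0, 0)
  {
    hit = ($0 "" == target)                    # force a string comparison
    seen = 0
    for (i = 0; i < 5; i++) if (a[i]) seen = 1
    if (!(hit && seen)) print                  # skip only near repeats of target
    for (i = 0; i < 4; i++) a[i] = a[i + 1]    # shift by one position
    a[4] = hit                                 # flag for the current line
  }'
}

printf '%s\n' Overview f1 f2 Overview f3 | drop_near_repeats Overview
```

This tracks a single string only. Handling every possible duplicate line this way would need one flag array per distinct line, which is why the other answers store the lines or line numbers instead.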

Upvotes: 2
