Саша Черных

Reputation: 2813

Delete n duplicate lines in a file

1. Briefly

I have a large text file (14 MB). I need to remove text blocks whose lines duplicate a block that already appeared earlier in the file.

It would be nice if this could be done with any gratis method.

I use Windows, but Cygwin solutions would also be fine.


2. Settings

1. File structure

I have a file test1.md. It consists of repeating blocks, each 10 lines long. The structure of the file (expressed as PCRE regular expressions):

Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*

test1.md contains no lines or text other than these 10-line blocks. It has no blank lines and no blocks with more or fewer than 10 lines.
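That structure can be sanity-checked quickly (my own sketch, assuming Millionaire appears only as a block header and never as an answer line):

```shell
# every block starts with "Millionaire", so the line count
# must be exactly 10 times the number of block headers
blocks=$(grep -c '^Millionaire$' test1.md)
lines=$(wc -l < test1.md)
[ "$lines" -eq $((10 * blocks)) ] && echo "structure OK"
```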

2. Example content of file

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

As can be seen in the example, test1.md has repeated 7-line blocks. In the example, these blocks are:

Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion

and

Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

3. Expected behavior

I need to remove all repeated blocks. For my example, I need to get:

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
  1. If 7 lines duplicate 7 lines that were already used in the file, the duplicate 7 lines are removed.
  2. If 1 line (or 2–4 lines) duplicates a line that was already used in the file, that line is not removed. In the example, the words Sasha, Kazan, Chistopol and Katya are duplicated, but these words are not removed.
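The two rules above can be sketched in awk (my sketch, not from any of the answers; it assumes the strict 10-line structure and keys deduplication only on lines 4–10 of each block, always keeping the three header lines):

```shell
awk '{
    n = (NR - 1) % 10 + 1            # position inside the current 10-line block
    if (n <= 3) head = head $0 "\n"  # lines 1-3: Millionaire, id, QUESTION|...
    else        body = body $0 "\n"  # lines 4-10: the 7 lines that define a duplicate
    if (n == 10) {
        printf "%s", head            # header lines are always kept
        if (!(body in seen)) {       # the 7-line body is printed only on first occurrence
            seen[body] = 1
            printf "%s", body
        }
        head = body = ""
    }
}' test1.md
```

This matches the expected output above: for each duplicate block, only the 7 question/answer/author lines are dropped, while the Millionaire/id/QUESTION header lines stay.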

4. Did not help

  1. Googling
  2. I found that the Unix commands sort, sed and awk can solve similar tasks, but I couldn't find how to solve my task using these commands.

5. Do not offer

  1. Please do not suggest removing each text block manually. I may have up to a few thousand different duplicate text blocks, and removing all of them by hand would take a lot of time.

Upvotes: 1

Views: 297

Answers (5)

randomir

Reputation: 18697

Here's a simple solution to your problem (if you have access to GNU sed, sort and uniq):

sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'

A little explanation is in order:

  • since all your blocks begin with the word/line Millionaire, we can use that to split the file into (variably long) blocks by prepending a NUL character to each Millionaire;
  • then we sort those NUL-separated blocks (using the -z flag), ignoring the first 3 fields (in this case the lines: Millionaire, \d+, QUESTION|ID...) by giving the -k/--key option a start position of field 4 (in your case, line 4) and letting the key run to the end of the block;
  • after sorting, we can filter-out the duplicates with uniq, again using the NUL delimiter instead of newline (-z), and ignoring the first 3 fields (with -f/--skip-fields);
  • finally, we remove NUL delimiters with tr.

In general, a solution like this for removing duplicate blocks should work whenever there's a way to split the file into blocks. Note that block equality can be defined on a subset of fields (as we did above).
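The GNU-specific -z/-f/-k behavior can be seen on a toy example (my sketch, assuming GNU coreutils; records are NUL-separated, and the first field is ignored for both sorting and deduplication):

```shell
# three NUL-terminated records: "x b", "x a", "x a";
# sort by field 2, then drop records whose fields after the first repeat
printf 'x b\0x a\0x a\0' | sort -z -k2 | uniq -z -f1 | tr '\0' '\n'
```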

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203635

Your requirements aren't clear wrt what to do with overlapping blocks of 5 lines, how to deal with blocks of fewer than 5 lines at the end of the input, and various other edge cases, so here's one way to identify the blocks of 5 (or fewer) lines that are duplicated:

$ cat tst.awk
{
    for (i=1; i<=5; i++) {
        blockNr = NR - i + 1
        if ( blockNr > 0 ) {
            blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
        }
    }
}
END {
    for (blockNr=1; blockNr in blocks; blockNr++) {
        block = blocks[blockNr]
        print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
        print block
    }
}


$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol

and you can build on that to:

  1. print the lines that haven't already been printed from within each ORIG block by using their blockNr plus the current line number in that block (hint: split(block,lines,RS)), and
  2. figure out how to deal with your unspecified requirements.

Upvotes: 1

CWLiu
CWLiu

Reputation: 4043

Here's a sed+awk method that can meet your requirement: GNU sed's 0~5 step address matches every 5th line and appends a blank line after it, splitting the file into 5-line paragraphs; awk in paragraph mode (RS=) then prints each paragraph only the first time it is seen.

$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya
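A toy run of the same idea with 2-line blocks (my example, again assuming GNU sed for the 0~2 step address):

```shell
# blocks: (a b), (a b), (c d) -> the repeated (a b) block is dropped
printf 'a\nb\na\nb\nc\nd\n' | sed '0~2 s/$/\n/' | awk -v RS= '!($0 in a){a[$0];print}'
```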

Upvotes: 1

Oasa
Oasa

Reputation: 417

Please find below code for Windows PowerShell. The code is not in any way optimized. Please change test.txt in the code below to your file name, and make sure the working directory is the one that contains the file. The output is a CSV file that you can open in Excel, sort in order, and then delete the first column to remove the index. I have no idea where those indexes come from or how to get rid of them. This was my first attempt with Windows PowerShell, and I could not find syntax to declare a string array with a fixed size. Nevertheless, it works.

$d = Get-Content test.txt
$chk = @{}
$tot = $d.Count
$unique = @{}
$g = 0

for ($i = 0; $i -lt $tot; ) {
    $isunique = 1

    # read the next block of 5 lines
    $chk[0] = $d[$i]
    $chk[1] = $d[$i+1]
    $chk[2] = $d[$i+2]
    $chk[3] = $d[$i+3]
    $chk[4] = $d[$i+4]
    $i = $i + 5

    # compare it with every 5-line block already stored in $unique
    for ($j = 0; $j -lt $unique.Count; ) {
        if ($unique[$j] -eq $chk[0]) {
            if ($unique[$j+1] -eq $chk[1]) {
                if ($unique[$j+2] -eq $chk[2]) {
                    if ($unique[$j+3] -eq $chk[3]) {
                        if ($unique[$j+4] -eq $chk[4]) {
                            $isunique = 0
                            break
                        }
                    }
                }
            }
        }
        $j = $j + 5
    }

    # unseen block: append its 5 lines to $unique
    if ($isunique) {
        $unique[$g]   = $chk[0]
        $unique[$g+1] = $chk[1]
        $unique[$g+2] = $chk[2]
        $unique[$g+3] = $chk[3]
        $unique[$g+4] = $chk[4]
        $g = $g + 5
    }
}

$unique | Out-File test2.csv

Screenshot: https://i.sstatic.net/6364T.jpg

People with PowerShell experience, please optimize the code. I tried .Contains, .Add, etc., but did not get the desired result. Hope it helped.

Upvotes: 1

Keith Hall
Keith Hall

Reputation: 16075

You can use Sublime Text's find and replace feature with the following regex:

  • Replace What: \A(?1)*?((^.*$\n){5})(?1)*?\K\1+
  • Replace With:

(i.e. replace with nothing)

This will find a block of 5 lines that exists later on in the document, and remove that duplicate/second occurrence of those 5 lines (and any immediately adjacent to it), leaving the others (i.e. the original 5 lines that are duplicates, and all other lines) untouched.

Unfortunately, due to the nature of the regex, you will need to perform this operation multiple times to remove all duplicates. It may be easier to keep invoking "Replace" than "Replace All" and having to re-open the panel each time. (Somehow the \K works as expected here, despite a report of it not working with "Replace".)

(Screenshot: ST find and replace duplicate 5 lines)
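Outside Sublime Text, the same regex can be run with Perl, whose engine also supports the (?1) recursion and \K (a sketch from me, not part of the answer; each pass removes one run of duplicate blocks, so it is looped until the file stops shrinking):

```shell
# repeatedly remove a later occurrence of any duplicated 5-line block
while :; do
    before=$(wc -c < test1.md)
    perl -0777 -i -pe 's/\A(?1)*?((^.*$\n){5})(?1)*?\K\1+//m' test1.md
    [ "$(wc -c < test1.md)" -eq "$before" ] && break
done
```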

Upvotes: 1
