Reputation: 2813
I have a large text file (14 MB). I need to remove text blocks containing 5 duplicate lines from the file. It would be nice if this were possible with some free (gratis) method. I use Windows, but Cygwin solutions would also be fine.
I have a file test1.md. It consists of repeating blocks; each block has 10 lines. The structure of the file (described with PCRE regular expressions) is:
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
test1.md doesn't have any other lines or text besides these 10-line blocks. It has no blank lines and no blocks with more or fewer than 10 lines.
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
Millionaire
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
As can be seen in the example, test1.md has repeated 7-line blocks. In this example, these blocks are:
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
and
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
I need to remove all repeated blocks. In my example, I need to get:
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Sasha, Kazan, Chistopol and Katya are duplicated, but these single words should not be removed. sort, sed and awk can solve similar tasks, but I can't work out how to solve my task using these commands.
Upvotes: 1
Views: 297
Reputation: 18697
Here's a simple solution to your problem (if you have access to GNU sed, sort and uniq):
sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'
A little explanation is in order:
- since each block begins with Millionaire, we can use that to split the file into (variably long) blocks by prepending a NUL character to each Millionaire;
- then we sort those NUL-separated blocks (using the -z flag), but ignoring the first 3 fields (in this case lines: Millionaire, \d+, QUESTION|ID...), using the -k/--key option with the start position being field 4 (in your case line 4) and the stop position being the end of the block;
- then we remove duplicate blocks with uniq, again using the NUL delimiter instead of newline (-z) and ignoring the first 3 fields (with -f/--skip-fields);
- finally, we delete the NUL delimiters with tr.
In general, a solution for removing duplicate blocks like this should work whenever there's a way to split the file into blocks. Note that block equality can be defined on a subset of the fields (as we did above).
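For example, to apply it to the file from the question and save the result (test1.dedup.md is just an assumed output name):
$ sed 's/^Millionaire/\x0&/' test1.md | sort -z -k4 | uniq -z -f3 | tr -d '\000' > test1.dedup.md
Keep in mind that sort reorders the blocks, so the surviving blocks will generally not be in their original order.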
Upvotes: 2
Reputation: 203635
Your requirements aren't clear with respect to what to do with overlapping blocks of 5 lines, how to deal with blocks of fewer than 5 lines at the end of the input, and various other edge cases, so here's one way to identify the blocks of 5 (or fewer) lines that are duplicated:
$ cat tst.awk
{
    # add the current line to every (up to 5) overlapping 5-line block it belongs to
    for (i=1; i<=5; i++) {
        blockNr = NR - i + 1
        if ( blockNr > 0 ) {
            blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
        }
    }
}
END {
    # print every block, flagging any block that has already been seen as a DUP
    for (blockNr=1; blockNr in blocks; blockNr++) {
        block = blocks[blockNr]
        print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
        print block
    }
}
$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol
and you can build on that to split each block back into its individual lines (using split(block,lines,RS)) and then decide which of the duplicated blocks you actually want to remove; a minimal sketch of one such extension follows below.
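As a minimal sketch of that idea (assuming the simplest reading of the requirements: the file is treated as consecutive, non-overlapping 5-line blocks and only the first occurrence of each block is kept; dedup5.awk is just an assumed script name):
$ cat dedup5.awk
{
    # accumulate the current 5-line block
    block = (NR % 5 == 1 ? "" : block RS) $0
    if (NR % 5 == 0) {
        # print the block only the first time it is seen
        if (!seen[block]++) {
            n = split(block, lines, RS)   # split the block back into lines
            for (i = 1; i <= n; i++) print lines[i]
        }
        block = ""
    }
}
END { if (block != "") print block }      # flush a trailing partial block, if any
$ awk -f dedup5.awk file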
Upvotes: 1
Reputation: 4043
Here's an awk + sed method that can meet your requirement.
$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya
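To unpack that pipeline a little: 0~5 is a GNU sed address meaning every 5th line, so the sed stage appends a blank line after each group of 5 lines; the awk stage then reads in paragraph mode (RS=) and prints only the first occurrence of each blank-line-separated block. To write the result to a file (deduped.txt is just an assumed name):
$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}' > deduped.txt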
Upvotes: 1
Reputation: 417
Please find below code for Windows PowerShell. The code is not in any way optimized. Please change test.txt in the code below to your file name, and make sure the working directory is the one that contains it. The output is a CSV file that you can open in Excel, sort in order, and delete the first column to remove the index. I have no idea why that index appears or how to get rid of it. It was my first attempt with Windows PowerShell and I could not find the syntax to declare a string array with a fixed size. Nevertheless, it works.
$d=Get-Content test.txt
$chk=@{};
$tot=$d.Count
$unique=@{}
$g=0;
$isunique=1;
for ($i=0; $i -lt $tot) {
    $isunique=1;
    # read the next block of 5 lines
    $chk[0]=$d[$i]
    $chk[1]=$d[$i+1]
    $chk[2]=$d[$i+2]
    $chk[3]=$d[$i+3]
    $chk[4]=$d[$i+4]
    $i=$i+5
    # compare this block against every block already stored in $unique
    for ($j=0; $j -lt $unique.count) {
        if ($unique[$j] -eq $chk[0]) {
            if ($unique[$j+1] -eq $chk[1]) {
                if ($unique[$j+2] -eq $chk[2]) {
                    if ($unique[$j+3] -eq $chk[3]) {
                        if ($unique[$j+4] -eq $chk[4]) {
                            $isunique=0
                            break
                        }
                    }
                }
            }
        }
        $j=$j+5
    }
    # keep the block only if it has not been seen before
    if ($isunique) {
        $unique[$g]=$chk[0]
        $unique[$g+1]=$chk[1]
        $unique[$g+2]=$chk[2]
        $unique[$g+3]=$chk[3]
        $unique[$g+4]=$chk[4]
        $g=$g+5;
    }
}
$unique | out-file test2.csv
Screenshot: https://i.sstatic.net/6364T.jpg
People with PowerShell experience, please optimize the code. I tried .Contains, .Add, etc. but did not get the desired result. Hope it helped.
Upvotes: 1
Reputation: 16075
You can use Sublime Text's find and replace feature with the following regex:
\A(?1)*?((^.*$\n){5})(?1)*?\K\1+
(i.e. replace with nothing)
This will find a block of 5 lines that exists later on in the document, and remove that duplicate/second occurrence of those 5 lines (and any immediately adjacent to it), leaving the others (i.e. the original 5 lines that are duplicates, and all other lines) untouched.
Unfortunately, due to the nature of the regex, you will need to perform this operation multiple times to remove all duplicates. It may be easier to keep invoking "Replace" than to use "Replace All" and have to re-open the panel each time. (Somehow the \K works as expected here, despite a report of it not working with "Replace".)
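If you'd rather not repeat the replacement by hand, roughly the same regex can be looped from the command line with Perl, which also supports (?1) and \K. A minimal sketch, assuming the file is named test1.md and the result should go to deduped.md (the $ before \n in the original regex is redundant and dropped here):
$ perl -0777 -pe '1 while s/\A(?1)*?((^.*\n){5})(?1)*?\K\1+//m' test1.md > deduped.md
The 1 while ... loop simply reapplies the substitution until no duplicated 5-line block remains.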
Upvotes: 1