Reputation: 180
I have a text file that is over 50GB. It contains many lines, each around 15 characters on average. I want every line to be unique (case-sensitive): if a line is exactly the same as an earlier one, it must be removed, without changing the order of the remaining lines or sorting the file in any way.
My question is different from others because I have a huge file that the solutions I searched for cannot handle.
I have tried:
awk !seen[$0]++ bigtextfile.txt > duplicatesremoved.txt
It starts out nice and fast, but very soon I get the following error:
awk: (FILENAME=bigtextfile.txt FNR=19083509) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Not enough space)
The above error appears when the output file is about 200MB.
Is there any other fast way I can do the same thing on Windows?
Upvotes: 0
Views: 1727
Reputation: 203674
You could do this on a UNIX box or Cygwin on top of Windows:
$ cat file
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Speed, bonnie boat, like a bird on the wing,
Thunderclaps rend the air;
Onward! the sailors cry;
Baffled, our foes stand by the shore,
Carry the lad that's born to be King
Follow they will not dare.
Over the sea to Skye.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
The only command above trying to process the whole file at once is sort, and sort is designed to use paging, temporary files, etc. to handle exactly that for large files (see https://unix.stackexchange.com/q/279096/133219), so IMHO it's your best shot at being able to do this.
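If sort still runs short of space or temp room on your box, GNU sort has knobs for both; this is only a sketch on a tiny stand-in file, and the -S/-T values and paths are examples of mine, not part of the original pipeline:

```shell
# Sketch: the same dedup pipeline with GNU sort's memory and temp-file
# knobs made explicit. -S caps the in-RAM buffer; -T points the merge's
# temporary files at a disk with enough free space (values are examples).
printf 'b\na\nb\nc\na\n' > sample.txt   # tiny stand-in for the 50GB file
cat -n sample.txt |
  sort -k2 -u -S 64M -T /tmp |          # dedupe by content (fields 2..end)
  sort -n -S 64M -T /tmp |              # restore original line order
  cut -f2- > deduped.txt
cat deduped.txt                         # b, a, c: first occurrences, original order
```

GNU sort -u keeps the first of each run of equal-keyed lines, which is why the first occurrence of each duplicate survives.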
Start with cat -n file and then add each command to the pipeline one at a time to see what it's doing (see below). In short: it first adds line numbers so we can sort uniquely by content to get the unique values, then sorts numerically by the original line numbers to get the original line order back, and finally removes the line numbers we added in the first step:
$ cat -n file
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
7 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
9 Onward! the sailors cry;
10 Baffled, our foes stand by the shore,
11 Carry the lad that's born to be King
12 Follow they will not dare.
13 Over the sea to Skye.
14
$ cat -n file | sort -k2 -u
5
10 Baffled, our foes stand by the shore,
3 Carry the lad that's born to be King
12 Follow they will not dare.
6 Loud the winds howl, loud the waves roar,
2 Onward! the sailors cry;
4 Over the sea to Skye.
1 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
$ cat -n file | sort -k2 -u | sort -n
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
8 Thunderclaps rend the air;
10 Baffled, our foes stand by the shore,
12 Follow they will not dare.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
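Putting it together for a file like the one in the question (the LC_ALL=C setting and the output filename are my additions, not part of the answer above; C-locale comparison is byte-wise, which is usually much faster than locale-aware collation and is still exact and case-sensitive as required):

```shell
# Stand-in data; the real input is the ~50GB bigtextfile.txt from the question.
printf 'x\ny\nx\nz\n' > bigtextfile.txt
# LC_ALL=C is an assumption of mine: raw byte comparison, typically much
# faster for huge inputs, and still an exact case-sensitive match.
LC_ALL=C cat -n bigtextfile.txt |
  LC_ALL=C sort -k2 -u |
  LC_ALL=C sort -n |
  cut -f2- > out.txt
cat out.txt   # x, y, z
```

If the disk holding $TMPDIR is small, combine this with sort's -T option to put the temporary merge files somewhere with room to spare.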
Upvotes: 6