c.a.p.

Reputation: 21

Fast removing duplicate rows between multiple files

I have 10k files with 80k rows each that I need to compare, and I have to either delete the duplicate lines or replace them with "0". It needs to be ultrafast, since I have to do it 1000+ times.

The following script is fast enough for files with fewer than 100 rows. It is tcsh:

# tcsh: split every file into one-line pieces
foreach file ( `ls -1 *` )
split -l 1 ${file} ${file}.
end
# hash every piece; any piece whose sha512 was already seen gets overwritten with "rowzero"
find *.* -type f -print0 | xargs -0 sha512sum | awk '($1 in aa){print $2} !($1 in aa){aa[$1]=$2}' | xargs -I {} cp rowzero {}
# reassemble each original file (assuming original names contain no dot) from its pieces
foreach file ( `ls -1 * | grep -v '\.'` )
cat ${file}.* > ${file}.filtered
end

where "rowzero" is just a file with a... zero. I have tried python but haven't found a fast way. I have tried pasting them and doing all nice fast things (awk, sed, above commands, etc.) but the i/o slows to incredible levels when the file has over more than e.g. 1000 columns. I need help, thanks a million hours!.

Upvotes: 1

Views: 328

Answers (1)

c.a.p.

Reputation: 21

OK, this is so far the fastest code I could come up with; it works on transposed and concatenated ("cat") input. As explained above, appending input with "cat" / ">>" works fine, but "paste" or "pr" become a nightmare when pasting another column into, say, 1GB+ files, which is why we need to transpose. E.g. each original file looks like

1 
2 
3 
4 

... so if we transpose and cat the first file together with the others, the input for the code will look like:

1 2 3 4 .. 
1 1 2 4 .. 
1 1 1 4 .. 

The code returns the original (i.e. retransposed and pasted) format, with the minor detail that the rows come out shuffled:

1 
1 2 
1 2 3
2 3 4
..

The repeated rows are effectively removed. The code is below.

HOWEVER, THE CODE IS NOT GENERAL! It only works with 1-digit integers, since awk array indexes are not traversed in sorted order. Could someone help to generalize it? Thanks!

{for(ii=1;ii<=NF;ii++){aa[ii,$ii]=$ii}}END{mm=1; for (n in aa) {split(n, bb, SUBSEP); if (bb[1]==mm){cc=bb[2]; printf ( "%2s", cc)}else{if (mm!=bb[1]){printf "\n%2s", bb[2] }; mm=bb[1]}}}
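One way to generalize this (a sketch only: it keeps each column's unique values in first-seen order and handles values of any width, not just 1-digit integers, instead of relying on the unspecified "for (n in aa)" traversal order):

{
    for (ii = 1; ii <= NF; ii++) {
        key = ii SUBSEP $ii
        if (!(key in seen)) {        # first time this value appears in column ii
            seen[key] = 1
            cnt[ii]++
            val[ii, cnt[ii]] = $ii   # store it in arrival order
        }
    }
    if (NF > maxnf) maxnf = NF
}
END {
    # each input column becomes one output row of its unique values ("retransposed")
    for (ii = 1; ii <= maxnf; ii++)
        for (jj = 1; jj <= cnt[ii]; jj++)
            printf "%s%s", val[ii, jj], (jj < cnt[ii] ? OFS : ORS)
}

The script name and any file names used with it are only placeholders; it is meant as a starting point, not a drop-in replacement.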

Upvotes: 1
