c.a.p.

Reputation: 21

Fast removing duplicate rows between multiple files

I have 10k files with 80k rows each that I need to compare, and I have to either delete the duplicate lines or replace them with "0". It needs to be ultrafast, since I have to do it 1000+ times.

The following script is fast enough for files with fewer than 100 rows. It is tcsh:

# tcsh: split every file into one-line pieces
foreach file ( `ls -1 *` )
split -l 1 ${file} ${file}.
end
# hash every piece; any piece whose sha512 was already seen gets overwritten with "rowzero"
find *.* -type f -print0 | xargs -0 sha512sum | awk '($1 in aa){print $2} !($1 in aa){aa[$1]=$2}' | xargs -I {} cp rowzero {}
# reassemble each original file (assuming original names contain no dot) from its pieces
foreach file ( `ls -1 * | grep -v '\.'` )
cat ${file}.* > ${file}.filtered
end

where "rowzero" is just a file with a... zero. I have tried python but haven't found a fast way. I have tried pasting them and doing all nice fast things (awk, sed, above commands, etc.) but the i/o slows to incredible levels when the file has over more than e.g. 1000 columns. I need help, thanks a million hours!.

Upvotes: 1

Views: 328

Answers (1)

c.a.p.

Reputation: 21

OK, this is so far the fastest code I could come up with; it works on transposed and concatenated ("cat") input. As explained above, appending input with "cat" / ">>" works fine, but "paste" or "pr" become a nightmare when pasting another column into, say, 1GB+ files, which is why we need to transpose. E.g. each original file looks like

1 
2 
3 
4 

... so if we transpose and cat the first file together with the others, the input for the code will look like:

1 2 3 4 .. 
1 1 2 4 .. 
1 1 1 4 .. 

The code returns the original (i.e. retransposed and pasted) format, with the minor detail that the rows come out shuffled:

1 
1 2 
1 2 3
2 3 4
..

The repeated rows are effectively removed. The code is below.

HOWEVER, THE CODE IS NOT GENERAL! It only works with 1-digit integers, since awk array indexes are not traversed in sorted order. Could someone help to generalize it? Thanks!

{for(ii=1;ii<=NF;ii++){aa[ii,$ii]=$ii}}END{mm=1; for (n in aa) {split(n, bb, SUBSEP); if (bb[1]==mm){cc=bb[2]; printf ( "%2s", cc)}else{if (mm!=bb[1]){printf "\n%2s", bb[2] }; mm=bb[1]}}}
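One way to generalize this (a sketch only: it keeps each column's unique values in first-seen order and handles values of any width, not just 1-digit integers, instead of relying on the unspecified "for (n in aa)" traversal order):

{
    for (ii = 1; ii <= NF; ii++) {
        key = ii SUBSEP $ii
        if (!(key in seen)) {        # first time this value appears in column ii
            seen[key] = 1
            cnt[ii]++
            val[ii, cnt[ii]] = $ii   # store it in arrival order
        }
    }
    if (NF > maxnf) maxnf = NF
}
END {
    # each input column becomes one output row of its unique values ("retransposed")
    for (ii = 1; ii <= maxnf; ii++)
        for (jj = 1; jj <= cnt[ii]; jj++)
            printf "%s%s", val[ii, jj], (jj < cnt[ii] ? OFS : ORS)
}

The script name and any file names used with it are only placeholders; it is meant as a starting point, not a drop-in replacement.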

Upvotes: 1
