Reputation: 833
I have at least 500 different files containing words (each word on a separate line). The problem is that these lists are very long (5 billion records in total) and I'm having trouble making each list unique. I'd like to preserve the filenames but at the same time have unique entries in every file (without merging, etc.).
So far I have tried different programs like app.merge and ccr, and a database with a unique column in a table (PostgreSQL and SQLite), without luck. I can't find a reliable solution. What would you suggest?
EDIT: I'm trying to prevent any of the files from having words in common. To explain it better, let's say I have 3 files with the following content:
f1:
word1
other
something
f2:
word2
word1
other
f3:
word1
something
myentry
As a result I'd expect to see:
f1:
word1
other
something
f2:
word2
f3:
myentry
Of course the files themselves are much, much bigger (take this one as an example: http://md5decrypt.net/Telecharger-wordlist/Md5decrypt-awesome-wordlist.7z). To answer the question 'what have I tested so far' - well, here is the code I'm working on now: https://pastebin.com/Y8HutakU and here is the result (stopped after 1 hour of running): https://pastebin.com/tknve7qA. I know the code is far from optimal, and that's clearly visible in the output, where each next insert into the DB takes longer and longer as the DB grows. I'm experimenting with a DB because I think it would be a good solution for keeping all words unique, preserving filenames, and having a comparison method for future use (when I download another wordlist to compare, etc.). Plus there are good write-ups about SQLite performance.
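To make the expected behaviour concrete, here is a rough Python sketch of what I mean (just an illustration on the small example above, not my real code; it keeps every word seen so far in an in-memory set, which obviously won't work for 5 billion records - that's part of why I'm experimenting with a DB):

seen = set()                          # words already kept in an earlier file
for path in ["f1", "f2", "f3"]:       # process files in a fixed order
    with open(path, encoding="utf-8") as fh:
        words = [line.rstrip("\n") for line in fh]
    kept = []
    for word in words:
        if word not in seen:          # keep a word only in the first file it appears in
            seen.add(word)
            kept.append(word)
    with open(path, "w", encoding="utf-8") as fh:
        for word in kept:
            fh.write(word + "\n")

Running this on the three example files produces exactly the expected output shown above (f2 keeps only word2, f3 keeps only myentry), and it also removes duplicates within a single file.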
Upvotes: -1
Views: 143
Reputation: 1860
If you're on a Linux system, you could just use standard command-line tools.
for file in /path/to/files/*
do
    # sort -o is safe to write back to its own input file
    sort -u -o "$file" "$file"
done
Upvotes: 0