Reputation: 833
I have at least 500 different files containing words (each word on a separate line). The problem is that these lists are very long (5 billion records in total) and I'm having trouble making each list unique. I'd like to preserve the filenames but at the same time have unique entries in every file (without merging, etc.).
So far I have tried different programs like app.merge and ccr, and a database with a unique column in a table (PostgreSQL and SQLite), without luck. I can't find a reliable solution. What would you suggest?
EDIT: I'm trying to prevent any of the files from having words in common. To explain it better, let's say I have 3 files with the following content:
f1:
word1
other
something
f2:
word2
word1
other
f3:
word1
something
myentry
As a result I'd expect to see:
f1:
word1
other
something
f2:
word2
f3:
myentry
Of course the files themselves are much, much bigger (take this one as an example: http://md5decrypt.net/Telecharger-wordlist/Md5decrypt-awesome-wordlist.7z). To answer the question 'what have I tested so far' - well, here is the code I'm working on now: https://pastebin.com/Y8HutakU and here is the result (stopped after 1 hour of running): https://pastebin.com/tknve7qA. I know the code is far from optimal, and that's clearly visible in the output, where each next insert into the DB takes longer and longer as the DB grows. I'm experimenting with a DB because I think it would be a good solution for keeping all words unique, preserving filenames, and having a comparison method for future use (when I download another wordlist to compare, etc.). Plus there are good write-ups about SQLite performance.
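To make the expected behaviour concrete, here is a rough Python sketch of what I mean (just an illustration on the small example above, not my real code; it keeps every word seen so far in an in-memory set, which obviously won't work for 5 billion records - that's part of why I'm experimenting with a DB):

seen = set()                          # words already kept in an earlier file
for path in ["f1", "f2", "f3"]:       # process files in a fixed order
    with open(path, encoding="utf-8") as fh:
        words = [line.rstrip("\n") for line in fh]
    kept = []
    for word in words:
        if word not in seen:          # keep a word only in the first file it appears in
            seen.add(word)
            kept.append(word)
    with open(path, "w", encoding="utf-8") as fh:
        for word in kept:
            fh.write(word + "\n")

Running this on the three example files produces exactly the expected output shown above (f2 keeps only word2, f3 keeps only myentry), and it also removes duplicates within a single file.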
Upvotes: -1
Views: 143
Reputation: 1860
If you're on a Linux system, you could just use standard command-line tools.
for file in /path/to/files/*
do
    # sort -o is safe to write back to its own input file
    sort -u -o "$file" "$file"
done
Upvotes: 0