Reputation: 2558
I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove. Any duplicates occur across files, never within the same file. The files contain entries like the following:
File1
623898/bn-oopi-990iu/I Like Potato
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
File2
568798/jj-ytut-786hh/Hello Mike
982347/ki-jkhi-767ho/Let's go to Sesame Street
....
So the Sesame Street line has to be removed from every file it appears in (possibly as many as 5) except one, where it must remain. So far I have found that I can run
cat * | sort | uniq -cd
to get each duplicated line and the number of times it occurs, but that gives me no way of getting the file name.
cat * | sort | uniq -cd | grep "" *
doesn't work. Any ideas or approaches for a solution would be great.
Upvotes: 1
Views: 392
Reputation: 28995
twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory, because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach that keeps only one entry per duplicated record (I assume you have GNU awk and your files are named *.txt):
sort *.txt | uniq -d > dup
awk 'BEGIN {while((getline line < "dup") > 0) {dup[line] = 1}}
!($0 in dup) {print >> (FILENAME ".new")}
$0 in dup {if(dup[$0] == 1) {print >> (FILENAME ".new");dup[$0] = 0}}' *.txt
Note that if you have many duplicates this could also exhaust the available memory. You can solve this by splitting the dup file into smaller chunks and running the awk script once per chunk, each pass taking the previous pass's output as its input.
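Here is a rough sketch of the chunked variant, under a few assumptions of mine: the sample files f1.txt and f2.txt, the chunk. prefix, and the one-line chunk size are made up for the demo, and everything runs in a throwaway directory because each pass overwrites its input files. The passes have to be chained, otherwise duplicates listed in a later chunk would survive the earlier passes untouched.

```shell
#!/bin/sh
# Demo only: work in a scratch directory, since the input files get rewritten.
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical sample data; two lines are duplicated across the files.
printf '%s\n' 'a/1/x' 'b/2/y' 'c/3/z' > f1.txt
printf '%s\n' 'b/2/y' 'd/4/w' 'c/3/z' > f2.txt

# List every duplicated line once, then cut the list into chunks.
# One line per chunk is only for the demo; pick a size that fits in memory.
sort f1.txt f2.txt | uniq -d > dup
split -l 1 dup chunk.

for chunk in chunk.*; do
  awk -v dupfile="$chunk" '
    # Load only this chunk of the duplicate list.
    BEGIN { while ((getline line < dupfile) > 0) dup[line] = 1 }
    # Lines not listed in this chunk are copied unchanged.
    !($0 in dup) { print > (FILENAME ".new"); next }
    # Listed lines are kept on first sight only.
    dup[$0] == 1 { print > (FILENAME ".new"); dup[$0] = 0 }
  ' f1.txt f2.txt

  # Feed this pass into the next one. A file that lost all its lines
  # gets no .new from awk, so create an empty one first.
  for f in f1.txt f2.txt; do
    [ -e "$f.new" ] || : > "$f.new"
    mv "$f.new" "$f"
  done
done

cat f1.txt   # a/1/x, b/2/y, c/3/z
cat f2.txt   # d/4/w
```

Memory use is now bounded by the chunk size, at the cost of reading all the data once per chunk.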
Upvotes: 0
Reputation: 62369
Something along these lines might be useful:
awk '!seen[$0] { print $0 > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
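To see what this does, here is a small run on the two sample files from the question. The scratch directory is my own addition, and I parenthesized the output file name, which some awk implementations need in order to parse the redirection as intended.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Sample data taken from the question.
printf '%s\n' '623898/bn-oopi-990iu/I Like Potato' \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file1
printf '%s\n' '568798/jj-ytut-786hh/Hello Mike' \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file2

# Write each line to <input file>.new the first time it is seen in any file.
awk '!seen[$0] { print $0 > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2

cat file1.new   # both lines of file1
cat file2.new   # only the Hello Mike line
```

The originals are left alone; the first file on the command line is the one that keeps the shared line.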
Upvotes: 1
Reputation: 241828
Expanding your original idea:
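sort * | uniq -d | grep -Fxf- *
i.e. from the sorted output, print only the duplicated lines (uniq -d without the count, so that lines containing spaces stay intact), then search all the files for them. The list of things to search for is taken from - (i.e. stdin), and the lines are matched literally (-F) and as whole lines (-x).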
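A quick way to check the idea on the sample data from the question is sketched below. Note that it is a variant of the pipeline: it uses uniq -d without counts and adds -x for whole-line matching. The scratch directory and the out variable are my own additions, and reading patterns from stdin with -f - is assumed to be supported by your grep (it is in GNU grep).

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Sample data taken from the question.
printf '%s\n' '623898/bn-oopi-990iu/I Like Potato' \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file1
printf '%s\n' '568798/jj-ytut-786hh/Hello Mike' \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file2

# uniq -d prints each duplicated line once; grep then reports every file
# containing one of those lines, prefixing each match with the file name.
out=$(sort file1 file2 | uniq -d | grep -Fxf - file1 file2)
printf '%s\n' "$out"
```

Each output line has the form file:line, so the file names can be cut off at the first colon if only those are needed.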
Upvotes: 1