NastyDiaper

Reputation: 2558

Finding duplicate entries across very large text files in bash

I am working with very large data files extracted from a database. There are duplicates across these files that I need to remove. If there are duplicates, they will exist across files, not within the same file. The files contain entries that look like the following:

File1

 623898/bn-oopi-990iu/I Like Potato
 982347/ki-jkhi-767ho/Let's go to Sesame Street
 ....


File2

 568798/jj-ytut-786hh/Hello Mike
 982347/ki-jkhi-767ho/Let's go to Sesame Street
 ....

So the Sesame Street line will have to be removed, possibly from as many as 5 files, but it has to remain in at least one of them. From what I have been able to put together so far, I can run cat * | sort | uniq -cd to get each duplicated line and the number of times it has been duplicated, but I have no way of getting the file names. cat * | sort | uniq -cd | grep "" * doesn't work. Any ideas or approaches for a solution would be great.

Upvotes: 1

Views: 392

Answers (3)

Renaud Pacalet

Reputation: 28995

twalberg's solution works perfectly, but if your files are really large it could exhaust the available memory, because it creates one entry in an associative array per unique record encountered. If that happens, you can try a similar approach where there is only one entry per duplicate record (I assume you have GNU awk and your files are named *.txt):

sort *.txt | uniq -d > dup
awk 'BEGIN { while((getline < "dup") > 0) dup[$0] = 1 }          # load the duplicate records
     !($0 in dup)                { print >> (FILENAME ".new") }    # non-duplicates are always kept
     ($0 in dup) && dup[$0] == 1 { print >> (FILENAME ".new"); dup[$0] = 0 }' *.txt   # keep only the first copy of a duplicate

Note that if you have many duplicates, this could also exhaust the available memory. You can solve that by splitting the dup file into smaller chunks and running the awk script on each chunk in turn.
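
A minimal sketch of that chunked variant, assuming GNU split and bash; the chunk size (100000 lines) and the dup.part. prefix are arbitrary choices, and each pass rewrites the *.txt files in place:

sort *.txt | uniq -d > dup
split -l 100000 dup dup.part.                  # one piece of the duplicate list per chunk file

for chunk in dup.part.*; do
  for f in *.txt; do : > "$f.new"; done        # pre-create every output file, even if it ends up empty
  awk -v dupfile="$chunk" '
    BEGIN { while((getline < dupfile) > 0) dup[$0] = 1 }
    !($0 in dup)                { print >> (FILENAME ".new") }
    ($0 in dup) && dup[$0] == 1 { print >> (FILENAME ".new"); dup[$0] = 0 }
  ' *.txt
  for f in *.txt; do mv "$f.new" "$f"; done    # this pass's output becomes the next pass's input
done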

Upvotes: 0

twalberg

Reputation: 62369

Something along these lines might be useful:

awk '!seen[$0] { print $0 > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
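
For instance, with the two sample files from the question (assuming they are saved as file1 and file2), the expected result would be a .new copy of each input, with the Sesame Street line kept only in the first file that contains it:

file1.new:

 623898/bn-oopi-990iu/I Like Potato
 982347/ki-jkhi-767ho/Let's go to Sesame Street

file2.new:

 568798/jj-ytut-786hh/Hello Mike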

Upvotes: 1

choroba

Reputation: 241828

Expanding your original idea:

sort * | uniq -cd | awk '{sub(/^ *[0-9]+ /, ""); print}' | grep -Ff- *

i.e. take that output, strip the leading count added by uniq -c so that only the duplicated strings remain, then search all the files for them (the list of things to search for is taken from -, i.e. stdin), literally (-F). Since grep is given several files, each match is prefixed with the file name, which is the part you were missing.
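
With the sample files from the question (assuming the directory contains just the two files File1 and File2), the output would look something like:

 File1:982347/ki-jkhi-767ho/Let's go to Sesame Street
 File2:982347/ki-jkhi-767ho/Let's go to Sesame Street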

Upvotes: 1
