Reputation: 2460
I'm attempting to eliminate duplicate files from a filesystem with around 12,000 decent-size (150+ MB) files. I expect 20-50 duplicates in the set.
Rather than do a checksum on every single file, which is relatively demanding, my idea was to build a hash listing every file and its filesize, eliminate entries where the filesize is unique, and only do a checksum on the remainders, saving a lot of time.
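For reference, a name => size hash like that could be built with something along these lines (the directory path and glob pattern here are placeholders, not from the question):

# Hypothetical starting point: walk a directory tree and record each file's size.
files = Dir.glob("/path/to/files/**/*")
           .select { |path| File.file?(path) }
           .map { |path| [path, File.size(path)] }
           .to_h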
However I'm having a bit of trouble stripping the hash down to just the non-unique entries. I tried the following, where files is a hash like "super_cool_map.png" => 1073741824:
uniques = files.values.uniq
dupes = files.delete_if do |k,v|
  uniques.include?(v)
end
puts dupes
But that only outputs an empty hash. What should I do?
Upvotes: 0
Views: 63
Reputation: 118271
How about this?
# returns an array of name groups, one per file size that occurs more than once
files.group_by(&:last).map { |_, v| v.map(&:first) if v.size > 1 }.compact
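For instance, with a small hypothetical name => size hash (names and sizes invented for illustration), it yields:

# Hypothetical sample in the question's name => size shape.
files = { "a.png" => 100, "b.png" => 100, "c.png" => 200, "d.png" => 300 }
files.group_by(&:last).map { |_, v| v.map(&:first) if v.size > 1 }.compact
# => [["a.png", "b.png"]]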
Upvotes: 2
Reputation: 59611
Why not reverse the mapping? Make the keys the file sizes and the values lists
of file names. That way you get "grouping by size" for free.
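A minimal sketch of that reversal, assuming files is the name => size hash from the question:

# Invert name => size into size => [names]; each bucket collects files of one size.
by_size = Hash.new { |hash, size| hash[size] = [] }
files.each { |name, size| by_size[size] << name }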
Then you can filter that hash like this:
my_hash = {30323 => ["file1", "file2"], 233 => ["file3"]}
filtered = my_hash.select { |k, v| v.size > 1 }
p filtered # prints {30323 => ["file1", "file2"]}
Now you have a hash where each key corresponds to a list of files you need to hash and compare to each other.
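From there, a sketch of the final step (assuming filtered is the size => names hash from above): checksum only those candidates with Ruby's standard digest library and keep the names whose digests collide.

require 'digest'

# Only files that share a size get read and hashed.
dupes = filtered.values.flat_map do |names|
  names.group_by { |name| Digest::SHA256.file(name).hexdigest }
       .values
       .select { |group| group.size > 1 }
end
# dupes is an array of groups; each group lists paths with identical contents.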
Upvotes: 2