Max

Reputation: 629

How to speed up code comparing file sizes/names?

I've got two main file servers and a big backup server, but the backup server has become disorganized over time, and I need to make sure there are no files on the backup server that aren't on the main servers.

So I thought I'd write some quick code in Ruby to do so. It builds a list of all files on each drive (found using Dir.glob) and checks for the existence of files on the main drives by comparing File.size and File.basename.

Problem is, it takes a while! Comparing each file on the main drives against the backup drive takes ~0.8s, and given a drive with hundreds of thousands of files, this isn't going to work.

Any suggestions? I'm assuming my way is very inefficient.

Upvotes: 1

Views: 45

Answers (2)

lobati

Reputation: 10215

Dir.glob returns an Array, so you end up scanning the full list of files for every file you look up. With 100,000 files, that means on the order of 100,000^2 (10 billion) comparisons. You can speed things up considerably by using a Set instead, which has constant-time lookup on average and reduces the work to roughly 100,000 lookups. You can try something like this:

require 'set'
# Dir.glob returns full paths for this pattern, so look entries up by full path.
files_to_search = Set.new(Dir.glob('/that/path/**/*'))
files_to_search.include?('/that/path/foo')
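Since the question matches files by name and size rather than by path, here is a minimal sketch of the same Set idea keyed on [basename, size]. The method names `file_keys` and `missing_from_main` are made up for illustration, and it assumes that a matching name and size is a good enough test for "same file", as in the question. It walks the trees with Find from the standard library, which, unlike a plain Dir.glob pattern, also visits dotfiles.

```ruby
require 'set'
require 'find'

# Build a Set of [basename, size] keys for every regular file under the given roots.
# (Hypothetical helper; the key choice follows the question, not a checksum.)
def file_keys(roots)
  keys = Set.new
  roots.each do |root|
    Find.find(root) do |path|
      next unless File.file?(path)
      keys << [File.basename(path), File.size(path)]
    end
  end
  keys
end

# Return the backup paths whose [basename, size] key does not appear
# anywhere on the main servers. Each lookup is a single Set membership test.
def missing_from_main(main_roots, backup_root)
  main_keys = file_keys(main_roots)
  missing = []
  Find.find(backup_root) do |path|
    next unless File.file?(path)
    key = [File.basename(path), File.size(path)]
    missing << path unless main_keys.include?(key)
  end
  missing.sort
end
```

Usage would be something like `missing_from_main(['/mnt/main1', '/mnt/main2'], '/mnt/backup')`, where the mount points are placeholders. Two different files that happen to share a name and size will be treated as the same file, so a checksum comparison may be worth adding for anything important.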

You might also be running into other constraints, however, such as memory, or the fact that Ruby is comparatively not all that fast, so if Set doesn't do the trick, you might want to try a shell tool instead. Michał Młoźniak's rsync solution might work, or you could probably come up with a handful of other ways to patch together shell commands and get the information you're looking for. You could check out diff, for example, perhaps paired with find.

Upvotes: 0

Michał Młoźniak

Reputation: 5556

Forget Ruby; just read the manual for the rsync command. You can use --dry-run, or some other mix of options, to compare the directories without copying any files. It will be much faster, both in execution time and in time spent on making this work.
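As a rough sketch of what that could look like when driven from a script, the snippet below builds an rsync dry-run command and runs it through Open3. The method names and the paths in the usage note are hypothetical, and the particular mix of options (--recursive, --dry-run, --ignore-existing, --itemize-changes) is one assumption about what fits here, not the only choice. Note that rsync matches files by relative path, so this only helps where the backup layout mirrors the main server.

```ruby
require 'open3'

# Build the rsync argument list (hypothetical helper).
# --dry-run:         report what would be transferred, copy nothing
# --ignore-existing: skip files that already exist on the destination
# --itemize-changes: print one line per item that would be transferred
def rsync_dry_run_command(backup_dir, main_dir)
  # A trailing slash on the source means "the contents of this directory".
  src = backup_dir.end_with?('/') ? backup_dir : "#{backup_dir}/"
  ['rsync', '--recursive', '--dry-run', '--ignore-existing',
   '--itemize-changes', src, main_dir]
end

# Run the dry run and return rsync's raw output lines.
# Requires rsync to be installed; raises if rsync exits with an error.
def files_only_on_backup(backup_dir, main_dir)
  output, status = Open3.capture2e(*rsync_dry_run_command(backup_dir, main_dir))
  raise "rsync failed: #{output}" unless status.success?
  output.lines.map(&:chomp)
end
```

Calling `files_only_on_backup('/mnt/backup', '/mnt/main1')` (placeholder paths) would then list what rsync would copy from the backup to the main server, that is, the items present only on the backup side at the same relative path.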

Upvotes: 2
