Reputation: 774
Let's say I have 4 folders with 25 folders in each. In each of those 25 folders there are 20 folders, each with 1 very long text document. The method I'm using now seems to have room to improve, and in every scenario in which I implement Ruby's threads, the result is slower than before. I have an array of the 54 names of the folders. I iterate through each and use a foreach method to get the deeply nested files. In the foreach loop I do 3 things: I get the contents of today's file, I get the contents of yesterday's file, and I use my diff algorithm to find what has changed from yesterday to today. How would you do this faster with threads?
def backup_differ_loop device_name
  device_name.strip!
  Dir.foreach("X:/Backups/#{device_name}/#{@today}").each do |backup|
    if backup != "." and backup != ".."
      @today_filename = "X:/Backups/#{device_name}/#{@today}/#{backup}"
      @yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{backup.gsub(@today, @yesterday)}"
      if File.exists?(@yesterday_filename)
        today_backup_content = File.open(@today_filename, "r").read
        yesterday_backup_content = File.open(@yesterday_filename, "r").read
        begin
          Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
        rescue
          # do nothing, just continue
        end
      else
        # file not found
      end
    end
  end
end
Upvotes: 0
Views: 88
Reputation: 22315
Most Ruby implementations don't have "true" multicore threading; in MRI, the Global Interpreter Lock means the interpreter can only run one thread at a time, so threads won't gain you any performance improvement. For applications like yours with lots of disk IO this is especially true. In fact, even with real multithreading your application might be IO-bound and still not see much of an improvement.
You are more likely to get results by finding some inefficient algorithm in your code and improving it.
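Before reaching for threads, it can help to measure where the time actually goes. A minimal sketch using Ruby's standard Benchmark module (the workloads here are stand-ins, not the asker's actual files or diff call):

```ruby
require "benchmark"

# Time a stand-in for the IO work (reading files) separately from a
# stand-in for the CPU work (diffing), to see which one dominates.
io_time = Benchmark.realtime do
  100.times { File.read(__FILE__) } # placeholder for reading backup files
end

cpu_time = Benchmark.realtime do
  100.times { ("a".."z").to_a.shuffle.sort } # placeholder for the diff work
end

puts format("IO: %.4fs, CPU: %.4fs", io_time, cpu_time)
```

If the IO portion dominates, more threads on MRI are unlikely to help.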
Upvotes: 0
Reputation: 8424
The first part of your logic is finding all files in a specific folder. Instead of doing Dir.foreach and then checking against "." and ".." you can do this in one line:
files = Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item)}
Notice the /* at the end? This searches 1 level deep (inside the @today folder). If you want to search inside sub-folders too, replace it with /**/* so you get an array of all files inside all sub-folders of @today.
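To illustrate the difference between the two patterns, here is a small self-contained demo using a throwaway temp directory (the file names are made up for the example):

```ruby
require "tmpdir"
require "fileutils"

# Build a tiny tree: today/a.txt and today/sub/b.txt
root = Dir.mktmpdir
FileUtils.mkdir_p("#{root}/today/sub")
File.write("#{root}/today/a.txt", "x")
File.write("#{root}/today/sub/b.txt", "y")

# /* searches one level deep; /**/* also descends into sub-folders.
one_level = Dir.glob("#{root}/today/*").select { |item| File.file?(item) }
recursive = Dir.glob("#{root}/today/**/*").select { |item| File.file?(item) }
# one_level contains only a.txt; recursive contains a.txt and sub/b.txt

FileUtils.remove_entry(root)
```

The `select { |item| File.file?(item) }` filter matters for the recursive form, because `**/*` matches directories as well as files.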
So I'd first have a method which would give me a two-dimensional array, each inner array holding a pair of matching files:

def get_matching_files(device_name)
  matching_files = []
  Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item) }.each do |backup|
    today_filename = File.absolute_path(backup)
    # glob returns full paths, so build yesterday's path from the basename
    yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{File.basename(backup).gsub(@today, @yesterday)}"
    if File.exist?(yesterday_filename)
      matching_files << [today_filename, yesterday_filename]
    end
  end
  matching_files
end

and call it:

matching_files = get_matching_files(device_name)
NOW we can start the multi-threading, which is probably where things slowed down for you. I'd first push all the pairs from matching_files into a queue, then start 5 threads which will run until the queue is empty:
queue = Queue.new
matching_files.each { |pair| queue << pair }

# 5 being the number of threads
5.times.map do
  Thread.new do
    until queue.empty?
      begin
        # non-blocking pop: raises ThreadError if another thread
        # emptied the queue between our empty? check and this call
        today_filename, yesterday_filename = queue.pop(true)
        today_backup_content = File.read(today_filename)
        yesterday_backup_content = File.read(yesterday_filename)
        Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
      rescue ThreadError
        break # queue is empty, this thread is done
      rescue
        # ignore diff errors, just continue
      end
    end
  end
end.each(&:join)
I can't guarantee my code will work because I don't have the entire context of your program. I hope I've given you some ideas.
And the MOST important thing: the standard implementation of Ruby (MRI) can run only 1 thread at a time. This means that even if you implement the code above, you won't see a significant performance difference. So get Rubinius or JRuby, which allow more than one thread to run at a time. Or, if you prefer to use the standard MRI Ruby, you'll need to re-structure your code (you can keep your original version) and start multiple processes. You'll just need something like a shared database where you can store the matching_files (each pair as a single row, for example), and every time a process 'takes' something from that database, it marks that row as 'used'. SQLite is a good db for this, I think, because it's thread-safe by default.
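If a shared database feels heavyweight, a simpler multi-process sketch is to split matching_files into slices up front and fork one worker per slice, so no coordination is needed at all. This is only a sketch: fork is unavailable on Windows (where the X:/ paths suggest the asker is), the file pairs below are dummy placeholders, and each child would do the File.read and Diffy work shown earlier.

```ruby
# Dummy stand-ins for the [today, yesterday] pairs built earlier.
matching_files = [["t1", "y1"], ["t2", "y2"], ["t3", "y3"], ["t4", "y4"]]

# Split the work up front: 2 pairs per worker process.
slices = matching_files.each_slice(2).to_a

pids = slices.map do |slice|
  fork do
    slice.each do |today_file, yesterday_file|
      # each child process would read both files and run Diffy here
    end
  end
end

# Wait for every child and collect its exit status.
statuses = pids.map { |pid| Process.wait2(pid).last }
```

Because each process has its own interpreter, MRI's one-thread-at-a-time limit no longer applies; the OS schedules the workers across cores.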
Upvotes: 2