Reputation: 774
Let's say I have 4 folders with 25 folders in each. In each of those 25 folders there are 20 folders, each with 1 very long text document. The method I'm using now seems to have room to improve, and in every scenario in which I implement Ruby's threads, the result is slower than before. I have an array of the 54 names of the folders. I iterate through each and use a foreach method to get the deeply nested files. In the foreach loop I do 3 things: I get the contents of today's file, I get the contents of yesterday's file, and I use my diff algorithm to find what has changed from yesterday to today. How would you do this faster with threads?
def backup_differ_loop device_name
  device_name.strip!
  Dir.foreach("X:/Backups/#{device_name}/#{@today}").each do |backup|
    if backup != "." and backup != ".."
      @today_filename = "X:/Backups/#{device_name}/#{@today}/#{backup}"
      @yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{backup.gsub(@today, @yesterday)}"
      if File.exists?(@yesterday_filename)
        today_backup_content = File.open(@today_filename, "r").read
        yesterday_backup_content = File.open(@yesterday_filename, "r").read
        begin
          Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
        rescue
          # do nothing, just continue
        end
      else
        # file not found
      end
    end
  end
end
Upvotes: 0
Views: 88
Reputation: 22315
Most Ruby implementations don't have "true" multicore threading; in MRI, the Global Interpreter Lock means the interpreter can only run one thread at a time, so threads won't gain you any performance improvement. For applications like yours with lots of disk IO this is especially true. In fact, even with real multithreading your application might be IO-bound and still not see much of an improvement.
You are more likely to get results by finding some inefficient algorithm in your code and improving it.
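Before reaching for threads, it can help to measure where the time actually goes. A minimal sketch using Ruby's standard Benchmark module (the workloads here are stand-ins, not the asker's actual files or diff call):

```ruby
require "benchmark"

# Time a stand-in for the IO work (reading files) separately from a
# stand-in for the CPU work (diffing), to see which one dominates.
io_time = Benchmark.realtime do
  100.times { File.read(__FILE__) } # placeholder for reading backup files
end

cpu_time = Benchmark.realtime do
  100.times { ("a".."z").to_a.shuffle.sort } # placeholder for the diff work
end

puts format("IO: %.4fs, CPU: %.4fs", io_time, cpu_time)
```

If the IO portion dominates, more threads on MRI are unlikely to help.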
Upvotes: 0
Reputation: 8424
The first part of your logic is finding all files in a specific folder. Instead of doing Dir.foreach and then checking against "." and ".." you can do this in one line:
files = Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item)}
Notice the /* at the end? This searches 1 level deep (inside the @today folder). If you want to search inside sub-folders too, replace it with /**/* so you get an array of all files inside all sub-folders of @today.
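To illustrate the difference between the two patterns, here is a small self-contained demo using a throwaway temp directory (the file names are made up for the example):

```ruby
require "tmpdir"
require "fileutils"

# Build a tiny tree: today/a.txt and today/sub/b.txt
root = Dir.mktmpdir
FileUtils.mkdir_p("#{root}/today/sub")
File.write("#{root}/today/a.txt", "x")
File.write("#{root}/today/sub/b.txt", "y")

# /* searches one level deep; /**/* also descends into sub-folders.
one_level = Dir.glob("#{root}/today/*").select { |item| File.file?(item) }
recursive = Dir.glob("#{root}/today/**/*").select { |item| File.file?(item) }
# one_level contains only a.txt; recursive contains a.txt and sub/b.txt

FileUtils.remove_entry(root)
```

The `select { |item| File.file?(item) }` filter matters for the recursive form, because `**/*` matches directories as well as files.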
So I'd first have a method which would give me a two-dimensional array, each inner array holding a pair of matching files:

def get_matching_files(device_name)
  matching_files = []
  Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item) }.each do |backup|
    today_filename = File.absolute_path(backup)
    # glob returns full paths, so build yesterday's path from the basename
    yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{File.basename(backup).gsub(@today, @yesterday)}"
    if File.exist?(yesterday_filename)
      matching_files << [today_filename, yesterday_filename]
    end
  end
  matching_files
end

and call it:

matching_files = get_matching_files(device_name)
NOW we can start the multi-threading, which is probably where things slowed down for you. I'd first push all the pairs from matching_files into a queue, then start 5 threads which will run until the queue is empty:
queue = Queue.new
matching_files.each { |pair| queue << pair }

# 5 being the number of threads
5.times.map do
  Thread.new do
    until queue.empty?
      begin
        # non-blocking pop: raises ThreadError if another thread
        # emptied the queue between our empty? check and this call
        today_filename, yesterday_filename = queue.pop(true)
        today_backup_content = File.read(today_filename)
        yesterday_backup_content = File.read(yesterday_filename)
        Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
      rescue ThreadError
        break # queue is empty, this thread is done
      rescue
        # ignore diff errors, just continue
      end
    end
  end
end.each(&:join)
I can't guarantee my code will work because I don't have the entire context of your program. I hope I've given you some ideas.
And the MOST important thing: the standard implementation of Ruby (MRI) can run only 1 thread at a time. This means that even if you implement the code above, you won't see a significant performance difference. So get Rubinius or JRuby, which allow more than one thread to run at a time. Or, if you prefer to use the standard MRI Ruby, you'll need to re-structure your code (you can keep your original version) and start multiple processes. You'll just need something like a shared database where you can store the matching_files (each pair as a single row, for example), and every time a process 'takes' something from that database, it marks that row as 'used'. SQLite is a good db for this, I think, because it's thread-safe by default.
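If a shared database feels heavyweight, a simpler multi-process sketch is to split matching_files into slices up front and fork one worker per slice, so no coordination is needed at all. This is only a sketch: fork is unavailable on Windows (where the X:/ paths suggest the asker is), the file pairs below are dummy placeholders, and each child would do the File.read and Diffy work shown earlier.

```ruby
# Dummy stand-ins for the [today, yesterday] pairs built earlier.
matching_files = [["t1", "y1"], ["t2", "y2"], ["t3", "y3"], ["t4", "y4"]]

# Split the work up front: 2 pairs per worker process.
slices = matching_files.each_slice(2).to_a

pids = slices.map do |slice|
  fork do
    slice.each do |today_file, yesterday_file|
      # each child process would read both files and run Diffy here
    end
  end
end

# Wait for every child and collect its exit status.
statuses = pids.map { |pid| Process.wait2(pid).last }
```

Because each process has its own interpreter, MRI's one-thread-at-a-time limit no longer applies; the OS schedules the workers across cores.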
Upvotes: 2