user987266

Reputation:

Improve speed of the file search in Ruby

Given a directory with about 100,000 small files (each file is about 1 kB), I need to get a list of these files and iterate over it to find files with the same name but a different case (the files are on a Linux ext4 filesystem). Currently, I use code like this:

   def similar_files_in_folder(file_path, folder, exclude_folders = false)
     files = Dir.glob(file_path, File::FNM_CASEFOLD)
     files_set = files.select{|f| f.start_with?(folder)}
     return files_set unless exclude_folders
     files_set.reject{|entry| File.directory? entry}
   end

   dir_entries = Dir.entries(@directory) - ['.', '..']
   dir_entries.map do |file_name|
     similar_files_in_folder(file_name, @directory)
   end

The issue with this approach is that the snippet takes a very long time to finish: several hours on my system. Each call to similar_files_in_folder runs a full Dir.glob, so the loop effectively scans the whole directory once per file.

Is there another way to achieve the same goal but much faster in Ruby?

Limitation: I can't load the file list into memory once and then just compare the names in lowercase, because new files keep appearing in the @directory. So I need to rescan the @directory on each iteration.
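For reference, here is a small script to reproduce the setup for benchmarking (the file names and counts below are invented, not the real data):

```ruby
require 'tmpdir'
require 'fileutils'

# Create a throwaway directory containing a few files whose names differ
# only in case, mimicking the real @directory (names here are made up).
def make_test_dir(dir)
  FileUtils.mkdir_p(dir)
  ['foo.txt', 'bar.txt', 'BAR.txt', 'Baz.TXT', 'baz.txt'].each do |name|
    File.write(File.join(dir, name), 'x' * 1024) # ~1 kB each, as in the question
  end
end
```

On a case-sensitive filesystem (ext4) this creates five distinct files, three of which collide when compared case-insensitively.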

Thanks for any hint.

Upvotes: 1

Views: 934

Answers (2)

icy

Reputation: 934

What I meant by my comment was that you could search for the string as you traverse the filesystem, instead of first building up a huge array of all possible files and only then searching. I wrote something similar to a Linux find <path> | grep --color -i <pattern>, except that it highlights the pattern only in the basename:

require 'find'

# Find files whose basename matches a pattern (and print results to the console)
def find_similar(s, opts = {})
  # defaults: path '.', case insensitive, no bash terminal coloring
  opts[:verbose] ||= false
  opts[:path] ||= '.'
  opts[:insensitive] = true if opts[:insensitive].nil?
  opts[:color] ||= false
  boldred = "\e[1m\e[31m\\1\e[0m"    # contains an escaped \1 backreference for the regex

  puts "searching for \"#{s}\" in \"#{opts[:path]}\", insensitive=#{opts[:insensitive]}..." if opts[:verbose]
  reg = opts[:insensitive] ? /(#{s})/i : /(#{s})/
  Find.find(opts[:path]) do |path|
    dir, base = File.dirname(path), File.basename(path)
    if base =~ reg
      if opts[:color]
        puts "#{dir}/#{base.gsub(reg, boldred)}"
      else
        puts path
      end
    end
  end
end

time = Time.now
#find_similar('LOg', :color=>true)    #similar to   find . | grep --color -i LOg
find_similar('pYt', :path=>'c:/bin/sublime3/', :color=>true, :verbose=>true)
puts "search took #{Time.now-time}sec"

Example output (Cygwin); it also works when run from cmd.exe.

Upvotes: 1

Stefan

Reputation: 114178

If I understand your code correctly, this already returns an array of all those 100k filenames:

dir_entries = Dir.entries(@directory) - ['.', '..']
#=> ["foo.txt", "bar.txt", "BAR.txt", ...]

I would group this array by the lowercase filename:

dir_entries.group_by(&:downcase)
#=> {"foo.txt"=>["foo.txt"], "bar.txt"=>["bar.txt", "BAR.txt"], ... }

And select the ones that occur more than once:

dir_entries.group_by(&:downcase).select { |k, v| v.size > 1 }
#=> {"bar.txt"=>["bar.txt", "BAR.txt"], ...}
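Putting both steps into a single method (the method name here is just a sketch):

```ruby
# Scan a directory once and return only the names that collide
# when compared case-insensitively, grouped by their lowercased form.
def case_conflicts(directory)
  entries = Dir.entries(directory) - ['.', '..']
  entries.group_by(&:downcase).select { |_name, group| group.size > 1 }
end
```

Since this does a single directory scan per call, it can simply be re-run whenever new files may have appeared, which satisfies the limitation in the question.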

Upvotes: 2
