Yogzzz

Reputation: 2785

Finding differences between two files in Rails

My current script:

old_ids = []
File.open("missing.txt", "w") do |result|
  File.open('query_result.csv', "r") do |old_copy|
    old_copy.each_line do |line|
      old_ids << line.chomp.to_i # add existing ids to array
    end
  end

  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i # find all ids in the new file that do not exist in the old file and add them to missing.txt
        result.puts line.split("\t")[0].chomp.to_i
        puts "Adding #{line.split("\t")[0].chomp.to_i}"
      end
    end
  end
end

How can I refactor this to be more efficient? The files I parse contain ~2.6 million records, so as written, this takes a really long time to execute.

Upvotes: 0

Views: 982

Answers (3)

Alistair A. Israel

Reputation: 6567

Lowest hanging fruit? Replace your use of an Array and Array#include? with a Hash and Hash#key?:

old_ids = {}
File.open("missing.txt","w") do |result|
  File.foreach('query_result.csv') do |line|
    id = line.chomp.to_i
    old_ids[id] = id
  end

  File.foreach('item_info') do |line|
    id = line.split("\t")[0].chomp.to_i
    unless old_ids.key?(id)
      result.puts id
      puts "Adding #{id}"
    end
  end
end

The reason is simple: Array#include? scans the entire array every time looking for the given value, giving it O(n²) complexity overall. Hash#key?, on the other hand, computes the hash for the given value then performs a lookup to see if the given key is present in the hash table. Amortized time complexity approaches O(n).

A simple test case between the two (finding common lines between two files) yields:

$ time ruby include.rb
real    2m51.409s
user    2m51.246s
sys     0m0.138s

versus

$ time ruby key.rb
real    0m0.092s
user    0m0.082s
sys     0m0.009s

That's with two files of 2^16 lines each. At 10,000 lines, include? still takes 5 seconds, while key? takes all of 29ms.

With 2 million lines, key? takes under 4 seconds. I don't think I can wait around for the Array#include? implementation to finish.
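The include.rb and key.rb scripts aren't reproduced here, but a minimal sketch of what such a benchmark pair might look like follows (a.txt and b.txt are placeholder names for two files of newline-separated values; each script counts the lines of the second file that also appear in the first):

    # include.rb -- Array + Array#include?, O(n^2) overall
    old = File.readlines('a.txt').map(&:chomp)
    common = File.foreach('b.txt').count { |line| old.include?(line.chomp) }
    puts common

    # key.rb -- Hash + Hash#key?, roughly O(n) overall
    old = {}
    File.foreach('a.txt') { |line| old[line.chomp] = true }
    common = File.foreach('b.txt').count { |line| old.key?(line.chomp) }
    puts common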

Upvotes: 2

Intrepidd

Reputation: 20878

Have a look at Diffy; it's a cool gem for calculating diffs between files.
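If a human-readable diff (rather than just the missing ids) is acceptable, a minimal sketch could look like this, assuming the gem is installed (gem install diffy) and using the two file names from the question:

    require 'diffy'

    # Compare the two files directly; :source => 'files' tells Diffy the
    # arguments are paths, not literal strings.
    diff = Diffy::Diff.new('query_result.csv', 'item_info', source: 'files')

    puts diff.to_s          # unified-diff style text output
    # puts diff.to_s(:html) # or HTML output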

Upvotes: 0

toch

Reputation: 3945

Since you want to detect the additional ids, you are not far from an optimal solution. You can speed up the lookup by building a Set rather than putting your old ids in an array; membership tests on a Set are much faster.

require 'set'

old_ids = Set.new
File.open("missing.txt", "w") do |result|
  File.open('query_result.csv', "r") do |old_copy|
    old_copy.each_line do |line|
      old_ids.add(line.chomp.to_i) # add existing ids to the Set
    end
  end

  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i # test whether the id exists in the Set old_ids
        result.puts line.split("\t")[0].chomp.to_i
        puts "Adding #{line.split("\t")[0].chomp.to_i}"
      end
    end
  end
end

I can't test this without sample files. There was one error in your code (old_wmt_ids).

Upvotes: 1
