Reputation: 2785
My current script:
old_ids = []
File.open("missing.txt", "w") do |result|
  File.open('query_result.csv', "r") do |old_copy|
    old_copy.each_line do |line|
      old_ids << line.chomp.to_i # add existing ids to array
    end
  end
  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i # find all ids in the new file that do not exist in the old file and add them to missing.txt
        result.puts line.split("\t")[0].chomp.to_i
        puts "Adding #{line.split("\t")[0].chomp.to_i}"
      end
    end
  end
end
How can I refactor this to be more efficient? The files I parse contain ~2.6 million records, so as written, this takes a really long time to execute.
Upvotes: 0
Views: 982
Reputation: 6567
Lowest hanging fruit? Replace your use of an Array and Array#include? with a Hash and Hash#key?:
old_ids = {}
File.open("missing.txt", "w") do |result|
  File.foreach('query_result.csv') do |line|
    id = line.chomp.to_i
    old_ids[id] = id
  end
  File.foreach('item_info') do |line|
    id = line.split("\t")[0].chomp.to_i
    unless old_ids.key?(id)
      result.puts id
      puts "Adding #{id}"
    end
  end
end
The reason is simple: Array#include? scans the entire array every time looking for the given value, giving it O(n²) complexity overall. Hash#key?, on the other hand, computes the hash of the given value and then performs a single lookup to see if that key is present in the hash table; the amortized time complexity approaches O(n) overall.
A simple test case between the two (finding common lines between two files) yields:
$ time ruby include.rb
real 2m51.409s
user 2m51.246s
sys 0m0.138s
versus
$ time ruby key.rb
real 0m0.092s
user 0m0.082s
sys 0m0.009s
That's with two files of 2¹⁶ (65,536) lines each. At 10,000 lines, include? still takes 5 seconds, while key? takes all of 29 ms. With 2 million lines, key? takes under 4 seconds; I don't think I can wait around for the Array#include? implementation to finish.
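For reference, the two test scripts could look something like this (a sketch rather than the original include.rb and key.rb, with the input file names a.txt and b.txt as assumptions):

# include.rb -- Array#include? rescans the whole array for every lookup
a_lines = File.readlines('a.txt').map(&:chomp)
File.foreach('b.txt') do |line|
  puts line if a_lines.include?(line.chomp)
end

# key.rb -- one hash insert per line, then constant-time lookups
a_lines = {}
File.foreach('a.txt') { |line| a_lines[line.chomp] = true }
File.foreach('b.txt') do |line|
  puts line if a_lines.key?(line.chomp)
end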
Upvotes: 2
Reputation: 20878
Have a look at Diffy; it's a cool gem for calculating diffs between files.
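For example (a sketch using Diffy's file-diff mode, with the file names from the question):

require 'diffy'

# With source: 'files', Diffy treats the arguments as paths to diff
# rather than as literal strings.
puts Diffy::Diff.new('query_result.csv', 'item_info', source: 'files')

Bear in mind that a full textual diff of ~2.6 million lines may well be slower than the targeted id lookups in the other answers.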
Upvotes: 0
Reputation: 3945
As you want to detect the additional ids, you are not so far from an optimal solution. You can speed up the lookup by building a Set rather than putting your old ids in an Array; Set membership tests are backed by a hash, so they're much faster.
require 'set'

old_ids = Set.new
File.open("missing.txt", "w") do |result|
  File.open('query_result.csv', "r") do |old_copy|
    old_copy.each_line do |line|
      old_ids.add(line.chomp.to_i) # add existing ids to the Set
    end
  end
  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i # test whether the id exists in the Set old_ids
        result.puts line.split("\t")[0].chomp.to_i
        puts "Adding #{line.split("\t")[0].chomp.to_i}"
      end
    end
  end
end
I can't test this without sample files. There was one error in your code (old_wmt_ids).
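If both id lists fit comfortably in memory, the same idea can be written more compactly as a set difference (a sketch, assuming the file layout from the question):

require 'set'

old_ids = File.foreach('query_result.csv').map { |line| line.chomp.to_i }.to_set
new_ids = File.foreach('item_info').map { |line| line.split("\t")[0].to_i }.to_set

# Ids present in the new file but absent from the old one.
File.open('missing.txt', 'w') do |result|
  (new_ids - old_ids).each { |id| result.puts id }
end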
Upvotes: 1