Reputation: 4877
We have two lists: a list of events, each with an id, a start_time, and a start_time_range. The start_time_range puts a tolerance around start_time to find near misses.
The objective is to filter current_matches to only include those that appear in the previous matches. An item "appears" in the list if the id matches and the start_time is within the range.
To achieve this, I have this loop, but with our ever increasing data size it is becoming very slow. I need to optimise it:
current_matches.select! do |match_row|
  previous_matches_collection.any? do |previous_match|
    previous_match[:item_id] == match_row[:item_id] &&
      previous_match[:start_time_range].include?(match_row[:start_time].to_f)
  end
end
If it were just the item_id I needed I could do this:
previous_ids = previous_matches_collection.collect { |i| i[:item_id] }
current_matches.select! do |match_row|
  previous_ids.include?(match_row[:item_id])
end
But I can't see a way of using that approach while matching the time condition in each item.
In terms of data, current_matches could be 300 items and previous_matches_collection could be 1k+. Is there a way of doing this without iterating through 300,000 combinations?
edit - Sample data:
previous_matches_collection = [
{ item_id: 1, start_time: 1597094395.1195982, start_time_range: (1597094393.6195982..1597094396.6195982) },
{ item_id: 1, start_time: 1597095083.116646, start_time_range: (1597095081.616646..1597095084.616646) },
{ item_id: 1, start_time: 1597095403.028223, start_time_range: (1597095401.528223..1597095404.528223) },
{ item_id: 2, start_time: 1597098035.056944, start_time_range: (1597098033.556944..1597098036.556944) },
{ item_id: 3, start_time: 1597096073.4109557, start_time_range: (1597096071.9109557..1597096074.9109557) },
{ item_id: 4, start_time: 1597094785.6987526, start_time_range: (1597094784.1987526..1597094787.1987526) },
{ item_id: 4, start_time: 1597098077.41271, start_time_range: (1597098075.91271..1597098078.91271) }
]
current_matches = [
{ item_id: 1, start_time: 1597094395.9195982 },
{ item_id: 1, start_time: 1597095085.116646, },
{ item_id: 1, start_time: 1597095404.228223, },
{ item_id: 2, start_time: 1597094395.1195982 },
{ item_id: 4, start_time: 1597094395.1195982 },
{ item_id: 6, start_time: 1597094395.1195982 },
{ item_id: 17, start_time: 1597094395.1195982 }
]
Upvotes: 2
Views: 104
Reputation: 110675
h = previous_matches_collection.each_with_object({}) do |g,h|
  id = g[:item_id]
  h[id] = (h[id] || []) << g[:start_time_range]
end
#=> {1=>[1597094393.6195982..1597094396.6195982,
# 1597095081.616646..1597095084.616646,
# 1597095401.528223..1597095404.528223],
# 2=>[1597098033.556944..1597098036.556944],
# 3=>[1597096071.9109557..1597096074.9109557],
# 4=>[1597094784.1987526..1597094787.1987526,
# 1597098075.91271..1597098078.91271]}
current_matches.select do |g|
  id = g[:item_id]
  h.key?(id) && h[id].any? { |a| a.cover?(g[:start_time]) }
end
#=> [{:item_id=>1, :start_time=>1597094395.919598},
# {:item_id=>1, :start_time=>1597095404.228223}]
See Range#cover? and Enumerable#any?.
In the first expression, if h does not have a key id = g[:item_id], then h[id] = (h[id] || []) sets h[id] #=> [] (because (h[id] || []) => (nil || []) => []), after which h[id] << g[:start_time_range] is executed. One could alternatively write
h = previous_matches_collection.
    each_with_object(Hash.new { |h,k| h[k] = [] }) do |g,h|
  h[g[:item_id]] << g[:start_time_range]
end
This makes the object h an initially-empty hash with a default proc that executes h[k] = [] if h[k] is evaluated when h has no key k. See the third form of Hash::new.
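A minimal sketch of that default-proc behaviour in isolation:

```ruby
h = Hash.new { |hash, key| hash[key] = [] }

h[:a] << 1   # :a is missing, so the proc first sets h[:a] = []
h[:a] << 2   # :a now exists; the proc is not called again
h[:b]        # even a bare read creates the key via the proc

h  #=> {:a=>[1, 2], :b=>[]}
```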
Upvotes: 1
Reputation: 21120
One easy optimisation to make is to not use any? to find the correct id. Instead, build a lookup hash to fetch all the previous_matches_collection elements with the correct id in O(1).
Another optimization to make is to use cover? instead of include?. The difference is that cover? only compares the element against the begin and end of the range, while include? uses succ (successor, e.g. 1.succ #=> 2) on the begin element to generate a collection through which it searches for the element. (For numeric ranges include? falls back to the same begin/end comparison as cover?, but cover? states the intent directly.)
("a".."z").include?("cc") #=> false
# is similar to:
# ["a", "b", "c", ..., "x", "y", "z"].include?("cc") #=> false

("a".."z").cover?("cc") #=> true
# is similar to:
# "a" <= "cc" && "cc" <= "z" #=> true
The above code block demonstrates the difference between the two. In your scenario you just want to know if the value is within the range, so cover? suits better and is the faster option.
start_time_ranges_by_item_id = previous_matches_collection
  .group_by { |match| match[:item_id] }
  .transform_values { |matches| matches.map { |match| match[:start_time_range] } }

start_time_ranges_by_item_id.default = []
Now, with the start_time_ranges_by_item_id hash built, we can jump to the relevant ranges directly and start the checks from there.
current_matches.select! do |match_row|
  item_id, start_time = match_row.values_at(:item_id, :start_time)
  start_time_ranges = start_time_ranges_by_item_id[item_id]
  start_time_ranges.any? { |range| range.cover?(start_time) }
end
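Run against a slice of the sample data from the question, this keeps only the row whose start_time falls inside a known range for its id (a quick sanity check of the approach, with abbreviated data):

```ruby
previous_matches_collection = [
  { item_id: 1, start_time_range: (1597094393.6195982..1597094396.6195982) },
  { item_id: 2, start_time_range: (1597098033.556944..1597098036.556944) }
]
current_matches = [
  { item_id: 1, start_time: 1597094395.9195982 },  # inside id 1's range
  { item_id: 2, start_time: 1597094395.1195982 },  # outside id 2's range
  { item_id: 6, start_time: 1597094395.1195982 }   # id never seen before
]

start_time_ranges_by_item_id = previous_matches_collection
  .group_by { |match| match[:item_id] }
  .transform_values { |matches| matches.map { |match| match[:start_time_range] } }
start_time_ranges_by_item_id.default = []

current_matches.select! do |match_row|
  item_id, start_time = match_row.values_at(:item_id, :start_time)
  start_time_ranges_by_item_id[item_id].any? { |range| range.cover?(start_time) }
end

current_matches  #=> [{:item_id=>1, :start_time=>1597094395.9195982}]
```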
Upvotes: 2
Reputation: 46409
Just create a Hash mapping each id in the previous matches to the time range(s) it started in. Then for each current match, do a fetch to get those ranges if the id is present, and test whether the start time meets the condition.
If previous_matches_collection has 1000 items and current_matches has 300, then this is about 1300 hash operations, each of which is O(1). This should scale better than your current solution.
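A sketch of that idea; I've grouped the ranges into arrays per id, since an id can appear several times in the previous matches (that grouping, and the toy data, are my additions):

```ruby
previous_matches_collection = [
  { item_id: 1, start_time_range: (10.0..13.0) },
  { item_id: 1, start_time_range: (20.0..23.0) },
  { item_id: 2, start_time_range: (30.0..33.0) }
]
current_matches = [
  { item_id: 1, start_time: 21.5 },   # covered by id 1's second range
  { item_id: 2, start_time: 40.0 },   # id known, time outside its range
  { item_id: 9, start_time: 21.5 }    # id never seen
]

# one pass to build the lookup: ~1000 O(1) hash writes
ranges_by_id = Hash.new { |h, k| h[k] = [] }
previous_matches_collection.each { |m| ranges_by_id[m[:item_id]] << m[:start_time_range] }

# one pass to filter: ~300 O(1) fetches, each followed by a few range checks
result = current_matches.select do |m|
  ranges_by_id.fetch(m[:item_id], []).any? { |r| r.cover?(m[:start_time]) }
end
result  #=> [{:item_id=>1, :start_time=>21.5}]
```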
Upvotes: 1