TomDunning

Reputation: 4877

Find matching items between two arrays with difficult match condition

We have two lists of events, each event with an id, a start_time, and a start_time_range. The start_time_range puts a tolerance around start_time so that near misses can be found.

The objective is to filter current_matches down to only those items that appear in the previous matches. An item "appears" in the list if its item_id matches and its start_time is within the range.

To achieve this I have this loop, but with our ever-increasing data size it is becoming very slow and I need to optimise it:

current_matches.select! do |match_row|
  previous_matches_collection.any? do |previous_match|
    previous_match[:item_id] == match_row[:item_id] &&
      previous_match[:start_time_range].include?(match_row[:start_time].to_f)
  end
end

If it were just the item_id I needed I could do this:

previous_ids = previous_matches_collection.collect{|i| i[:item_id] }
current_matches.select! do |match_row|
   previous_ids.include?(match_row[:item_id])
end

But I can't see a way of using that approach while matching the time condition in each item.

In terms of data, current_matches could be 300 items and previous_matches_collection could be 1k+. Is there a way of doing this without iterating through 300,000 combinations?

edit - Sample data:

previous_matches_collection = [
  { item_id: 1, start_time: 1597094395.1195982, start_time_range: (1597094393.6195982..1597094396.6195982) },
  { item_id: 1, start_time: 1597095083.116646, start_time_range: (1597095081.616646..1597095084.616646) },
  { item_id: 1, start_time: 1597095403.028223, start_time_range: (1597095401.528223..1597095404.528223) },
  { item_id: 2, start_time: 1597098035.056944, start_time_range: (1597098033.556944..1597098036.556944) },
  { item_id: 3, start_time: 1597096073.4109557, start_time_range: (1597096071.9109557..1597096074.9109557) },
  { item_id: 4, start_time: 1597094785.6987526, start_time_range: (1597094784.1987526..1597094787.1987526) },
  { item_id: 4, start_time: 1597098077.41271, start_time_range: (1597098075.91271..1597098078.91271) }
]


current_matches = [
  { item_id: 1, start_time: 1597094395.9195982 },
  { item_id: 1, start_time: 1597095085.116646 },
  { item_id: 1, start_time: 1597095404.228223 },
  { item_id: 2, start_time: 1597094395.1195982 },
  { item_id: 4, start_time: 1597094395.1195982 },
  { item_id: 6, start_time: 1597094395.1195982 },
  { item_id: 17, start_time: 1597094395.1195982 }
]
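For reference, running the loop from the question over this sample data keeps exactly two rows, the ones whose start_time falls inside a range for the same item_id. A self-contained check (the data is trimmed to just the fields the filter uses):

```ruby
# Sample data, keeping only the fields the filter actually reads.
previous_matches_collection = [
  { item_id: 1, start_time_range: (1597094393.6195982..1597094396.6195982) },
  { item_id: 1, start_time_range: (1597095081.616646..1597095084.616646) },
  { item_id: 1, start_time_range: (1597095401.528223..1597095404.528223) },
  { item_id: 2, start_time_range: (1597098033.556944..1597098036.556944) },
  { item_id: 3, start_time_range: (1597096071.9109557..1597096074.9109557) },
  { item_id: 4, start_time_range: (1597094784.1987526..1597094787.1987526) },
  { item_id: 4, start_time_range: (1597098075.91271..1597098078.91271) }
]

current_matches = [
  { item_id: 1,  start_time: 1597094395.9195982 },
  { item_id: 1,  start_time: 1597095085.116646 },
  { item_id: 1,  start_time: 1597095404.228223 },
  { item_id: 2,  start_time: 1597094395.1195982 },
  { item_id: 4,  start_time: 1597094395.1195982 },
  { item_id: 6,  start_time: 1597094395.1195982 },
  { item_id: 17, start_time: 1597094395.1195982 }
]

# The O(n * m) loop from the question.
current_matches.select! do |match_row|
  previous_matches_collection.any? do |previous_match|
    previous_match[:item_id] == match_row[:item_id] &&
      previous_match[:start_time_range].include?(match_row[:start_time].to_f)
  end
end

p current_matches
#=> [{:item_id=>1, :start_time=>1597094395.9195982},
#    {:item_id=>1, :start_time=>1597095404.228223}]
```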

Upvotes: 2

Views: 104

Answers (3)

Cary Swoveland

Reputation: 110675

h = previous_matches_collection.each_with_object({}) do |g,h|
  id = g[:item_id]
  h[id] = (h[id] || []) << g[:start_time_range]
end
  #=> {1=>[1597094393.6195982..1597094396.6195982,
  #        1597095081.616646..1597095084.616646,
  #        1597095401.528223..1597095404.528223],
  #    2=>[1597098033.556944..1597098036.556944],
  #    3=>[1597096071.9109557..1597096074.9109557],
  #    4=>[1597094784.1987526..1597094787.1987526,
  #        1597098075.91271..1597098078.91271]}   
current_matches.select do |g|
  id = g[:item_id]
  h.key?(id) && h[id].any? { |a| a.cover?(g[:start_time]) }
end
  #=> [{:item_id=>1, :start_time=>1597094395.919598},
  #    {:item_id=>1, :start_time=>1597095404.228223}] 

See Range#cover? and Enumerable#any?.

In the first expression, if h does not have the key id = g[:item_id], then h[id] = (h[id] || []) sets h[id] #=> [] (because (h[id] || []) => (nil || []) => []), after which h[id] << g[:start_time_range] is executed. One could alternatively write

h = previous_matches_collection.
    each_with_object(Hash.new { |h,k| h[k] = [] }) do |g,h|
  h[g[:item_id]] << g[:start_time_range]
end

This makes the object h an initially-empty hash with a default proc that executes h[k] = [] if h[k] is executed when h has no key k. See the third form of Hash::new.

Upvotes: 1

3limin4t0r

Reputation: 21120

One easy optimisation to make is to not use any? to find the correct id. Instead build a lookup hash, so all the previous_matches_collection elements with the correct id can be fetched in O(1).

Another optimization to make is to use cover? instead of include?. The difference is that cover? only compares the element with the begin and end of the range, while include? (for non-numeric ranges) uses succ (successor, e.g. "a".succ #=> "b") on the begin element to generate a collection through which it searches for the element.

("a".."z").include?("cc") #=> false
# is similar to:
# ["a", "b", "c", ..., "x", "y", "z"].include?("cc") #=> false

("a".."z").cover?("cc") #=> true
# is similar to:
# "a" <= "cc" && "cc" <= "z" #=> true

The code block above demonstrates the difference between the two. In your scenario you just want to know whether the value lies within the range, so cover? is the better fit and the faster option.

start_time_ranges_by_item_id = previous_matches_collection
  .group_by { |match| match[:item_id] }
  .transform_values { |matches| matches.map { |match| match[:start_time_range] } }
start_time_ranges_by_item_id.default = []

Now, with the start_time_ranges_by_item_id hash built, we can jump directly to the relevant ranges and run the checks from there.

current_matches.select! do |match_row|
  item_id, start_time = match_row.values_at(:item_id, :start_time)
  start_time_ranges = start_time_ranges_by_item_id[item_id]
  start_time_ranges.any? { |range| range.cover?(start_time) }
end

Upvotes: 2

btilly

Reputation: 46409

Just create a Hash mapping each previous match's id to the timestamp(s) it started at.

Then, for each current match, do a fetch to get the timestamp(s) if present, and test whether the time condition is met.

If previous_matches_collection has 1000 entries and current_matches has 300, that is about 1300 hash operations, each of which is O(1). This should scale better than your current solution.
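This idea can be sketched as follows (a rough Ruby sketch, not btilly's exact code; since an item_id can repeat in the sample data, each hash key holds an array of ranges rather than a single timestamp):

```ruby
# Minimal sample data (a subset of the question's, for illustration).
previous_matches_collection = [
  { item_id: 1, start_time_range: (1597094393.6..1597094396.6) },
  { item_id: 1, start_time_range: (1597095081.6..1597095084.6) },
  { item_id: 2, start_time_range: (1597098033.5..1597098036.5) }
]
current_matches = [
  { item_id: 1, start_time: 1597094395.9 }, # inside the first id-1 range
  { item_id: 1, start_time: 1597095085.1 }, # outside both id-1 ranges
  { item_id: 2, start_time: 1597098034.0 }, # inside the id-2 range
  { item_id: 6, start_time: 1597094395.9 }  # id never seen before
]

# Build the lookup once: id => array of ranges (ids can repeat).
lookup = Hash.new { |h, k| h[k] = [] }
previous_matches_collection.each do |prev|
  lookup[prev[:item_id]] << prev[:start_time_range]
end

# One O(1) hash probe per current match; fetch with a default so
# unseen ids don't add keys via the default proc.
filtered = current_matches.select do |row|
  lookup.fetch(row[:item_id], []).any? { |r| r.cover?(row[:start_time]) }
end

p filtered
#=> [{:item_id=>1, :start_time=>1597094395.9}, {:item_id=>2, :start_time=>1597098034.0}]
```

The per-id arrays still need a linear scan with any?, but only over the handful of ranges sharing that id, not the whole previous collection.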

Upvotes: 1
