Chris Burrus
Chris Burrus

Reputation: 187

Selecting greatest date range count in a rails array

I have a database with a bunch of deviceapi entries, that have a start_date and end_date (datetime in the schema) . Typically these entries no more than 20 seconds long (end_date - start_date). I have the following setup:

data = Deviceapi.all.where("start_date > ?", DateTime.now - 2.weeks)

I need to get the hour within data that had the highest number of Deviceapi entries. To make it a bit clearer, this was my latest try on it (code is approximated, don't mind typos):

runningtotal = 0
(2.weeks / 1.hour).to_i.times do |interval|
  current = data.select{ |d| d.start_time > (start_date + (1.hour * (interval - 1))) }.select{ |d| d.end_time < (start_date + (1.hour * interval)) }.count
  if current > runningtotal
    runningtotal = current
end

The problem: this code works just fine. So did about a dozen other incarnations of it, using .where, .select, SQL queries, etc. But it is too slow. Waaaaay too slow. Because it has to loop through every hour within 2 weeks. Then this method might need to be called itself dozens of times.

There has to be a faster way to do this, maybe a sort? I'm stumped, and I've been searching for hours with no luck. Any ideas?

Upvotes: 0

Views: 98

Answers (1)

khampson
khampson

Reputation: 15316

To get adequate performance, you'll want to do everything in a single query, which will mean avoiding ActiveRecord functionality and doing a raw query (e.g. via ActiveRecord::Base.connection.execute).

I have no way to test it, since I have neither your data nor schema, but I think something along these lines will do what you are looking for:

select y.starting_hour, max(y.num_entries) as max_entries
from
(
    select x.starting_hour, count(*) as num_entries
    from
    (
        select date_trunc('hour', start_time) starting_hour
        from deviceapi as d
    ) as x
    group by x.starting_hour
) as y
where y.num_entries = max(y.num_entries);

The logic of this is as follows, from the inner-most query out:

  1. "Bucket" each starting time to the hour
  2. From the resulting table of buckets, get the total number of entries in each bucket
  3. Get the maximum number of entries from that table, and then use that number to match back to get the starting_hour itself.

If there happen to be more than one bucket with the same number of entries, you could determine a consistent way to pick one -- say the min(starting_hour) or similar (since that would stay the same even as data gets added, assuming you are not deleting items).

If you wanted to limit the initial time slice -- I see 2 weeks referenced in your post -- you could do that in the inner-most query with a where clause bracketing the date range.

Upvotes: 1

Related Questions