Bill Kervaski
Bill Kervaski

Reputation: 603

SQL average time between first and second row in a set

Ouch! My heard hurts .. thought I had it a few times but epic fail :(

I have the following data, millions of rows, indexed, MySQL 5.6.

In this table, there are sets of data, and the uuid is basically a unique id for each set of data.

I need to find the AVERAGE in the data for each 1st and 2nd rows in each set. In other words, how much time past since the set was created with the first insert and the second insert for that same set and then the average of the results.

I can get the average no problem, I just can't seem to get my head around a way to just get the time difference between 1st and 2nd row in each set.

I'm not even going to embarrass myself and paste my broken SQL with my misguided attempts using sub-queries and LIMIT, suffice to say, this one escapes me.

Any help appreciated, beers on me :/

+------+-----------------------------------------+----------------------------+ | id | uuid | stamp | +------+-----------------------------------------+----------------------------+ | 707 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:00:28.000000 | | 708 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:01:30.000000 | | 709 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 14:11:48.000000 | | 710 | 59c8-60ee-d172-511a-a477-c637-6789-f14a | 2020-01-02 14:23:36.000000 | | 711 | b33b-7584-1fed-e138-28ba-c24a-9b46-88e7 | 2020-01-02 14:24:07.000000 | | 712 | eddc-b12a-5ef2-baea-cf53-7287-5805-d922 | 2020-01-02 14:24:26.000000 | | 713 | 257b-fc66-6d7a-ba21-727e-1da7-0ee1-714c | 2020-01-02 14:25:31.000000 | | 718 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:46:41.000000 | | 719 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 15:55:42.000000 | | 720 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:56:33.000000 | | 722 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:16:14.000000 | | 723 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:21:25.000000 | | 726 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:16:33.000000 | | 727 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:21:20.000000 | | 728 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:45:07.000000 | | 729 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:50:17.000000 | | 730 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:14:02.000000 | | 731 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:27:48.000000 | | 732 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:28:57.000000 | | 733 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:40.000000 | | 734 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:49.000000 |

Upvotes: 1

Views: 143

Answers (2)

fifonik
fifonik

Reputation: 1606

Another way is use LEFT JOINs instead of sub-query.

SELECT
      t.uuid
    , t.stamp AS t_stamp
    , t_next.stamp AS t_next_stamp
    , TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp)) AS diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid  = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
            t_next.uuid  = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid  = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp 
    )
WHERE
        t_prev.id IS NULL -- no t_prev so t is the first record
    AND t_before_next.id IS NULL -- no t_before_next so t_next is the second record
    -- filter data by your criteria, per day for example.
    -- you will need to "duplicate" filtering conditions for t_prev and t_next
ORDER BY
    uuid

=>

uuid    t_stamp t_next_stamp    diff
0ccf-94e0-ce72-8092-1975-5bea-6131-c719 2020-01-02 14:11:48 2020-01-02 15:55:42 6234
60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 2020-01-01 17:00:28 2020-01-01 17:01:30 62
6610-a9df-358d-0065-beb8-cea1-82a6-3258 2020-01-02 17:16:33 2020-01-02 17:21:20 287
c193-a46f-1104-3ee3-7387-94a8-ef32-a85e 2020-01-02 18:40:40 2020-01-02 18:40:49 9
c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c 2020-01-02 15:46:41 2020-01-02 15:56:33 592

Caveat:

The query above will miss records with the same stamp. If you need them, you have to make changes in join conditions:

from

t_prev.stamp < t.stamp

to

t_prev.stamp <= t.stamp AND t_prev.id < t.id

etc

Then you can use the query to get AVG:

-- explain
SELECT
    AVG(TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp))) AS avg_diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid  = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
            t_next.uuid  = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid  = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp 
    )
WHERE
        t_prev.id IS NULL
    AND t_before_next.id IS NULL

=> 1436.8000 (for your data set)

Explain with composite index (uuid, stamp):

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   SIMPLE  t   NULL    index   ix_uuid_stamp   ix_uuid_stamp   49  NULL    21  100.00  Using where; Using index
1   SIMPLE  t_prev  NULL    ref ix_uuid_stamp   ix_uuid_stamp   43  test.t.uuid 2   10.00   Using where; Not exists; Using index
1   SIMPLE  t_next  NULL    ref ix_uuid_stamp   ix_uuid_stamp   43  test.t.uuid 2   33.33   Using where; Using index
1   SIMPLE  t_before_next   NULL    ref ix_uuid_stamp   ix_uuid_stamp   43  test.t.uuid 2   10.00   Using where; Not exists; Using index

"ref" used instead of "depended sub-query" as in accepted answer. What is better depend on your data. If filtered data set (when you filter records by day) is small, "depended sub query" would be faster. On big filtered data sets I'd prefer to have "ref".

Feel free to test both ways and let us know what is faster in your case.

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269883

If a user id only appears twice, then this is trivial. You have millions of rows, so let's try to avoid a sort and assume you have the right indexes.

Here is one way to get the earliest two rows:

select t.*
from t
where t.stamp <= (select t2.stamp
                  from t t2
                  where t2.uuid = t.uuid
                  order by t2.stamp asc
                  limit 1,1
                 );

Very important: You want an index on (uuid, stamp) for any hope of performance.

Then, just aggregate:

select uuid, timestampdiff(second, min(stamp), max(stamp))
from (select t.*
      from t
      where t.stamp <= (select t2.stamp
                        from t t2
                        where t2.uuid = t.uuid
                        order by t2.stamp asc
                        limit 1,1
                       )
     ) t
group by uuid;

Upvotes: 2

Related Questions