Chris R
Chris R

Reputation: 17896

I need the most recent record in a join (PostgreSQL)

I have a table like so:

call_activity (
    call_id TEXT, 
    activity_type TEXT,
    activity_time TIMESTAMP,
    PRIMARY KEY(call_id, activity_type, activity_time)
)

activity_type may be one of about 9 different strings:

 'started'
 'completed' (about 5 variations on this)
 'other' (these are the states that I want to display)

A call has a series of events, starting with 'started' and culminating in a completed event (one of 5 or so possible ones). I need a view of this with two columns: The first column must be the activity time of a call's 'started' event, the second column must be the most recent event for that call. This view must only have calls for which there is no completed event.

I have a set of nested joins on it, but they're slow as hell. I need a reasonably optimal view of this. Can anyone help me?

Upvotes: 0

Views: 272

Answers (4)

Jonathan Hall
Jonathan Hall

Reputation: 79536

This solution I tested without any indexing, and on an incredibly small data set, so it will require some tweaking for your environment. You'll want an index on at least call_id (duh!) and activity_type. It also uses a custom aggregate function, LAST() (I use a similar FIRST() function in many of my own projects).

CREATE OR REPLACE FUNCTION slast(anyelement,anyelement) RETURNS anyelement AS $$
    SELECT $2
$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE AGGREGATE last (
    sfunc = slast,
    basetype = anyelement,
    stype = anyelement
);

CREATE VIEW current_calls AS
SELECT min(activity_time) AS call_started,last(activity_type) AS current_activity
FROM (
    SELECT call_id,activity_time,activity_type
    FROM call_activity
    WHERE call_id NOT IN (SELECT call_id FROM call_activity WHERE activity_type='completed')
    ORDER BY activity_time
) AS x
GROUP BY call_id;

I make no guess whether this performs better or worse than some of the other proposals offered. I prefer it because (to me) it is more readable. But readability really must take a backseat to performance in things like this.

Upvotes: 1

Ferran
Ferran

Reputation: 14993

A solution without joins, using the CASE statement and grouping

    select call_id , 
       min(case when activity_type = 'started' then activity_time
                else null 
           end) as timestamp_started,
      max(activity_time) as timestamp_other
from call_activity
group by call_id
having 
       sum(case when activity_type = 'completed-1' then 1
                when activity_type = 'completed-2' then 1
                else 0
           end) = 0

Upvotes: 0

van
van

Reputation: 76952

Change datatypes for your IDs, and try something like this (add 'started' to the list of others filter if you would like to include 'started' in the latest non-completed activity as well):

SELECT      ca_s.activity_time AS timestamp_started,
            ca_o.activity_time AS timestamp_other
FROM        call_activity ca_s
LEFT  JOIN  call_activity ca_o
        ON  ca_s.call_id = ca_o.call_id
        AND ca_o.activity_type IN ('other-1', 'other2-2', ...)
LEFT  JOIN  call_activity ca_c
        ON  ca_s.call_id = ca_c.call_id
        AND ca_s.activity_type IN ('completed-1', 'completed-2', ...)
WHERE       ca_s.activity_type = 'started'
        AND ca_c.call_id IS NULL --// no complete events

Upvotes: 0

Ants Aasma
Ants Aasma

Reputation: 54872

To do this the db has to at least find all the started calls and find if any completed activities exist for it. Assuming that not-completed is a small set, fetching the most recent activity can be done as a subquery. Here's a query that does this:

SELECT c_started.call_id, c_started.activity_id AS started_time,
        (SELECT MAX(c_recent.activity_time)
            FROM call_activity AS c_recent
            WHERE c_recent.call_id = c_started.call_id) AS recent_activity
    FROM call_activity AS c_started
        LEFT JOIN call_activity AS c_completed
            ON c_started.call_id = c_completed.call_id
                AND c_completed.activity_type IN ('completed 1' 'completed 2', ...)
    WHERE c_started.activity_type = 'started'
        AND c_completed.call_id IS NULL;

If you can add indexes, the first choice would be a partial index on call_id where activity_type is in the completed events (the same check as in the join condition). Another would be an index on activity_type, possibly partial with only 'started' events to speed up the initial scan. Finally a call_id, activity_time index would speed up the subquery if you have lots of events per call. You can also get that if you reorder activity_type and activity_time in the primary key.

To make this fast, I would create an active_calls table with only a call_id column and add an insert trigger on call_activity to insert into active_calls if 'started' is inserted and delete if 'completed' is inserted.

Upvotes: 0

Related Questions