zaknotzach

Reputation: 1005

SQLite query WHERE with OUTER JOIN

I am a bit rusty with my SQL and am running into a little issue with a query. Our application has two tables relevant to this problem: entries, and for each entry, N process steps.

We are trying to optimize our querying, so instead of asking for all entries all the time, we just ask for entries that were updated after we last checked. There can be a lot of steps, so this query is just supposed to return the entries and some step summary data, and we can separately query for steps if needed.

The entry start time and updated time are calculated from the first and most recent process step time respectively. We also have to group together entry statuses.

Here's the query as we build it in Python, since it seems easier to read:

statement = 'SELECT e.serial_number, ' + \
                   'e.description, ' + \
                   'min(p.start_time) begin_time, ' + \
                   'group_concat(p.status) status, ' + \
                   'max(p.last_updated) last_updated ' + \
            'FROM entries e ' + \
            'LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number'

# if the user provides a "since" date, only return entries updated after
# that date
if since is not None:
    statement += ' WHERE last_updated > "{0}"'.format(since)

statement += ' GROUP BY e.serial_number'

The issue we are having is that if we apply that WHERE clause, it filters the process steps too. So for example if we have this situation with two entries:

Entry: 123 foo
Steps:
    1. start time 10:00, updated 10:30, status completed
    2. start time 11:00, updated 11:30, status completed
    3. start time 12:00, updated 12:30, status failed
    4. start time 13:00, updated 13:30, status in_progress
Entry: 321 bar
Steps:
    1. start time 01:00, updated 01:30, status completed
    2. start time 02:00, updated 02:30, status completed

If we query without the WHERE clause, we get all entries. So for this case it would return:

321, bar, 01:00, "completed,completed", 02:30
123, foo, 10:00, "completed,completed,failed,in_progress", 13:30

If I pass a since time of 12:15, it returns only this:

123, foo, 12:00, "failed,in_progress", 13:30

In that result, the start time comes from step 3, and the statuses are only from steps 3 and 4. What I'm looking for is the whole entry:

123, foo, 10:00, "completed,completed,failed,in_progress", 13:30

So basically, I want to filter the final results based on that last_updated value, but it is currently filtering the join results as well, which throws off the begin_time, last_updated and status values since they are calculated with a partial set of steps. Any ideas how to modify the query to get what I want here?

Edit:

It seems like there might be some naming issues here too. The names I used in the example code are equal to or similar to what we actually have in our code. If we change max(p.last_updated) last_updated to max(p.last_updated) max_last_updated, and change the WHERE clause to use max_last_updated as well, we get OperationalError: misuse of aggregate: max(). We have also tried adding AS keywords to the aliases, which made no difference.

Upvotes: 1

Views: 60

Answers (2)

Gordon Linoff

Reputation: 1269563

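You can do this with a HAVING clause. WHERE filters individual rows before grouping, so the aggregates only see the steps that pass the filter; HAVING filters the groups after the aggregates have been computed over all steps: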

SELECT . . .
FROM entries e LEFT JOIN
     process_steps ps
     ON e.serial_number = ps.serial_number
GROUP BY e.serial_number
HAVING MAX(ps.last_updated) > <your value here>;
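
As a rough illustration, here is a minimal sketch of the HAVING approach using Python's sqlite3 module and the sample data from the question. The table layout, the zero-padded HH:MM time strings, and the helper names build_demo_db and entries_updated_since are assumptions made for this example, not the asker's actual schema. It also passes the since value as a ? parameter instead of formatting it into the string.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding the question's sample data.

    The column layout is assumed for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entries (serial_number TEXT PRIMARY KEY, description TEXT);
        CREATE TABLE process_steps (
            serial_number TEXT, start_time TEXT, last_updated TEXT, status TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?)",
        [("123", "foo"), ("321", "bar")],
    )
    conn.executemany(
        "INSERT INTO process_steps VALUES (?, ?, ?, ?)",
        [
            ("123", "10:00", "10:30", "completed"),
            ("123", "11:00", "11:30", "completed"),
            ("123", "12:00", "12:30", "failed"),
            ("123", "13:00", "13:30", "in_progress"),
            ("321", "01:00", "01:30", "completed"),
            ("321", "02:00", "02:30", "completed"),
        ],
    )
    return conn


def entries_updated_since(conn, since=None):
    """Return one summary row per entry, optionally filtered by last update."""
    statement = (
        "SELECT e.serial_number, e.description, "
        "min(p.start_time) AS begin_time, "
        "group_concat(p.status) AS status, "
        "max(p.last_updated) AS last_updated "
        "FROM entries e "
        "LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number "
        "GROUP BY e.serial_number"
    )
    params = ()
    if since is not None:
        # HAVING runs after grouping, so the aggregates still cover every step.
        statement += " HAVING max(p.last_updated) > ?"
        params = (since,)
    return conn.execute(statement, params).fetchall()
```

Calling entries_updated_since(build_demo_db(), "12:15") should return only entry 123, with begin_time and status computed from all four of its steps. Two caveats: SQLite does not guarantee the order of values inside group_concat, and an entry with no steps has a NULL max(p.last_updated), so it is dropped whenever a since value is given.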

Upvotes: 1

Vasya

Reputation: 469

Use a subquery to first select the serial numbers of entries that have an updated step, then join all steps for those entries:

SELECT whatever you need FROM entries e
LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number
WHERE e.serial_number IN (SELECT DISTINCT serial_number FROM process_steps
                          WHERE last_updated > 'date here')
GROUP BY e.serial_number
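
For comparison, a minimal sketch of the subquery approach run from Python's sqlite3 module against the question's sample data. The schema, the HH:MM time strings, and the helper names make_db and updated_entries_via_subquery are assumptions for illustration; the date is bound as a ? parameter rather than quoted into the SQL text.

```python
import sqlite3

SUBQUERY_SQL = """
SELECT e.serial_number, e.description,
       min(p.start_time)      AS begin_time,
       group_concat(p.status) AS status,
       max(p.last_updated)    AS last_updated
FROM entries e
LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number
WHERE e.serial_number IN (
    -- Only the choice of entries is filtered here, not the joined steps.
    SELECT serial_number FROM process_steps WHERE last_updated > ?
)
GROUP BY e.serial_number
"""


def make_db():
    """Build an in-memory database with the question's two entries (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entries (serial_number TEXT PRIMARY KEY, description TEXT);
        CREATE TABLE process_steps (
            serial_number TEXT, start_time TEXT, last_updated TEXT, status TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?)",
        [("123", "foo"), ("321", "bar")],
    )
    conn.executemany(
        "INSERT INTO process_steps VALUES (?, ?, ?, ?)",
        [
            ("123", "10:00", "10:30", "completed"),
            ("123", "11:00", "11:30", "completed"),
            ("123", "12:00", "12:30", "failed"),
            ("123", "13:00", "13:30", "in_progress"),
            ("321", "01:00", "01:30", "completed"),
            ("321", "02:00", "02:30", "completed"),
        ],
    )
    return conn


def updated_entries_via_subquery(conn, since):
    """Return full summaries for entries with at least one step updated after `since`."""
    return conn.execute(SUBQUERY_SQL, (since,)).fetchall()
```

With a since value of "12:15", this should return only entry 123, summarized over all four steps, because the WHERE clause restricts which entries are kept while the join still brings in every step for them. The DISTINCT from the query above is left out here since IN already ignores duplicates.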

Upvotes: 2
