BrotherBarnabas
BrotherBarnabas

Reputation: 353

Complex SQL Query with a Pivot

I have 3 tables, users, trips and tripDetails. When the user creates a trip a row with the following fields is created in the trips table:

id(INT), user_id(INT) & dateCreated(DATE)

and 4 rows are created in the tripDetails table:

id(INT), trip_id(INT), field(VARCHAR) & value(VARCHAR) 

where field is either 'smartbox.destination', 'smartbox.dateFrom', 'smartbox.dateTo', 'smartbox.numberOfPeople' and the value column is a varchar. However, let's say the user changes the destination and saves this change, a new record is created in the trips table and ONLY ONE record is created in the tripDetails table (the updated destination)

Now I want to create a select which would give me a snapshot of a user's trip for a given day with the column headers:

user_id, trip_id, destination, dateFrom, dateTo, numberOfPeople, givenDay(DATE)

such that if one field had been changed on that given day, all the other columns would show what their most recent value was, relative to that day.

I've set up a sqlfiddle here

Upvotes: 1

Views: 253

Answers (1)

peter.hrasko.sk
peter.hrasko.sk

Reputation: 4141

First of all, allow me to say: You have some seriously flawed data model with that "overriding key/value pair" way of dealing with data.

And now for the solution to your problem. Assuming that

  • the tripDetails.value column is declared not null,
  • your SQL client asks you for the value of givenDate,
  • you supply your given date to the query in (exact) format yyyy-mm-dd,

your query could look like

with pivot$ as (
    select
        U.id as user_id, T.id as trip_id, max(T.dateCreated) as trip_date,
        max(decode(TD.field, 'smartbox.destination', TD.value)) as trip_destination,
        max(decode(TD.field, 'smartbox.dateFrom', TD.value)) as trip_date_from,
        max(decode(TD.field, 'smartbox.dateTo', TD.value)) as trip_date_to,
        max(decode(TD.field, 'smartbox.numberOfPeople', TD.value)) as trip_no_of_people
    from users U
        join trips T
            on T.user_id = U.id
        join tripDetails TD
            on TD.trip_id = T.id
            and TD.field in ('smartbox.destination', 'smartbox.dateFrom', 'smartbox.dateTo', 'smartbox.numberOfPeople')
    where T.dateCreated <= date'&givenDate'
    group by U.id, T.id
),
resolve_versioning$ as (
    select user_id, trip_id, trip_date,
        first_value(trip_destination) ignore nulls over (partition by user_id order by trip_date desc rows between current row and unbounded following) as trip_destination,
        first_value(trip_date_from) ignore nulls over (partition by user_id order by trip_date desc rows between current row and unbounded following) as trip_date_from,
        first_value(trip_date_to) ignore nulls over (partition by user_id order by trip_date desc rows between current row and unbounded following) as trip_date_to,
        first_value(trip_no_of_people) ignore nulls over (partition by user_id order by trip_date desc rows between current row and unbounded following) as trip_no_of_people,
        row_number() over (partition by user_id order by trip_date desc) as relevance$
    from pivot$
)
select user_id, trip_id,
    trip_destination, trip_date_from, trip_date_to, trip_no_of_people,
    date'&givenDate' as given_date
from resolve_versioning$
where relevance$ <= 1
;

This, in three steps, does:

  • The pivot$ subquery denormalizes your key/value pairs to wider rows, with trip_id as the dataset's logical primary key, effectively leaving columns NULL when there's no key/value pair for that trip_id. (btw., this is where the non-nullability of the tripDetails.value column is crucial for success of the query)
  • The resolve_versioning$ subquery takes advantage of the first_value() analytic function, working over each individual trip detail of all trips for a user (partition by user_id), finding the first (first_value) non-NULL (ignore nulls) value of the respective trip detail, searching from the "youngest" trip date back to older days (order by trip_date desc) ... Or, if you look at it the other way round, it looks for the last non-NULL value of the trip detail in an ordering by the trip date.
  • That rows between current row and unbounded following is a kind of "magic", necessary to correctly process the window for the particular analytical order by. (Read here for an explanation.)
  • The whole row_number() over (partition by user_id order by trip_date desc) simply numbers all resulting rows from 1 upwards, where 1 is assigned to the "youngest" row in the ordering by the trip date. Then, in the outermost select, the whole result is filtered to show only the youngest row (relevance$ <= 1).

Enjoy!

Upvotes: 1

Related Questions