Matt

Reputation: 931

SQL / Oracle Aggregation Buckets Between Dates

I have a SQL-related question I would love some help with, as a suitable answer has eluded me for some time.

Background

I’m working with a vendor product that uses an Oracle database as its backend. I can write any ad hoc SQL against the underlying tables, but I cannot change their structure (or the data model itself). The table I’m interested in currently has about 1M rows and essentially tracks user sessions. It has 4 columns of interest: session_id (a primary key, unique per session), user_name, start_date (the date the session began), and stop_date (the date the session ended).

My goal is to aggregate active sessions by month, day, and hour, given a set start date and end date. I need to create a view (or 3 separate views) that can either perform the aggregation itself or serve as an intermediate object I can then query and aggregate over. I understand the eventual SQL/view may actually need to be 3 different views (one for month, one for day, one for hour), but it seems to me that the concept, once achieved, should be the same regardless of the time period.

Current table example

Table Name = web_session

| session_id | user_name | start_date            | stop_date             |
|------------|-----------|-----------------------|-----------------------|
|     1      |    joe    | 4/20/2017 10:42:10 PM | 4/21/2017 2:42:10 AM  |
|     2      |   matt    | 4/20/2017 5:43:10 PM  | 4/20/2017 5:59:10 PM  |
|     3      |   matt    | 4/20/2017 3:42:10 PM  | 4/20/2017 5:42:10 PM  |
|     4      |    joe    | 4/20/2017 11:20:10 AM | 4/20/2017 4:42:10 PM  |
|     5      |   john    | 4/20/2017 8:42:10 AM  | 4/20/2017 11:42:10 AM |
|     6      |   matt    | 4/20/2017 7:42:10 AM  | 4/20/2017 11:42:10 PM |
|     7      |    joe    | 4/19/2017 11:20:10 PM | 4/20/2017 1:42:10 AM  |

Ideal Output For Hour View

Note: midnight (12:00 AM) can be represented as either hour 0 or 24 for this example.

| Date           | HR   | active_sessions | distinct_users |
------------------------------------------------------------
| 4/21/2017      | 2    | 1               | 1              |
| 4/21/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/20/2017      | 23   | 1               | 1              |
| 4/20/2017      | 22   | 1               | 1              |
| 4/20/2017      | 17   | 2               | 1              |
| 4/20/2017      | 16   | 2               | 2              |
| 4/20/2017      | 15   | 2               | 2              |
| 4/20/2017      | 14   | 1               | 1              |
| 4/20/2017      | 13   | 1               | 1              |
| 4/20/2017      | 12   | 1               | 1              |
| 4/20/2017      | 11   | 3               | 3              |
| 4/20/2017      | 10   | 2               | 2              |
| 4/20/2017      | 9    | 2               | 2              |
| 4/20/2017      | 8    | 2               | 2              |
| 4/20/2017      | 7    | 1               | 1              |
| 4/20/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/19/2017      | 23   | 1               | 1              |

End Goal and Other Options

What I am eventually trying to achieve with this output is to populate a line chart displaying the number of active sessions for either a month, day, or hour (used in the example output) between two dates. In the hour example, the date combined with the HR would run along the X-axis and the active sessions along the Y-axis. The distinct user count would be shown when a user hovers over a point on the chart. For clarity: active sessions are the total number of sessions that were open at any point during the interval, and distinct users are the total number of distinct users during the interval. If I logged on and off twice in the same hour, that would be 2 active sessions but only 1 distinct user.
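To make the counting rules above concrete, here is a small Python sketch of the same logic (this is not the Oracle solution, just an illustration; the two sample rows are taken from the table above). A session counts in every hour bucket its [start, stop] interval touches; distinct_users de-duplicates by user_name within the bucket:

```python
# Conceptual sketch of the per-hour counting rules -- not Oracle SQL.
from datetime import datetime, timedelta

sessions = [
    # (session_id, user_name, start, stop) -- rows 2 and 3 from the example table
    (2, "matt", datetime(2017, 4, 20, 17, 43, 10), datetime(2017, 4, 20, 17, 59, 10)),
    (3, "matt", datetime(2017, 4, 20, 15, 42, 10), datetime(2017, 4, 20, 17, 42, 10)),
]

def hourly_counts(sessions):
    buckets = {}  # hour bucket -> list of (session_id, user_name)
    for sid, user, start, stop in sessions:
        # truncate the start down to the top of its hour
        hour = start.replace(minute=0, second=0, microsecond=0)
        while hour <= stop:  # visit every hour the session overlaps
            buckets.setdefault(hour, []).append((sid, user))
            hour += timedelta(hours=1)
    return {
        h: (len(rows), len({u for _, u in rows}))  # (active_sessions, distinct_users)
        for h, rows in buckets.items()
    }

counts = hourly_counts(sessions)
# In hour 17 both of matt's sessions are active: 2 active sessions, 1 distinct user.
```

This is the behavior the ideal output table shows for hour 17 on 4/20: two overlapping sessions by the same user yield active_sessions = 2 but distinct_users = 1.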

Alternative Solutions

This seems to be a problem that must have come up many times before, but from all of my Googling and Stack Overflow research I cannot find the correct approach. If I am thinking about the query or ideal output incorrectly, I AM OPEN TO ALTERNATE SUGGESTIONS that allow me to get the desired output to populate the chart appropriately on the front end.

Some SQL I Have Tried (Good Faith Effort)

There are many queries I've tried, but I'll start with this one, as it is the closest I've gotten. It is extremely slow (unusably so), and it still does not produce the result I need.

SELECT * FROM (
  SELECT u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt,
         COUNT(DISTINCT u.session_id) AS unique_sessions,
         COUNT(DISTINCT u.user_name) AS unique_users,
         LISTAGG(u.user_name, ', ') WITHIN GROUP (ORDER BY u.user_name ASC) AS users
  FROM (
    SELECT EXTRACT(YEAR FROM l.start_date) AS YearDt,
           EXTRACT(MONTH FROM l.start_date) AS MonthDt,
           EXTRACT(DAY FROM l.start_date) AS DayDt,
           EXTRACT(HOUR FROM CAST(l.start_date AS TIMESTAMP)) AS HourDt,
           EXTRACT(MINUTE FROM CAST(l.start_date AS TIMESTAMP)) AS MinDt,
           l.session_id,
           l.user_name,
           l.start_date AS act_date,
           1 AS is_start
    FROM   web_session l
    UNION ALL
    SELECT EXTRACT(YEAR FROM l.stop_date) AS YearDt,
           EXTRACT(MONTH FROM l.stop_date) AS MonthDt,
           EXTRACT(DAY FROM l.stop_date) AS DayDt,
           EXTRACT(HOUR FROM CAST(l.stop_date AS TIMESTAMP)) AS HourDt,
           EXTRACT(MINUTE FROM CAST(l.stop_date AS TIMESTAMP)) AS MinDt,
           l.session_id,
           l.user_name,
           l.stop_date AS act_date,
           0 AS is_start
    FROM   web_session l
  ) u
  GROUP BY CUBE ( u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt )
) c

Upvotes: 3

Views: 2060

Answers (3)

MT0

Reputation: 167962

You can use a CTE (Query 1) or a correlated hierarchical query (Query 2) to generate the hours within the time ranges and then aggregate. This only requires a single table scan:

SQL Fiddle

Oracle 11g R2 Schema Setup:

CREATE TABLE Web_Session ( Session_id, user_name, start_date, stop_date ) AS
SELECT 1, 'joe',  CAST( TIMESTAMP '2017-04-20 22:42:10' AS DATE ), CAST( TIMESTAMP '2017-04-21 02:42:10' AS DATE ) FROM DUAL UNION ALL
SELECT 2, 'matt', TIMESTAMP '2017-04-20 17:43:10', TIMESTAMP '2017-04-20 17:59:10' FROM DUAL UNION ALL
SELECT 3, 'matt', TIMESTAMP '2017-04-20 15:42:10', TIMESTAMP '2017-04-20 17:42:10' FROM DUAL UNION ALL
SELECT 4, 'joe',  TIMESTAMP '2017-04-20 11:20:10', TIMESTAMP '2017-04-20 16:42:10' FROM DUAL UNION ALL
SELECT 5, 'john', TIMESTAMP '2017-04-20 08:42:10', TIMESTAMP '2017-04-20 11:42:10' FROM DUAL UNION ALL
SELECT 6, 'matt', TIMESTAMP '2017-04-20 07:42:10', TIMESTAMP '2017-04-20 23:42:10' FROM DUAL UNION ALL
SELECT 7, 'joe',  TIMESTAMP '2017-04-19 23:20:10', TIMESTAMP '2017-04-20 01:42:10' FROM DUAL;

Query 1:

WITH hours ( session_id, user_name, hour, duration ) AS (
  SELECT session_id,
         user_name,
         CAST( TRUNC( start_date, 'HH24' ) AS DATE ),
         ( TRUNC( stop_date, 'HH24' ) - TRUNC( start_date, 'HH24' ) ) * 24
  FROM   web_session
UNION ALL
  SELECT session_id,
         user_name,
         hour + INTERVAL '1' HOUR, -- There is a bug in SQLFiddle that subtracts
                                   -- hours instead of adding so -1 is used there.
         duration - 1
  FROM   hours
  WHERE  duration > 0
)
SELECT hour,
       COUNT( session_id ) AS active_sessions,
       COUNT( DISTINCT user_name ) AS distinct_users
FROM   hours
GROUP BY hour
ORDER BY hour

Results:

|                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z |               1 |              1 |
| 2017-04-20T00:00:00Z |               1 |              1 |
| 2017-04-20T01:00:00Z |               1 |              1 |
| 2017-04-20T07:00:00Z |               1 |              1 |
| 2017-04-20T08:00:00Z |               2 |              2 |
| 2017-04-20T09:00:00Z |               2 |              2 |
| 2017-04-20T10:00:00Z |               2 |              2 |
| 2017-04-20T11:00:00Z |               3 |              3 |
| 2017-04-20T12:00:00Z |               2 |              2 |
| 2017-04-20T13:00:00Z |               2 |              2 |
| 2017-04-20T14:00:00Z |               2 |              2 |
| 2017-04-20T15:00:00Z |               3 |              2 |
| 2017-04-20T16:00:00Z |               3 |              2 |
| 2017-04-20T17:00:00Z |               3 |              1 |
| 2017-04-20T18:00:00Z |               1 |              1 |
| 2017-04-20T19:00:00Z |               1 |              1 |
| 2017-04-20T20:00:00Z |               1 |              1 |
| 2017-04-20T21:00:00Z |               1 |              1 |
| 2017-04-20T22:00:00Z |               2 |              2 |
| 2017-04-20T23:00:00Z |               2 |              2 |
| 2017-04-21T00:00:00Z |               1 |              1 |
| 2017-04-21T01:00:00Z |               1 |              1 |
| 2017-04-21T02:00:00Z |               1 |              1 |

Execution Plan:

-------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name        | Rows | Bytes | Cost | Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |             |   14 |   364 |    7 | 00:00:01 |
|   1 |   SORT GROUP BY                                |             |   14 |   364 |    7 | 00:00:01 |
|   2 |    VIEW                                        | VW_DAG_0    |   14 |   364 |    7 | 00:00:01 |
|   3 |     HASH GROUP BY                              |             |   14 |   364 |    7 | 00:00:01 |
|   4 |      VIEW                                      |             |   14 |   364 |    6 | 00:00:01 |
|   5 |       UNION ALL (RECURSIVE WITH) BREADTH FIRST |             |      |       |      |          |
|   6 |        TABLE ACCESS FULL                       | WEB_SESSION |    7 |   245 |    3 | 00:00:01 |
| * 7 |        RECURSIVE WITH PUMP                     |             |      |       |      |          |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 7 - filter("DURATION">0)

Note
-----
- dynamic sampling used for this statement

Query 2:

SELECT t.COLUMN_VALUE AS hour,
       COUNT( session_id ) AS active_sessions,
       COUNT( DISTINCT user_name ) AS distinct_users
FROM   web_session w
       CROSS JOIN
       TABLE(
         CAST(
           MULTISET(
             SELECT TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24
             FROM   DUAL
             CONNECT BY TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24 < w.stop_date
           ) AS SYS.ODCIDATELIST
         )
       ) t
GROUP BY t.COLUMN_VALUE
ORDER BY hour

Results:

|                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z |               1 |              1 |
| 2017-04-20T00:00:00Z |               1 |              1 |
| 2017-04-20T01:00:00Z |               1 |              1 |
| 2017-04-20T07:00:00Z |               1 |              1 |
| 2017-04-20T08:00:00Z |               2 |              2 |
| 2017-04-20T09:00:00Z |               2 |              2 |
| 2017-04-20T10:00:00Z |               2 |              2 |
| 2017-04-20T11:00:00Z |               3 |              3 |
| 2017-04-20T12:00:00Z |               2 |              2 |
| 2017-04-20T13:00:00Z |               2 |              2 |
| 2017-04-20T14:00:00Z |               2 |              2 |
| 2017-04-20T15:00:00Z |               3 |              2 |
| 2017-04-20T16:00:00Z |               3 |              2 |
| 2017-04-20T17:00:00Z |               3 |              1 |
| 2017-04-20T18:00:00Z |               1 |              1 |
| 2017-04-20T19:00:00Z |               1 |              1 |
| 2017-04-20T20:00:00Z |               1 |              1 |
| 2017-04-20T21:00:00Z |               1 |              1 |
| 2017-04-20T22:00:00Z |               2 |              2 |
| 2017-04-20T23:00:00Z |               2 |              2 |
| 2017-04-21T00:00:00Z |               1 |              1 |
| 2017-04-21T01:00:00Z |               1 |              1 |
| 2017-04-21T02:00:00Z |               1 |              1 |

Execution Plan:

--------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name        | Rows  | Bytes   | Cost | Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |             | 57176 | 2115512 |  200 | 00:00:03 |
|   1 |   SORT GROUP BY                        |             | 57176 | 2115512 |  200 | 00:00:03 |
|   2 |    NESTED LOOPS                        |             | 57176 | 2115512 |  195 | 00:00:03 |
|   3 |     TABLE ACCESS FULL                  | WEB_SESSION |     7 |     245 |    3 | 00:00:01 |
|   4 |     COLLECTION ITERATOR SUBQUERY FETCH |             |  8168 |   16336 |   27 | 00:00:01 |
| * 5 |      CONNECT BY WITHOUT FILTERING      |             |       |         |      |          |
|   6 |       FAST DUAL                        |             |     1 |         |    2 | 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(TRUNC(:B1,'fmhh24')+(LEVEL-1)/24<:B2)

Note
-----
- dynamic sampling used for this statement

Upvotes: 2

Sentinel

Reputation: 6449

Matt,

What you need to do is generate a time dimension either as a static table or dynamically at run time:

create table time_dim (
  ts date primary key,
  year number not null,
  month number not null,
  day number not null,
  wday number not null,
  dy varchar2(3) not null,
  hr number not null
);

insert into time_dim (ts, year, month, day, wday, dy, hr)
select ts
     , extract(year from ts) year
     , extract(month from ts) month
     , extract(day from ts) day
     , to_char(ts,'d') wday
     , to_char(ts,'dy') dy
     , to_number(to_char(ts,'HH24')) hr
  from (
select DATE '2017-01-01' + (level - 1)/24 ts
  FROM DUAL connect by level <= 365*24) a;

Then outer join that to your web_session table:

select t.ts, t.year, t.month, t.wday, t.dy, t.hr
     , count(session_id) sessions
     , count(distinct user_name) users
  from time_dim t
  left join web_session w
    on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
 where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
 group by rollup (t.year, t.month, (t.wday, t.dy), (t.hr, t.ts));

You can change up the group by clause to get the various aggregates you're interested in.

In the above code, I'm truncating start_date to the hour in the ON clause so that the start hour is included in the results; otherwise, sessions that don't start exactly at the top of the hour would not be counted in their first hour.
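The effect of that truncation can be checked with a quick sketch (plain Python standing in for the join condition, using session 3's times from the question):

```python
# Why the TRUNC matters: time_dim only has top-of-hour ticks, so a session
# starting mid-hour would miss its first hour's tick unless the start is
# truncated down to the hour before the BETWEEN comparison.
from datetime import datetime

tick = datetime(2017, 4, 20, 15, 0)        # time_dim row for hour 15
start = datetime(2017, 4, 20, 15, 42, 10)  # session 3 starts mid-hour
stop = datetime(2017, 4, 20, 17, 42, 10)

without_trunc = start <= tick <= stop      # False: the 15:00 tick is before 15:42
trunc_start = start.replace(minute=0, second=0)
with_trunc = trunc_start <= tick <= stop   # True: session now counts in hour 15
```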

Upvotes: 1

David Faber

Reputation: 12485

I think something like this will work:

WITH ct ( active_dt ) AS (
    -- Build the query for the "table" of hours
    SELECT DATE '2017-04-19' + (LEVEL-1)/24 AS active_dt FROM dual
   CONNECT BY DATE '2017-04-19' + (LEVEL-1)/24 < DATE '2017-04-22'
)
SELECT active_dt AS "Date", active_hr AS "HR"
     , COUNT(session_id) AS active_sessions
     , COUNT(DISTINCT user_name) AS distinct_users
  FROM (
    SELECT TRUNC(ct.active_dt) AS active_dt
         , TO_CHAR(ct.active_dt, 'HH24') AS active_hr
         , ws.session_id, ws.user_name
      FROM ct LEFT JOIN web_session ws
        ON ct.active_dt + 1/24 > ws.start_date
       AND ct.active_dt < ws.stop_date
) GROUP BY active_dt, active_hr
 ORDER BY active_dt DESC, active_hr DESC;

I may not have the conditions for the LEFT JOIN 100% correct.

Hope this helps.

Upvotes: 1
