Reputation: 931
I have a SQL related question I would love some help with as a suitable answer has been eluding me for some time.
Background
I’m working with a vendor product which has an Oracle Database which serves as the backend. I have the ability to write any adhoc SQL to query the underlying tables, but I cannot make any changes to their underlying structure (or to the data model itself). The table I’m interested currently has about +1M rows and essentially tracks users sessions. It has 4 columns of interest: session_id
(which is a primary key and unique per session), user_name
, start_date
(date which tracks the beginning of the session), and stop_date
(date which tracks the end of the session). My goal is to perform the aggregation of data for active sessions based on month, day, and hour give a set start date and end date. I need to create a view (or 3 separate views) which can either perform the aggregation itself or serve as the intermediate object from which I can then query and perform the aggregation. I understand the eventual SQL / view may actually need to be 3 different views (one for month, one for day, one for hour), but it seems to me that the concept (once achieved) should be the same regardless of the time period.
Current table example
Table Name = web_session
| Session_id | user_name | start_date | stop_date
----------------------------------------------------------------------------
| 1 | joe | 4/20/2017 10:42:10 PM | 4/21/2017 2:42:10 AM |
| 2 | matt | 4/20/2017 5:43:10 PM | 4/20/2017 5:59:10 PM |
| 3 | matt | 4/20/2017 3:42:10 PM | 4/20/2017 5:42:10 PM |
| 4 | joe | 4/20/2017 11:20:10 AM | 4/20/2017 4:42:10 PM |
| 5 | john | 4/20/2017 8:42:10 AM | 4/20/2017 11:42:10 AM |
| 6 | matt | 4/20/2017 7:42:10 AM | 4/20/2017 11:42:10 PM |
| 7 | joe | 4/19/2017 11:20:10 PM | 4/20/2017 1:42:10 AM |
Ideal Output For Hour View
-12:00 can be either 0 or 24 for the example
| Date | HR | active_sessions | distinct_users |
------------------------------------------------------------
| 4/21/2017 | 2 | 1 | 1 |
| 4/21/2017 | 1 | 1 | 1 |
| 4/20/2017 | 0 | 1 | 1 |
| 4/20/2017 | 23 | 1 | 1 |
| 4/20/2017 | 22 | 1 | 1 |
| 4/20/2017 | 17 | 2 | 1 |
| 4/20/2017 | 16 | 2 | 2 |
| 4/20/2017 | 15 | 2 | 2 |
| 4/20/2017 | 14 | 1 | 1 |
| 4/20/2017 | 13 | 1 | 1 |
| 4/20/2017 | 12 | 1 | 1 |
| 4/20/2017 | 11 | 3 | 3 |
| 4/20/2017 | 10 | 2 | 2 |
| 4/20/2017 | 9 | 2 | 2 |
| 4/20/2017 | 8 | 2 | 2 |
| 4/20/2017 | 7 | 1 | 1 |
| 4/20/2017 | 1 | 1 | 1 |
| 4/20/2017 | 0 | 1 | 1 |
| 4/19/2017 | 23 | 1 | 1 |
End Goal and Other Options
What I am eventually trying to achieve with this output is to populate a line chart which displays the number of active sessions for either a month, day, or hour (used in the example output) between two dates. In the hour example, the date in combination with the HR would be used along the X-axis and the active sessions would be used along the Y-axis. The distinct user count would be available if a user hovered over the point on the chart. FYI Active sessions are the total number of sessions that were open at any point during the interval. Distinct users are the total number of distinct users during the interval. If I logged on and off twice in the same hour, it would be 2 active sessions, but only 1 distinct user.
Alternative Solutions
This seems to be a problem which must have come up may times before, but from all of my googling and stack overflow research I cannot seem to find the correct approach. If I am thinking about the query or ideal output incorrectly I AM OPEN TO ALTERNATE SUGGESTIONS which allow me to get the desired output to populate the chart appropriately on the front end.
Some SQL I Have Tried (Good Faith Effort)
There are many queries I've tried, but I'll start with this one as it is the closest I got but is extremely slow (unusably so)and it still does not produce the result I need.
Select * FROM (
SELECT
u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt,
COUNT(Distinct u.session_id) as unique_sessions,
COUNT(Distinct u.user_name) as unique_users,
LISTAGG(u.user_name, ', ') WITHIN GROUP (ORDER BY u.user_name ASC) as users
FROM
(SELECT EXTRACT(year FROM l.start_date) as YearDt,
EXTRACT(month FROM l.start_date) as MonthDt,
EXTRACT(day FROM l.start_date) as DayDt,
EXTRACT(HOUR FROM CAST(l.start_date AS TIMESTAMP)) as HourDt,
EXTRACT(MINUTE FROM CAST(l.start_date AS TIMESTAMP)) as MinDt,
l.session_id,
l.user_name,
l.start_date as act_date,
1 as is_start
FROM web_session l
UNION ALL
SELECT EXTRACT(year FROM l.stop_date) as YearDt,
EXTRACT(month FROM l.stop_date) as MonthDt,
EXTRACT(day FROM l.stop_date) as DayDt,
EXTRACT(HOUR FROM CAST(l.stop_date AS TIMESTAMP)) as HourDt,
EXTRACT(MINUTE FROM CAST(l.stop_date AS TIMESTAMP)) as MinDt,
l.session_id,
l.user_name,
l.stop_date as act_date,
0 as is_start
FROM web_session l
) u
GROUP BY CUBE ( u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt)
) c
Upvotes: 3
Views: 2060
Reputation: 167962
You can use a CTE (Query 1) or a correlated hierarchical query (Query 2) to generate the hours within the time ranges and then aggregate. This only requires a single table scan:
Oracle 11g R2 Schema Setup:
CREATE TABLE Web_Session ( Session_id, user_name, start_date, stop_date ) AS
SELECT 1, 'joe', CAST( TIMESTAMP '2017-04-20 22:42:10' AS DATE ), CAST( TIMESTAMP '2017-04-21 02:42:10' AS DATE ) FROM DUAL UNION ALL
SELECT 2, 'matt', TIMESTAMP '2017-04-20 17:43:10', TIMESTAMP '2017-04-20 17:59:10' FROM DUAL UNION ALL
SELECT 3, 'matt', TIMESTAMP '2017-04-20 15:42:10', TIMESTAMP '2017-04-20 17:42:10' FROM DUAL UNION ALL
SELECT 4, 'joe', TIMESTAMP '2017-04-20 11:20:10', TIMESTAMP '2017-04-20 16:42:10' FROM DUAL UNION ALL
SELECT 5, 'john', TIMESTAMP '2017-04-20 08:42:10', TIMESTAMP '2017-04-20 11:42:10' FROM DUAL UNION ALL
SELECT 6, 'matt', TIMESTAMP '2017-04-20 07:42:10', TIMESTAMP '2017-04-20 23:42:10' FROM DUAL UNION ALL
SELECT 7, 'joe', TIMESTAMP '2017-04-19 23:20:10', TIMESTAMP '2017-04-20 01:42:10' FROM DUAL;
Query 1:
WITH hours ( session_id, user_name, hour, duration ) AS (
SELECT session_id,
user_name,
CAST( TRUNC( start_date, 'HH24' ) AS DATE ),
( TRUNC( stop_date, 'HH24' ) - TRUNC( start_date, 'HH24' ) ) * 24
FROM web_session
UNION ALL
SELECT session_id,
user_name,
hour + INTERVAL '1' HOUR, -- There is a bug in SQLFiddle that subtracts
-- hours instead of adding so -1 is used there.
duration - 1
FROM hours
WHERE duration > 0
)
SELECT hour,
COUNT( session_id ) AS active_sessions,
COUNT( DISTINCT user_name ) AS distinct_users
FROM hours
GROUP BY hour
ORDER BY hour
| HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z | 1 | 1 |
| 2017-04-20T00:00:00Z | 1 | 1 |
| 2017-04-20T01:00:00Z | 1 | 1 |
| 2017-04-20T07:00:00Z | 1 | 1 |
| 2017-04-20T08:00:00Z | 2 | 2 |
| 2017-04-20T09:00:00Z | 2 | 2 |
| 2017-04-20T10:00:00Z | 2 | 2 |
| 2017-04-20T11:00:00Z | 3 | 3 |
| 2017-04-20T12:00:00Z | 2 | 2 |
| 2017-04-20T13:00:00Z | 2 | 2 |
| 2017-04-20T14:00:00Z | 2 | 2 |
| 2017-04-20T15:00:00Z | 3 | 2 |
| 2017-04-20T16:00:00Z | 3 | 2 |
| 2017-04-20T17:00:00Z | 3 | 1 |
| 2017-04-20T18:00:00Z | 1 | 1 |
| 2017-04-20T19:00:00Z | 1 | 1 |
| 2017-04-20T20:00:00Z | 1 | 1 |
| 2017-04-20T21:00:00Z | 1 | 1 |
| 2017-04-20T22:00:00Z | 2 | 2 |
| 2017-04-20T23:00:00Z | 2 | 2 |
| 2017-04-21T00:00:00Z | 1 | 1 |
| 2017-04-21T01:00:00Z | 1 | 1 |
| 2017-04-21T02:00:00Z | 1 | 1 |
Execution Plan:
-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 14 | 364 | 7 | 00:00:01 |
| 1 | SORT GROUP BY | | 14 | 364 | 7 | 00:00:01 |
| 2 | VIEW | VW_DAG_0 | 14 | 364 | 7 | 00:00:01 |
| 3 | HASH GROUP BY | | 14 | 364 | 7 | 00:00:01 |
| 4 | VIEW | | 14 | 364 | 6 | 00:00:01 |
| 5 | UNION ALL (RECURSIVE WITH) BREADTH FIRST | | | | | |
| 6 | TABLE ACCESS FULL | WEB_SESSION | 7 | 245 | 3 | 00:00:01 |
| * 7 | RECURSIVE WITH PUMP | | | | | |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 7 - filter("DURATION">0)
Note
-----
- dynamic sampling used for this statement
Query 2:
SELECT t.COLUMN_VALUE AS hour,
COUNT( session_id ) AS active_sessions,
COUNT( DISTINCT user_name ) AS distinct_users
FROM web_session w
CROSS JOIN
TABLE(
CAST(
MULTISET(
SELECT TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24
FROM DUAL
CONNECT BY TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24 < w.stop_date
) AS SYS.ODCIDATELIST
)
) t
GROUP BY t.COLUMN_VALUE
ORDER BY hour
| HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z | 1 | 1 |
| 2017-04-20T00:00:00Z | 1 | 1 |
| 2017-04-20T01:00:00Z | 1 | 1 |
| 2017-04-20T07:00:00Z | 1 | 1 |
| 2017-04-20T08:00:00Z | 2 | 2 |
| 2017-04-20T09:00:00Z | 2 | 2 |
| 2017-04-20T10:00:00Z | 2 | 2 |
| 2017-04-20T11:00:00Z | 3 | 3 |
| 2017-04-20T12:00:00Z | 2 | 2 |
| 2017-04-20T13:00:00Z | 2 | 2 |
| 2017-04-20T14:00:00Z | 2 | 2 |
| 2017-04-20T15:00:00Z | 3 | 2 |
| 2017-04-20T16:00:00Z | 3 | 2 |
| 2017-04-20T17:00:00Z | 3 | 1 |
| 2017-04-20T18:00:00Z | 1 | 1 |
| 2017-04-20T19:00:00Z | 1 | 1 |
| 2017-04-20T20:00:00Z | 1 | 1 |
| 2017-04-20T21:00:00Z | 1 | 1 |
| 2017-04-20T22:00:00Z | 2 | 2 |
| 2017-04-20T23:00:00Z | 2 | 2 |
| 2017-04-21T00:00:00Z | 1 | 1 |
| 2017-04-21T01:00:00Z | 1 | 1 |
| 2017-04-21T02:00:00Z | 1 | 1 |
Execution Plan:
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 57176 | 2115512 | 200 | 00:00:03 |
| 1 | SORT GROUP BY | | 57176 | 2115512 | 200 | 00:00:03 |
| 2 | NESTED LOOPS | | 57176 | 2115512 | 195 | 00:00:03 |
| 3 | TABLE ACCESS FULL | WEB_SESSION | 7 | 245 | 3 | 00:00:01 |
| 4 | COLLECTION ITERATOR SUBQUERY FETCH | | 8168 | 16336 | 27 | 00:00:01 |
| * 5 | CONNECT BY WITHOUT FILTERING | | | | | |
| 6 | FAST DUAL | | 1 | | 2 | 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(TRUNC(:B1,'fmhh24')+(LEVEL-1)/24<:B2)
Note
-----
- dynamic sampling used for this statement
Upvotes: 2
Reputation: 6449
Matt,
What you need to do is generate a time dimension either as a static table or dynamically at run time:
create table time_dim (
ts date primary key,
year number not null,
month number not null,
day number not null,
wday number not null,
dy varchar2(3) not null,
hr number not null
);
insert into time_dim (ts, year, month, day, wday, dy, hr)
select ts
, extract(year from ts) year
, extract(month from ts) month
, extract(day from ts) day
, to_char(ts,'d') wday
, to_char(ts,'dy') dy
, to_number(to_char(ts,'HH24')) hr
from (
select DATE '2017-01-01' + (level - 1)/24 ts
FROM DUAL connect by level <= 365*24) a;
Then outer join that to your web_sessions
table:
select t.ts, t.year, t.month, t.wday, t.dy, t.hr
, count(session_id) sessions
, count(distinct user_name) users
from time_dim t
left join web_session w
on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
group by rollup (t.year, t.month, (t.wday, t.dy), (t.hr, t.ts));
You can change up the group by clause to get the various aggregates you're interested in.
In the above code, I'm truncating the start_date to the hour in the ON
clause so that the start hour will be included in the results otherwise sessions that don't start exactly at the top of the hour would not get counted in that hour.
Upvotes: 1
Reputation: 12485
I think something like this will work:
WITH ct ( active_dt ) AS (
-- Build the query for the "table" of hours
SELECT DATE'2018-04-19' + (LEVEL-1)/24 AS active_dt FROM dual
CONNECT BY DATE'2018-04-19' + (LEVEL-1)/24 < DATE'2018-04-22'
)
SELECT active_dt AS "Date", active_hr AS "HR"
, COUNT(session_id) AS active_sessions
, COUNT(DISTINCT user_name) AS distinct_users
FROM (
SELECT TRUNC(ct.active_dt) AS active_dt
, TO_CHAR(ct.active_dt, 'HH24') AS active_hr
, ws.session_id, ws.user_name
FROM ct LEFT JOIN web_session ws
ON ct.active_dt + 1/24 >= ws.start_dt
AND ct.active_dt < ws.stop_dt
) GROUP BY active_dt, active_hr
ORDER BY active_dt DESC, active_hr DESC;
I may not have the conditions for the LEFT JOIN
100% correct.
Hope this helps.
Upvotes: 1