idanshmu

Reputation: 5251

Consider a single record, per id, in a group by

Background

I have an SQL table with 4 columns: id, g1, g2, and datetime.

I have this query:

SELECT g1,
       COUNT(DISTINCT id),
       SUM(COUNT(DISTINCT id)) OVER () AS total,
       (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
FROM my_table
WHERE g2 = 'start'
GROUP BY 1
order by share desc

This query was built to answer: What is the distribution of g1 values across users?

Problem

Each id may have multiple records in the table. I want to consider only the earliest one, where "earliest" means the record with the minimum datetime value.

Example

Table

id    g1    g2      datetime
x1    a     start   2016-01-19 21:01:22
x1    c     start   2016-01-19 21:01:21
x2    b     start   2016-01-19 09:03:42
x1    a     start   2016-01-18 13:56:45

Actual query results

g1  count   total   share
a   2       4       0.5
b   1       4       0.25
c   1       4       0.25

We have 4 records, but I only want to consider two of them:

x2    b     start   2016-01-19 09:03:42
x1    a     start   2016-01-18 13:56:45

which are the earliest records per id.

Expected query results

g1  count   total   share
a   1       2       0.5
b   1       2       0.5

Question

How do I consider only the earliest record, per id, in the group by?

Upvotes: 6

Views: 101

Answers (4)

Ely

Reputation: 11152

You are selecting every row from my_table, although you only want the row with the earliest datetime for each id. I assume id is the primary key in the table.

I suggest you define a view (or inline view) that returns only the earliest row for each id, and run your query against that view instead of against my_table.

The view could be defined as follows; it contains only the rows with the earliest datetime for each id:

select *
from my_table a
where a.datetime = (select min(z.datetime) from my_table z where a.id = z.id)
  and a.g2 = 'start'

You can define that as a view or use it directly inline as in:

SELECT g1,
       COUNT(DISTINCT id),
       SUM(COUNT(DISTINCT id)) OVER () AS total,
       (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
FROM (
    select a.id, a.g1, a.g2, a.datetime
    from my_table a
    where a.datetime = (select min(z.datetime) from my_table z where a.id = z.id)
      and a.g2 = 'start'
) earliest  -- most DBMSs require an alias on a derived table
GROUP BY 1
order by share desc
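
As a quick check, the inline-view approach can be run against the four sample rows from the question. This is only a minimal sketch: it assumes Python's built-in sqlite3 module with an in-memory database, the derived-table alias `earliest` is illustrative, and the total and share are computed in Python rather than with window functions so that it does not depend on the SQLite version.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("x1", "a", "start", "2016-01-19 21:01:22"),
        ("x1", "c", "start", "2016-01-19 21:01:21"),
        ("x2", "b", "start", "2016-01-19 09:03:42"),
        ("x1", "a", "start", "2016-01-18 13:56:45"),
    ],
)

# Keep only the row with the minimum datetime per id, then count ids per g1.
query = """
SELECT g1, COUNT(DISTINCT id) AS cnt
FROM (
    SELECT a.id, a.g1
    FROM my_table a
    WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE z.id = a.id)
      AND a.g2 = 'start'
) earliest
GROUP BY g1
ORDER BY g1
"""
counts = conn.execute(query).fetchall()

# Total and share computed client-side (assumption: small result set).
total = sum(cnt for _, cnt in counts)
result = [(g1, cnt, total, cnt / total) for g1, cnt in counts]
print(result)
```

On the sample data this gives one row for a and one for b, each with a share of 0.5, which matches the expected results in the question.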

Upvotes: 1

Unnikrishnan R

Reputation: 5031

Try the query below.

;WITH cte_1 AS (
    SELECT id, MIN(datetime) AS [Date]
    FROM YourTable
    GROUP BY id
)
SELECT yt.g1,
       COUNT(DISTINCT yt.id) AS [Count],
       SUM(COUNT(DISTINCT yt.id)) OVER () AS total,
       (CAST(COUNT(DISTINCT yt.id) AS float) / SUM(COUNT(DISTINCT yt.id)) OVER ()) AS share
FROM cte_1 c
JOIN YourTable yt
    ON yt.[datetime] = c.[Date] AND yt.id = c.id
   AND yt.g2 = 'start'
GROUP BY yt.g1
ORDER BY share DESC

Upvotes: 2

Thomas G

Reputation: 10206

I don't know which DBMS you are using, so here is a standard ANSI way to do this:

SELECT T1.g1,
       COUNT(DISTINCT T1.id),
       SUM(COUNT(DISTINCT T1.id)) OVER () AS total,
       (CAST(COUNT(DISTINCT T1.id) AS float) / SUM(COUNT(DISTINCT T1.id)) OVER ()) AS share
FROM my_table T1
INNER JOIN 
    (SELECT id, MIN(datetime) AS mindt
     FROM my_table 
     GROUP BY id
     ) T2 ON T1.datetime = T2.mindt AND T1.id = T2.id
AND T1.g2 = 'start'
GROUP BY 1
ORDER BY share DESC

It might be slow if you have a large table and datetime is not indexed.
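
One way to address that is a composite index covering the per-id minimum lookup. The sketch below assumes SQLite driven from Python's built-in sqlite3 module, and the index name `idx_my_table_id_datetime` is hypothetical; index syntax and how the planner uses the index vary by DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT)")

# Composite index on (id, datetime): supports both the GROUP BY id / MIN(datetime)
# subquery and the join back on (id, datetime).
conn.execute("CREATE INDEX idx_my_table_id_datetime ON my_table (id, datetime)")

# List the indexes SQLite now knows about for this table (column 1 is the name).
index_names = [row[1] for row in conn.execute("PRAGMA index_list('my_table')")]
print(index_names)
```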

Upvotes: 2

Tim Biegeleisen

Reputation: 520918

Here is a solution which should work in SQL Server, and in any database which supports CTEs:

WITH cte AS
(
    SELECT t1.g1,
           COUNT(*) AS count
    FROM yourTable t1
    INNER JOIN
    (
        -- earliest datetime per id
        SELECT id, MIN(datetime) AS datetime
        FROM yourTable
        GROUP BY id
    ) t2
        ON t1.id = t2.id AND
           t1.datetime = t2.datetime
    GROUP BY t1.g1
)

SELECT t.g1,
       t.count,
       (SELECT SUM(count) FROM cte) AS total,
       -- cast to float to avoid integer division
       CAST(t.count AS float) / (SELECT SUM(count) FROM cte) AS share
FROM cte t

Upvotes: 2
