John Verrone
John Verrone

Reputation: 307

SQL remove duplicate rows based on other rows

I have two tables (one that shows a record of all the programs sent to a customer, and another that shows a record of all the programs the user took action on). My ultimate goal is to count how many of those programs that were sent to the user were acted upon. I have a way of doing that right now but there is a problem. The programs may be sent to the user multiple times (I will use 20 for an example) within a single day. If the user takes action on that program within 24 hours, I don't want it to count as 19 programs that weren't acted on and 1 program that was acted on. I want it to count as 1 acted upon program with 0 fails.

The table that contains the programs sent to the customer needs to be filtered down so that each program that is sent to the customer only shows once per 24 hour period.

My solution is to have a "window" type thing so that once a program is sent to a customer, it is 'locked' for that customer for 24 hours so it won't appear in my 'programs sent to customer' query multiple times.

I have a table like this:

Customer    Time    Program
-----------------------------------
1           8:05    a
1           10:30   a
1           11:30   a
1           12:30   b
1           1:25    a
2           9:38    b
2           10:38   c
2           1:36    c
2           2:40    c
2           3:41    b
.
.
.

I want to get a table (query, not delete) that removes duplicate programs per customer within a certain time frame (confusing! I know)

Here is what I want (with a 3hr time frame for example):

Customer    Time    Program
-----------------------------------
1           8:05    a
1           11:30   a
1           12:30   b
2           9:38    b
2           10:38   c
2           2:40    c
2           3:41    b
.
.
.

Upvotes: 3

Views: 633

Answers (2)

ypercubeᵀᴹ
ypercubeᵀᴹ

Reputation: 115510

I think that what you mean can be solved (only) with a recursive query. Here is a solution:

WITH cte AS
  ( SELECT 
        customer, program, time,
        ROW_NUMBER() 
            OVER (PARTITION BY customer, program
                  ORDER BY time)
          AS rn, 
        MIN(time) 
            OVER (PARTITION BY customer, program
                  ORDER BY time 
                  RANGE BETWEEN 3.0/24 FOLLOWING 
                            AND UNBOUNDED FOLLOWING)
          AS next_time
    FROM a
  )

SELECT 
    customer, time, program
FROM 
    cte
START WITH rn = 1
CONNECT BY PRIOR customer = customer
       AND PRIOR program = program
       AND PRIOR next_time = time
ORDER BY 
    customer, time, program ;

You can also replace MIN(time) with FIRST_VALUE(time) above and get the same results. It might be more efficient.

Tested at SQL-Fiddle

The efficiency on a big table is probably not going to be very good. You can try running the query with smaller set of data.

And you should at least add this index, so it does an index scan:

CREATE INDEX ix                -- choose a name for the index
  ON tableX                    -- the table name
  (customer, program, time) ;

You could also skip the final ordering or change it so it is more similar to the used index:

ORDER BY 
    customer, program, time ;

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269443

Try this:

select *
from t
where not exists (select 1 from t t2
                  where t2.customer = t.customer and
                        t2.program = t.program and
                        t2.time - t.time < 3.0/24 and
                        t2.time > t.time
                 )

Date time arithmetic depends on databases, but this works for many of them.

Upvotes: 2

Related Questions