Reputation: 65
I would like some help on this situation. I have a table with UUID (unique), email (repeated), timestamp (unique) and has_sales (can be 1 if yes and 0 if no)
Sample data
uuid email timestamp has_sales
1 [email protected] 2016-10-02 10:28:23 0
2 [email protected] 2017-10-03 10:28:23 0
3 [email protected] 2017-10-06 17:08:15 1
4 [email protected] 2017-12-04 20:47:17 0
5 [email protected] 2018-05-21 15:27:04 0
6 [email protected] 2016-10-02 10:28:23 1
7 [email protected] 2017-10-03 10:28:23 0
I would like to choose the oldest timestamp, except when there is a sale on a newer one (it is rare, but it may occur). So, the expected result would be
uuid email timestamp has_sales
3 [email protected] 2017-10-06 17:08:15 1
6 [email protected] 2016-10-02 10:28:23 1
Currently, I am just using the first condition (oldest timestamp), as follows:
SELECT
dm1.uuid,
dm1.email,
dm1.timestamp,
dm1.has_sales
FROM dup_mail dm1
where
time_stamp = (select min(time_stamp)
from dup_mail dm2
where dm1.email = dm2.email
)
order by 2
How, upgrading this code, could I add the condition of if there is sale to a newer user and no sale to the older one, I would choose the newer one? Each email is related to either no sales (0 in all duplicate accounts) or yes sale (1 in one of the duplicate accounts and 0 in the other ones). Even if there is more than one duplicate account with sales, I just want to know if there was a sale or not
Upvotes: 0
Views: 2617
Reputation: 108410
The correlated subquery could be rewritten
SELECT dm2.timestamp
FROM dup_mail dm2
WHERE dm2.email = dm1.email
ORDER
BY dm2.has_sales DESC
, dm2.timestamp ASC
LIMIT 1
This is going to sort the rows with has_sales=1
before rows with has_sales=0
, and then by timestamp
. The LIMIT 1
clause picks the first row (after the set is sorted.)
We are going to want a suitable index on dup_mail
table with email
as the leading column. Including timestamp
and has_sales
columns in the index would make it a covering index for the subquery.
This should satisfy the specification, but the correlated subquery may not be optimal in terms of performance.
SELECT dm1.uuid
, dm1.email
, dm1.timestamp
, dm1.has_sales
FROM dup_mail dm1
WHERE dm1.timestamp =
( SELECT dm2.timestamp
FROM dup_mail dm2
WHERE dm2.email = dm1.email
ORDER
BY dm2.has_sales DESC
, dm2.timestamp ASC
LIMIT 1
)
ORDER
BY ...
(It's a bit odd that timestamp would be unique across all of the rows; but if it is, then this query could work.)
We might get better performance with something like this:
SELECT dmx.email
, IF( MAX(dmx.has_sales)=0
, MIN(dmx.timestamp)
, MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
) AS min_timestamp
FROM dup_email dmx
GROUP BY dmx.email
And then using that as an inline view and join to the dup_mail
table to get the rows associated with the minimum timestamp
SELECT dm1.uuid
, dm1.email
, dm1.timestamp
, dm1.has_sales
FROM ( -- minimum timestamp for each email
SELECT dmx.email
, IF( MAX(dmx.has_sales)=0
, MIN(dmx.timestamp)
, MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
) AS min_timestamp
FROM dup_email dmx
GROUP BY dmx.email
) m
JOIN dup_email dm1
ON dm1.email = m.email
AND dm1.timestamp = m.min_timestamp
ORDER
BY ...
NOTE
SQL syntax given above is specific to MySQL (the question is tagged as MySQL).
I think the IF()
function is a MySQL-only extension.
For PostgreSQL, replace this:
, IF( MAX(dmx.has_sales)=0
, MIN(dmx.timestamp)
, MIN(IF(dmx.has_sales=1,dmx.timestamp,NULL))
) AS min_timestamp
with a more portable, more ANSI standards compliant
, CASE WHEN MAX(dmx.has_sales) = 0
THEN MIN(dmx.timestamp)
ELSE MIN( CASE WHEN dmx.has_sales = 1
THEN dmx.timestamp
END
)
END AS min_timestamp
Upvotes: 2