Reputation: 25
We have two tables with a mostly unique email, and a date where a transaction was sent (from one system) and received (in another system):
CREATE TABLE `alpha` (
`id` int(11) NOT NULL,
`email` varchar(255) NOT NULL,
`date_sent` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `alpha`
VALUES
(12344,'[email protected]','2013-01-01 02:26:04'),
(12345,'[email protected]','2013-01-01 04:39:16'),
(12346,'[email protected]','2013-01-01 04:43:18');
CREATE TABLE `bravo` (
`id` int(11) NOT NULL,
`email` varchar(60) DEFAULT NULL,
`date_recvd` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `bravo`
VALUES
(98764,'[email protected]','2013-01-01 03:29:12'),
(98765,'[email protected]','2013-01-01 05:42:08'),
(98766,'[email protected]','2013-01-01 05:46:08');
With a simple join on email and m/d/y of the date:
select a.id, a.date_sent, b.id, b.date_recvd
from alpha a inner join bravo b
on a.email = b.email and date_format(a.date_sent,'%m/%d/%Y') = date_format(b.date_recvd,'%m/%d/%Y')
We get every permutation of email+date:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98765 | 2013-01-01 05:42:08 |
| 12345 | 2013-01-01 04:39:16 | 98766 | 2013-01-01 05:46:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
What we want is something more like this, where we join firstly on the email, and then only the dates in an order that they sort of line up:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
But I'm not even certain how to approach this?
Clarification: What we'd like to do is, emails being equal, eliminate the duplicates so that the date gaps are smallest.
Upvotes: 1
Views: 3709
Reputation: 3043
Under certain conditions the following query will provide the results you want:
SELECT an.*, bn.*
FROM
(SELECT a.*,
(CASE a.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=a.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM alpha ORDER BY email, date_sent) a) an
JOIN
(SELECT b.*,
(CASE b.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=b.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM bravo ORDER BY email, date_recvd) b) bn
ON an.email=bn.email AND an.rn=bn.rn;
With the limited data you provided, this works. You can see it here: SQLFiddle
What this is doing is:
This will work ONLY if alpha and bravo contain good data that matches well.
The conditions are quite strict, especially on the bravo table. In particular, bravo should not contain and early rows... rows that match email with alpha, but have date_recvd less than the first alpha date_sent (with same email).
You could elaborate on this and work out a more complex version that works on email, date (day only) and rownumber... as you suggested in your question. But I don't think this is a good solution. I see you have significant gaps between date_sent and date_recvd. If the gaps roll over midnight you will not be able to match rows correctly.
Upvotes: 1