Cross join, on multiple columns, without duplicates

Question

We have two tables with a mostly unique email, and a date where a transaction was sent (from one system) and received (in another system):

CREATE TABLE `alpha` (
  `id` int(11) NOT NULL,
  `email` varchar(255) NOT NULL,
  `date_sent` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO `alpha`
VALUES
(12344,'loremipsum@example.com','2013-01-01 02:26:04'),
(12345,'foobar@example.com','2013-01-01 04:39:16'),
(12346,'foobar@example.com','2013-01-01 04:43:18');

CREATE TABLE `bravo` (
  `id` int(11) NOT NULL,
  `email` varchar(60) DEFAULT NULL,
  `date_recvd` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO `bravo`
VALUES
(98764,'loremipsum@example.com','2013-01-01 03:29:12'),
(98765,'foobar@example.com','2013-01-01 05:42:08'),
(98766,'foobar@example.com','2013-01-01 05:46:08');

With a simple join on email and m/d/y of the date:

select a.id, a.date_sent, b.id, b.date_recvd
from alpha a inner join bravo b
  on a.email = b.email and date_format(a.date_sent,'%m/%d/%Y') = date_format(b.date_recvd,'%m/%d/%Y')

We get every permutation of email+date:

| a.id  | a.date_sent         | b.id  | b.date_recvd        |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98765 | 2013-01-01 05:42:08 |
| 12345 | 2013-01-01 04:39:16 | 98766 | 2013-01-01 05:46:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |

What we want is something more like this, where we join firstly on the email, and then only the dates in an order that they sort of line up:

| a.id  | a.date_sent         | b.id  | b.date_recvd        |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |

But I'm not even certain how to approach this?

Clarification: What we'd like to do is, emails being equal, eliminate the duplicates so that the date gaps are smallest.

Frazz · Accepted Answer

Under certain conditions the following query will provide the results you want:

SELECT an.*, bn.*
FROM 
  (SELECT a.*,
    (CASE a.email
       WHEN @curEmail THEN @i:=@i+1
       ELSE                @i:=1 AND @curEmail:=a.email
    END) AS rn
  FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM alpha ORDER BY email, date_sent) a) an
JOIN
  (SELECT b.*,
    (CASE b.email
       WHEN @curEmail THEN @i:=@i+1
       ELSE                @i:=1 AND @curEmail:=b.email
    END) AS rn
  FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM bravo ORDER BY email, date_recvd) b) bn
ON an.email=bn.email AND an.rn=bn.rn;

With the limited data you provided, this works. You can see it here: SQLFiddle

What this is doing is:

Adding an rn column to alpha... this is some sort of row numbering within all rows with the same email, sorted by date_sent
Adding an rn column to bravo... same as above
JOINing the two result sets on email and rn

This will work ONLY if alpha and bravo contain good data that matches well.

The conditions are quite strict, especially on the bravo table. In particular, bravo should not contain and early rows... rows that match email with alpha, but have date_recvd less than the first alpha date_sent (with same email).

You could elaborate on this and work out a more complex version that works on email, date (day only) and rownumber... as you suggested in your question. But I don't think this is a good solution. I see you have significant gaps between date_sent and date_recvd. If the gaps roll over midnight you will not be able to match rows correctly.

Cross join, on multiple columns, without duplicates

Answers (1)

Related Questions