Alvin Chang
Alvin Chang

Reputation: 3

Identifying the pairs of ID's in a column with the highest number of matches in SQL

I am trying to find the pairs of businesses with the highest number of common customers using MySQL.

The table is like the following:

+------------+------------+ 
| BusinessID | CustomerID |
+------------+------------+
| A          |          1 |
| A          |          2 |
| A          |          3 |
| B          |          4 |
| B          |          1 |
| B          |          3 |
| B          |          2 |
| C          |          3 |
| C          |          4 |
| C          |          5 |
+------------+------------+

And I want the output to be the pairs of businesses and the number of common customers, like this:

+-------------+-------------+------------------------+
| BusinessID  | BusinessID  | Common Customers Count |
+-------------+-------------+------------------------+
| A           | B           |                      3 |
| A           | C           |                      1 |
| B           | C           |                      2 |
+-------------+-------------+------------------------+

This is the query I wrote:

SELECT a.BusinessID,b.BusinessID,COUNT(*) AS ncom
FROM (SELECT BusinessID, CustomerID FROM MYTABLE) AS a JOIN       
     (SELECT BusinessID,CustomerID FROM MYTABLE) AS b 
     ON a.BusinessID < b.BusinessID AND a.CustomerID = b.CustomerID
GROUP BY a.BusinessID, b.BusinessID
ORDER BY ncom   

The problem is that my dataset has about 5m rows, and this seems to be too inefficient on large datasets. I tested the query on smaller datasets by limiting the data -- it took 8 seconds to process 10k rows and 30 seconds for 20k rows, so this query wouldn't be feasible to run for 5m rows. How else can I write the query to make it faster?

Upvotes: 0

Views: 33

Answers (1)

Barmar
Barmar

Reputation: 782775

Don't use subqueries to get the columns from the table, that's probably preventing it from using indexes.

SELECT a.BusinessID, b.BusinessID, COUNT(*) as ncom
FROM MYTABLE AS a
JOIN MYTABLE AS b ON a.BusinessID < b.BusinessID AND a.CustomerID = b.CustomerID
GROUP BY a.BusinessID, b.BusinessID
ORDER BY ncom

Also, give the table the following index:

CREATE INDEX ix_cust_bus ON MYTABLE (CustomerID, BusinessID);

Upvotes: 1

Related Questions