Reputation: 3
I am trying to find the pairs of businesses with the highest number of common customers using MySQL.
The table is like the following:
+------------+------------+
| BusinessID | CustomerID |
+------------+------------+
| A | 1 |
| A | 2 |
| A | 3 |
| B | 4 |
| B | 1 |
| B | 3 |
| B | 2 |
| C | 3 |
| C | 4 |
| C | 5 |
+------------+------------+
And I want the output to be the pairs of businesses and the number of common customers, like this:
+-------------+-------------+------------------------+
| BusinessID | BusinessID | Common Customers Count |
+-------------+-------------+------------------------+
| A | B | 3 |
| A | C | 1 |
| B | C | 2 |
+-------------+-------------+------------------------+
This is the query I wrote:
SELECT a.BusinessID,b.BusinessID,COUNT(*) AS ncom
FROM (SELECT BusinessID, CustomerID FROM MYTABLE) AS a JOIN
(SELECT BusinessID,CustomerID FROM MYTABLE) AS b
ON a.BusinessID < b.BusinessID AND a.CustomerID = b.CustomerID
GROUP BY a.BusinessID, b.BusinessID
ORDER BY ncom
The problem is that my dataset has about 5m rows, and this seems to be too inefficient on large datasets. I tested the query on smaller datasets by limiting the data -- it took 8 seconds to process 10k rows and 30 seconds for 20k rows, so this query wouldn't be feasible to run for 5m rows. How else can I write the query to make it faster?
Upvotes: 0
Views: 33
Reputation: 782775
Don't use subqueries to get the columns from the table, that's probably preventing it from using indexes.
SELECT a.BusinessID, b.BusinessID, COUNT(*) as ncom
FROM MYTABLE AS a
JOIN MYTABLE AS b ON a.BusinessID < b.BusinessID AND a.CustomerID = b.CustomerID
GROUP BY a.BusinessID, b.BusinessID
ORDER BY ncom
Also, give the table the following index:
CREATE INDEX ix_cust_bus ON MYTABLE (CustomerID, BusinessID);
Upvotes: 1