Aaron Carlino

Reputation: 131

What's the most efficient query?

I have a table named Projects that has the following relationships:

  • has many Contributions
  • has many Payments

In my result set, I need the following aggregate values:

  • TotalPaidBack: the sum of all payments made for the project
  • ContributorCount: the number of distinct donors
  • TotalReceived: the sum of all contributions

Because there are so many aggregate functions and multiple joins, it gets messy to use standard aggregate functions with the GROUP BY clause. I also need the ability to sort and filter these fields. So I've come up with two options:

Using subqueries:

SELECT Project.ID AS PROJECT_ID,
    -- the correlated subqueries must reference the outer column (Project.ID);
    -- the PROJECT_ID column alias is not visible inside them
    (SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = Project.ID) AS TotalPaidBack,
    (SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = Project.ID) AS ContributorCount,
    (SELECT SUM(Amount) FROM Contribution WHERE RecipientID = Project.ID) AS TotalReceived
FROM Project;
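
To sort and filter on those computed columns, I can wrap the query in a derived table so the aliases become usable in the WHERE clause (a sketch; the filter threshold is made up):

SELECT *
FROM (
    SELECT Project.ID AS PROJECT_ID,
        (SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = Project.ID) AS TotalPaidBack,
        (SELECT SUM(Amount) FROM Contribution WHERE RecipientID = Project.ID) AS TotalReceived
    FROM Project
) AS ProjectTotals
WHERE TotalReceived > 1000   -- hypothetical threshold
ORDER BY TotalPaidBack DESC;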

Using a temporary table:

DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (
    project_id     INT NOT NULL,
    total_payments INT,
    total_donors   INT,
    total_received INT,
    PRIMARY KEY (project_id)
) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
 SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
 SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID  GROUP BY 1
 ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);

SELECT * FROM Project_Temp;

Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?

Upvotes: 1

Views: 212

Answers (4)

Tyler Dumont

Reputation: 11

A few thoughts:

  • The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means MySQL materializes the full derived table before applying the outer WHERE clause, which doesn't scale at all.

  • Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.

  • The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could turn that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payment and Contribution tables to keep the project stats table up to date (see the sketch after this list).

  • Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, having a cron job that runs every few minutes that does the work that you specified in Query #2 above, except on the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.
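
A minimal sketch of the trigger route (the ProjectStats table and trigger name here are hypothetical; matching AFTER UPDATE and AFTER DELETE triggers, plus the same set on Contribution, would also be needed):

-- Persistent stats table in place of the temporary one
CREATE TABLE ProjectStats (
    project_id     INT NOT NULL PRIMARY KEY,
    total_payments INT NOT NULL DEFAULT 0
);

-- Keep total_payments current as payments arrive
CREATE TRIGGER payment_after_insert AFTER INSERT ON Payment
FOR EACH ROW
    INSERT INTO ProjectStats (project_id, total_payments)
    VALUES (NEW.ProjectID, NEW.PaymentAmount)
    ON DUPLICATE KEY UPDATE total_payments = total_payments + NEW.PaymentAmount;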

Upvotes: 1

Thomas

Reputation: 64655

There is a third option which is derived tables:

Select Project.ID AS PROJECT_ID
    , Coalesce(Payments.Total, 0) As TotalPaidBack
    , Coalesce(ContributionStats.ContributorCount, 0) As ContributorCount
    , Coalesce(ContributionStats.Total, 0) As TotalReceived
From Project
    Left Join   (
                Select C1.RecipientID, Sum(C1.Amount) As Total, Count(Distinct C1.DonorID) As ContributorCount
                From Contribution As C1
                Group By C1.RecipientID
                ) As ContributionStats
        On ContributionStats.RecipientID = Project.ID
    Left Join   (
                Select P1.ProjectID, Sum(P1.PaymentAmount) As Total
                From Payment As P1
                Group By P1.ProjectID
                ) As Payments
        On Payments.ProjectID = Project.ID;

I'm not sure if it will perform better, but you might give it a shot.

Upvotes: 1

mdma

Reputation: 57757

I would go with the first approach. You are allowing the RDBMS to do its job, rather than trying to do its job for it.

By creating a temp table, you will always build the full table for each query. If you only want data for one project, you still end up building the full table (unless you restrict each INSERT statement accordingly). Sure, you can code it, but it's already becoming a fair amount of code and complexity for a small performance gain.
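
With the plain SELECT, by contrast, restricting to one project is a single WHERE clause, and only that project's rows get aggregated (a sketch against the OP's schema; the ID value is made up):

SELECT Project.ID AS PROJECT_ID,
    (SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = Project.ID) AS TotalPaidBack,
    (SELECT SUM(Amount) FROM Contribution WHERE RecipientID = Project.ID) AS TotalReceived
FROM Project
WHERE Project.ID = 42;  -- hypothetical project id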

With a SELECT, the db can fetch the appropriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (the query, and possibly the data, depending upon your db). If performance is truly a concern, you might consider using Indexed/Materialized Views, or generating a table on an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partitioned views - something that I believe will be difficult if you are creating temporary tables.

EDIT: the above was written without any specific RDBMS in mind, although the OP has since added that MySQL is the target db.

Upvotes: 1

duffymo

Reputation: 308928

Knowing the timing per 1K rows is good, but the real question is how the results will be used.

Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.
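
If so, paging in MySQL is just a LIMIT clause (the page size of 25 and the sort column are illustrative):

-- third page of 25 results
SELECT * FROM Project ORDER BY ID LIMIT 25 OFFSET 50;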

Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.

My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.

You can use EXPLAIN (MySQL's equivalent of Oracle's EXPLAIN PLAN) to see what the difference between the two queries is.
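
In MySQL that's just a prefix on the query, e.g. against the first option:

EXPLAIN SELECT Project.ID AS PROJECT_ID,
    (SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = Project.ID) AS TotalPaidBack
FROM Project;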

Upvotes: 2
