Left outer join two levels deep in Postgres results in cartesian product

Given the following 4 tables:

CREATE TABLE events ( id, name )
CREATE TABLE profiles ( id, event_id )
CREATE TABLE donations ( amount, profile_id )
CREATE TABLE event_members( id, event_id, user_id )

I'm attempting to get a list of all events, along with a count of any members, and a sum of any donations. The issue is the sum of donations is coming back wrong (appears to be a cartesian result of donations * # of event_members).

Here is the SQL query (Postgres)

SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name

The sum(donations.amount) is coming back = to the actual sum of donations * number of rows in event_members. If I comment out the count(distinct event_members.id) and the event_members left outer join, the sum is correct.

Upvotes: 5

Answers (4)

Erwin Brandstetter

Reputation: 656706

As I explained in an answer to the referenced question you need to aggregate before joining to avoid a proxy CROSS JOIN. Like:

SELECT e.name, e.sum_donations, m.ct_members
FROM (
    SELECT e.id AS event_id, e.name, SUM(d.amount) AS sum_donations
    FROM   events         e
    LEFT   JOIN profiles  p ON p.event_id = e.id
    LEFT   JOIN donations d ON d.profile_id = p.id
    GROUP  BY 1, 2
    ) e
LEFT   JOIN (
    SELECT m.event_id, count(DISTINCT m.id) AS ct_members
    FROM   event_members m
    GROUP  BY 1
    ) m USING (event_id);

IF event_members.id is the primary key, then id is guaranteed to be UNIQUE in the table and you can drop DISTINCT from the count:

count(*) AS ct_members

Upvotes: 5

biziclop

Reputation: 14596

You seem to have this two independent structures (-[ means 1-N association):

events -[ profiles -[ donations
events -[ event members

I wrapped the second one into a subquery:

SELECT events.name,
  member_count.the_member_count
  COUNT(DISTINCT event_members.id),
  SUM(donations.amount)

FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id

LEFT OUTER JOIN (
  SELECT
    event_id,
    COUNT(*) AS the_member_count
  FROM event_members
  GROUP BY event_id
) AS member_count
  ON member_count.event_id = events.id

GROUP BY events.name

Upvotes: 2

LSerni

Reputation: 57388

When you do your query, you ask for all events - let's say there are two, event Alpha and event Beta - and then JOIN with the members. Let's say that there is a member Alice that participates on both events.

SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name

On each row you asked the total for Alice's donations. If Alice donated 100 USD, then you asked for:

Alpha  Alice  100USD
Beta   Alice  100USD

So it's not surprising that when asking for the sum total Alice comes out as having donated 200 USD.

If you wanted the sum of all donations, you'd better doing with two distinct queries. Trying to do everything with a single query, while possible, would be a classical SQL Antipattern (actually the one in chapter #18, "Spaghetti Query"):

Unintended Products

One common consequence of producing all your results in one query is a Cartesian product. This happens when two of the tables in the query have no condition restricting their relationship. Without such a restriction, the join of two tables pairs each row in the first table to every row in the other table. Each such pairing becomes a row of the result set, and you end up with many more rows than you expect.

Upvotes: 0

wich

Reputation: 17127

Of course you get a cartesian product between donations and events for every event since both are only bound to the event, there is no join relation between donations and event_members other than the event id, which of course means that every member matches every donation.

Upvotes: 1

Left outer join two levels deep in Postgres results in cartesian product

Answers (4)

Related Questions