nzifnab
nzifnab

Reputation: 16110

Using a complicated double join to get a count of child objects

Note that I'm using postgresql

I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.

organizations
------------
id (pk)
company_name

users
------------
id (pk)
organization_id

jobs
------------
id (pk)
client_id    (id of an organization)
server_id    (id of an organization)
creator_id   (id of a user)

documents
------------
id (pk)
job_id

Result Desired

organizations.id  |  organizations.company_name  |  document_count
85                |  Big Corporation             |  84
905               |  Some other folks            |  65
403               |  ACME, Inc                   |  14

As you can see, an organization can be connected to a document through 3 different paths:

  1. organizations.id => jobs.client_id => documents.job_id
  2. organizations.id => jobs.server_id => documents.job_id
  3. organizations.id => users.organization_id => jobs.creator_id => documents.job_id

But I want a query that will get the count of all the documents each company has access to...

I tried a couple of things... like this:

SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
  jobs.client_id = organizations.id OR
  jobs.server_id = organizations.id OR
  jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10

The query takes awhile to run, but it's not horrible since i'm doing it for a one-time report, but the results... cannot possibly be correct.

The first listed organization has a reported count of 129,834 documents -- but that's impossible since there's only 32,820 records in the documents table. I feel like it must be counting drastic quantities of duplicates (due to an error in one of my joins?) but I'm not sure where I've gone wrong.

The order appears correct since the highest volume user of the system is clearly at the top of the list... but the value is inflated somehow.

Upvotes: 1

Views: 233

Answers (5)

wildplasser
wildplasser

Reputation: 44250

But I want a query that will get the count of all the documents you have access to...

That's where your query starts:

SELECT ... FROM documents
...

Since the only clue to the documents table is in jobs, you'll need the jobs table as well::

SELECT ... 
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
...

Now, it is time for restrictions. Which documents do you actually want ? There are three cases you want: either the client_id matches the organisation, or the server_id maches the company, or the creator_id matches a user that happens to work for the company:

SELECT ... 
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
WHERE jo.client_id = $THE_COMPANY
   OR jo.server_id = $THE_COMPANY
   OR EXISTS (
      SELECT *
      FROM users uu
      JOIN organizations oo ON uu.organization_id = ex.id
      WHERE uu.id = jo.creator_id
        AND oo.id = $THE_COMAPNY
      )
     ;

But, there might be a problem here. If two or more different jobs-records would point to the same document, you would count these double. You can either add a DISTINCT to the outer query, or move the jobs-table down into a subquery:

SELECT ... 
FROM documents dc
WHERE EXISTS (
  SELECT *
  FROM jobs jo
  WHERE jo.document_id = dc.id
  AND ( jo.client_id = $THE_COMPANY
      OR jo.server_id = $THE_COMPANY
      OR EXISTS (
        SELECT *
        FROM users uu
        JOIN organizations oo ON uu.organization_id = ex.id
        WHERE uu.id = jo.creator_id
        AND oo.id = $THE_COMAPNY
        )
      )
    )
  ;

As you can see, the thee ways of selecting a document end up in a WHERE (a OR b OR c) clause.

UPDATE: (since the OP does not give us the table definions in a useble form I had to reconstruct these)

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
        --
        -- create the missing tables
        --
CREATE TABLE organizations
        ( id SERIAL NOT NULL PRIMARY KEY
        , company_name varchar
        );

CREATE TABLE users
        ( id SERIAL NOT NULL PRIMARY KEY
        , organization_id INTEGER NOT NULL REFERENCES organizations(id)
        );

CREATE TABLE jobs
        ( id SERIAL NOT NULL PRIMARY KEY
        , client_id    INTEGER NOT NULL REFERENCES organizations(id)
        , server_id    INTEGER NOT NULL REFERENCES organizations(id)
        , creator_id   INTEGER NOT NULL REFERENCES users(id)
        );

CREATE TABLE documents
        ( id SERIAL NOT NULL PRIMARY KEY
        , job_id INTEGER NOT NULL REFERENCES jobs(id)
        );
        --
        -- Populate
        --
INSERT INTO organizations(id, company_name) VALUES
 (85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
        ;
select setval('organizations_id_seq', 905);

INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
        ;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
        ;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
        ;
DELETE FROM documents
WHERE random() < 0.5
        ;

        --
        -- And the query ...
        --
EXPLAIN ANALYZE
SELECT o.id AS org
        , count(*) AS the_docs
FROM organizations o
JOIN documents d  ON 1=1 -- start with a carthesian product
WHERE EXISTS (
        SELECT *
        FROM jobs j
        WHERE d.job_id = j.id
        AND (j.client_id = o.id OR j.server_id = o.id )
        )
OR EXISTS (
        SELECT *
        FROM jobs j
        JOIN users u ON j.creator_id = u.id
        WHERE u.organization_id = o.id
        AND d.job_id = j.id
        )
GROUP BY o.id
        ;

Upvotes: 0

nzifnab
nzifnab

Reputation: 16110

None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:

SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
  SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
  FROM documents
  INNER JOIN jobs ON documents.job_id = jobs.id
  INNER JOIN organizations ON jobs.client_id = organizations.id
  UNION
  SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
  FROM documents
  INNER JOIN jobs ON documents.job_id = jobs.id
  INNER JOIN organizations ON jobs.server_id = organizations.id
  UNION
  SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
  FROM documents
  INNER JOIN jobs on documents.job_id = jobs.id
  INNER JOIN users ON jobs.creator_id = users.id
  INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC

The performance was much better than any of the people suggesting subqueries and it appears to have given me a reasonable answer

Upvotes: 0

huntharo
huntharo

Reputation: 2856

It's probably too late to redesign this, but you really should.

The jobs table should not have its own id field a d key.

The jobs table is horribly designed because every reference to a disk page from the id index is gonna have to go read 1-100 different pages from disk out of the data file just to get the three other id fields that you always want to use (which is the clue that a job should not have its own id).

You can make a quick fix by making jobs use an index that is clustered or clustering ( depending on the db system) on the job id field. And alternative will be to mark the other three id fields as "includes" on the index so the page reads to the data file will 100% go away. Either of these may be enough to make this "just work".

What I would encourage you to do though is drop the id field and key on jobs and instead make a "natural key" that has the three other id fields in it and use that key on the documents table as well.

I would also demoralize (repeat) the organization of the creator on the jobs table and the document table. A user isn't going to move to another org and keep the same acces, so you should never have to run a sweep to update these in sync and even if you did it would be easy.

With these changes you can just do a select on the documents table directly, skipping the random pages reads needed from the other tables. The group by to group across the three different id fields would be a bit tricky. I might give this a try as it is interesting.

In the short term though, try clustering or includes on the jobs table to solve the performance issue and I will check the join logic tonight.

Upvotes: 0

Dale M
Dale M

Reputation: 2473

Simplify your thinking. You have 3 paths to docid so write 3 queries, union them and count that

Upvotes: 1

ruakh
ruakh

Reputation: 183504

The problem is that if jobs.client_id = organizations.id or jobs.server_id = organizations.id, then there's nothing to filter your INNER JOIN users (aside from its ON clause), so you'll get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:

  • its total number of users times the total number of documents belonging to jobs for which it's a client
  • its total number of users times the total number of documents belonging to jobs for which it's a server
  • the total number of documents belonging to jobs for which one if its users is the creator

One way to fix this is to remove the INNER JOIN users line, and change this:

  jobs.creator_id = users.id

to this:

  jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)

. . . but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.

Upvotes: 1

Related Questions