joel.software
joel.software

Reputation: 1515

postgresql two nested joins and arrays in join

I've never had to ask an SO question before, I always eventually stumble upon what I'm looking for through searching, but there's a lot going on here and I've gotten to the end of my searching prowess... Thanks for your help / suggestions.

The problem

I need to generate reports on the fly about how many 'students' in my organization have taken each 'assessment', and the standards associated with each 'assessment'.

I'm trying to do this in the most performant way that minimizes the load on Postgres but I don't believe I've achieved that goal...

The Setup

Server: Amazon EC2 running Ubuntu

The App (Server): Node.js (0.10.26)

The App (Client): Angular.js (1.2.13)

Database (Queries): PostgreSQL (9.3.1)

Database (Cache): MongoDB (Hope to cache report in future)

The Tables

SQL FIDDLE

assessments

_id                int4
description        text
category           varchar(60)
sub_standards      int4 ARRAY
...

assessments_report (pre-calculated sums)

assessment_id(FK)  int4
client_id(FK)      int4
students           int4
completed          int4
incomplete         int4
...

sub_standards

_id
standard_id(FK)    int4
description        varchar(255)
...

standards

_id                int4
name               varchar(60)
description        varchar(255)
...    

The Query

//Stored as array for readability in Node-Postgres use
SELECT r.*, as.* FROM assessments_report r
INNER JOIN (
  SELECT a._id AS assessment_id, a.description, a.category, a.states,
  array_to_json(array_agg(ss.*)) AS standards
  FROM assessments a LEFT JOIN (
    SELECT ss.*, s.name AS parent_name, s.description AS parent_description
    FROM sub_standards ss
    INNER JOIN standards s ON ss.standard_id = s._id
  ) ss ON ss._id = ANY (a.sub_standards) GROUP BY a._id
) as
ON as.assessment_id = r.assessment_id
WHERE r.client_id = $1

The Desired Output For Each Row (shown as JSON)

assessment_id: 2,
students: 2,
complete: 1
incomplete: 1,
description: "...",
category: "...",
states: ["AL","AZ",...],
standards: [
  {
    _id: 1,
    standard_id: 3,
    description: "...",
    parent_name: "...",
    parent_description: "..."
  },
  {
    _id: 2,
    standard_id: 4,
    description: "...",
    parent_name: "...",
    parent_description: "..."
  },
]

SQL FIDDLE

Upvotes: 5

Views: 10769

Answers (1)

Kirk Roybal
Kirk Roybal

Reputation: 17867

EXPLAIN of your query produces a max cost of 537. You can take that down to 59 (factor 10 improvement) by just creating an index:

CREATE INDEX idx_sstandard_sid ON sub_standards (standard_id);

But this is even faster with a cost of 21:

WITH r AS (SELECT * FROM assessments_report WHERE client_id = $1)
, ss AS (SELECT r.*, a.description, a.category, a.states , a.sub_standards
    FROM r INNER JOIN assessments a ON a._id = r.assessment_id)
, ss3 AS (SELECT ss2.*, s.name parent_name, s.description parent_description
            FROM sub_standards ss2
            INNER JOIN standards s
            ON ss2.standard_id = s._id)
, ss1 AS (SELECT ss.*, array_to_json(array_agg(ss3.*)) standards
    FROM ss
    LEFT JOIN ss3
    ON ss3._id = ANY (ss.sub_standards)
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14 ) 
SELECT _id, client_id, region_id, group_id, assessment_id, students, basic, 
    proficient, advanced, in_progress, description, category, states, standards
FROM ss1;

You could probably get rid of the grouping and go somewhat faster than this.

The basic idea is to start with the most limiting constraint (client_id), expand that by the next criteria (sub_standards) and never expand the recordset any more than necessary after that. Lookups against primary keys are faster than complex joins. Hope this helps.

Final query plan:

CTE Scan on ss1  (cost=21.79..21.81 rows=1 width=434)
CTE r
->  Seq Scan on assessments_report  (cost=0.00..1.06 rows=1 width=40)
      Filter: (client_id = 4)
CTE ss
->  Nested Loop  (cost=0.14..8.19 rows=1 width=434)
      ->  CTE Scan on r  (cost=0.00..0.02 rows=1 width=40)
      ->  Index Scan using assesments_pkey on assessments a  (cost=0.14..8.16 rows=1 width=398)
            Index Cond: (_id = r.assessment_id)
CTE ss3
->  Hash Join  (cost=1.07..12.34 rows=3 width=1346)
      Hash Cond: (s._id = ss2.standard_id)
      ->  Seq Scan on standards s  (cost=0.00..10.90 rows=90 width=818)
      ->  Hash  (cost=1.03..1.03 rows=3 width=532)
            ->  Seq Scan on sub_standards ss2  (cost=0.00..1.03 rows=3 width=532)
CTE ss1
->  HashAggregate  (cost=0.19..0.20 rows=1 width=462)
      ->  Nested Loop Left Join  (cost=0.00..0.15 rows=1 width=462)
            Join Filter: (ss3._id = ANY (ss.sub_standards))
            ->  CTE Scan on ss  (cost=0.00..0.02 rows=1 width=434)
            ->  CTE Scan on ss3  (cost=0.00..0.06 rows=3 width=32)

Upvotes: 3

Related Questions