Reputation: 4648
I'm using Sequelize to run some queries on my Postgres database. Because of the pagination I'm doing, I discovered that I had to use a subQuery and group by the primary key of the primary model I'm querying against. While this fixed the issue of not getting a full page of results, the query is much slower (3200 ms vs 60 ms). Sadly, I'm not enough of a SQL expert to recognize what I could do to make it decently performant.
The Sequelize query I'm running is:
var query = {
  limit: 10,
  where: {},
  include: [
    {model: db.FinancialCompany, through: {where: {address_zip: req.query.zip}}, required: true},
    {model: db.Disclosure, required: false}
  ],
  order: [['last_name', 'ASC']],
  groupBy: ['FinancialProfessional.id'],
  subQuery: true
};

db.FinancialProfessional.findAndCount(query).then(function (professionals) {
  res.jsonp(professionals);
  return professionals;
});
which gets converted to
SELECT "FinancialProfessional".*,
"FinancialCompanies"."id" AS "FinancialCompanies.id",
"FinancialCompanies"."name" AS "FinancialCompanies.name",
"FinancialCompanies"."address_street" AS "FinancialCompanies.address_street",
"FinancialCompanies"."address_city" AS "FinancialCompanies.address_city",
"FinancialCompanies"."address_state" AS "FinancialCompanies.address_state",
"FinancialCompanies"."address_zip" AS "FinancialCompanies.address_zip",
"FinancialCompanies"."crd" AS "FinancialCompanies.crd",
"FinancialCompanies"."createdAt" AS "FinancialCompanies.createdAt",
"FinancialCompanies"."updatedAt" AS "FinancialCompanies.updatedAt",
"FinancialCompanies.ProfessionalToCompany"."address_street" AS "FinancialCompanies.ProfessionalToCompany.address_street",
"FinancialCompanies.ProfessionalToCompany"."address_city" AS "FinancialCompanies.ProfessionalToCompany.address_city",
"FinancialCompanies.ProfessionalToCompany"."address_state" AS "FinancialCompanies.ProfessionalToCompany.address_state",
"FinancialCompanies.ProfessionalToCompany"."address_zip" AS "FinancialCompanies.ProfessionalToCompany.address_zip",
"FinancialCompanies.ProfessionalToCompany"."createdAt" AS "FinancialCompanies.ProfessionalToCompany.createdAt",
"FinancialCompanies.ProfessionalToCompany"."updatedAt" AS "FinancialCompanies.ProfessionalToCompany.updatedAt",
"FinancialCompanies.ProfessionalToCompany"."FinancialCompanyId" AS "FinancialCompanies.ProfessionalToCompany.FinancialCompanyId",
"FinancialCompanies.ProfessionalToCompany"."FinancialProfessionalId" AS "FinancialCompanies.ProfessionalToCompany.FinancialProfessionalId",
"Disclosures"."id" AS "Disclosures.id",
"Disclosures"."info" AS "Disclosures.info",
"Disclosures"."createdAt" AS "Disclosures.createdAt",
"Disclosures"."updatedAt" AS "Disclosures.updatedAt",
"Disclosures"."FinancialProfessionalId" AS "Disclosures.FinancialProfessionalId",
"Disclosures"."RegulatoryAgencyId" AS "Disclosures.RegulatoryAgencyId"
FROM
(SELECT "FinancialProfessional"."id",
"FinancialProfessional"."full_name",
"FinancialProfessional"."last_name",
"FinancialProfessional"."alternate_names",
"FinancialProfessional"."title",
"FinancialProfessional"."crd",
"FinancialProfessional"."licensed",
"FinancialProfessional"."display_count",
"FinancialProfessional"."years_f",
"FinancialProfessional"."years_s",
"FinancialProfessional"."createdAt",
"FinancialProfessional"."updatedAt",
"FinancialProfessional"."UserId"
FROM "FinancialProfessionals" AS "FinancialProfessional"
WHERE
(SELECT "ProfessionalToCompany"."FinancialCompanyId"
FROM "ProfessionalToCompanies" AS "ProfessionalToCompany"
INNER JOIN "FinancialCompanies" AS "FinancialCompany" ON "ProfessionalToCompany"."FinancialCompanyId" = "FinancialCompany"."id"
WHERE ("FinancialProfessional"."id" = "ProfessionalToCompany"."FinancialProfessionalId"
AND "ProfessionalToCompany"."address_zip" = '94596') LIMIT 1) IS NOT NULL
GROUP BY "FinancialProfessional"."id"
ORDER BY "FinancialProfessional"."last_name" ASC LIMIT 10) AS "FinancialProfessional"
INNER JOIN ("ProfessionalToCompanies" AS "FinancialCompanies.ProfessionalToCompany"
INNER JOIN "FinancialCompanies" AS "FinancialCompanies" ON "FinancialCompanies"."id" = "FinancialCompanies.ProfessionalToCompany"."FinancialCompanyId"
AND "FinancialCompanies.ProfessionalToCompany"."address_zip" = '94596') ON "FinancialProfessional"."id" = "FinancialCompanies.ProfessionalToCompany"."FinancialProfessionalId"
LEFT OUTER JOIN "Disclosures" AS "Disclosures" ON "FinancialProfessional"."id" = "Disclosures"."FinancialProfessionalId"
ORDER BY "FinancialProfessional"."last_name" ASC;
Running EXPLAIN ANALYZE on the query gives me:
Nested Loop Left Join (cost=17155066.40..17155166.22 rows=1 width=2423) (actual time=5098.656..5098.780 rows=12 loops=1)
-> Nested Loop (cost=17155065.98..17155157.78 rows=1 width=2343) (actual time=5098.648..5098.736 rows=10 loops=1)
-> Nested Loop (cost=17155065.69..17155149.94 rows=1 width=227) (actual time=5098.642..5098.702 rows=10 loops=1)
-> Limit (cost=17155065.27..17155065.29 rows=10 width=161) (actual time=5098.618..5098.624 rows=10 loops=1)
-> Sort (cost=17155065.27..17158336.49 rows=1308489 width=161) (actual time=5098.617..5098.618 rows=10 loops=1)
Sort Key: "FinancialProfessional".last_name
Sort Method: top-N heapsort Memory: 27kB
-> Group (cost=0.43..17126789.29 rows=1308489 width=161) (actual time=10.895..5096.539 rows=909 loops=1)
Group Key: "FinancialProfessional".id
-> Index Scan using "FinancialProfessionals_pkey" on "FinancialProfessionals" "FinancialProfessional" (cost=0.43..17123518.07 rows=1308489 width=161) (actual time=10.893..5095.345 rows=909 loops=1)
Filter: ((SubPlan 1) IS NOT NULL)
Rows Removed by Filter: 1314155
SubPlan 1
-> Limit (cost=0.71..12.76 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=1315064)
-> Nested Loop (cost=0.71..12.76 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1315064)
-> Index Scan using "ProfessionalToCompanies_pkey" on "ProfessionalToCompanies" "ProfessionalToCompany" (cost=0.42..8.45 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1315064)
Index Cond: ("FinancialProfessional".id = "FinancialProfessionalId")
Filter: ((address_zip)::text = '94596'::text)
Rows Removed by Filter: 1
-> Index Only Scan using "FinancialCompanies_pkey" on "FinancialCompanies" "FinancialCompany" (cost=0.29..4.30 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=909)
Index Cond: (id = "ProfessionalToCompany"."FinancialCompanyId")
Heap Fetches: 0
-> Index Scan using "ProfessionalToCompanies_pkey" on "ProfessionalToCompanies" "FinancialCompanies.ProfessionalToCompany" (cost=0.42..8.45 rows=1 width=66) (actual time=0.006..0.006 rows=1 loops=10)
Index Cond: ("FinancialProfessionalId" = "FinancialProfessional".id)
Filter: ((address_zip)::text = '94596'::text)
-> Index Scan using "FinancialCompanies_pkey" on "FinancialCompanies" (cost=0.29..7.82 rows=1 width=2116) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = "FinancialCompanies.ProfessionalToCompany"."FinancialCompanyId")
-> Index Scan using fp_d_id on "Disclosures" (cost=0.42..8.44 rows=1 width=80) (actual time=0.003..0.003 rows=0 loops=10)
Index Cond: ("FinancialProfessional".id = "FinancialProfessionalId")
Planning time: 0.644 ms
Execution time: 5098.873 ms
Schema:
CREATE TABLE public."FinancialProfessionals"
(
id integer NOT NULL DEFAULT nextval('"FinancialProfessionals_id_seq"'::regclass),
full_name character varying(255),
last_name character varying(255),
alternate_names character varying(255)[],
title character varying(255)[],
crd integer,
licensed boolean,
"createdAt" timestamp with time zone NOT NULL,
"updatedAt" timestamp with time zone NOT NULL,
tsv tsvector,
"UserId" integer,
display_count integer DEFAULT 0,
years_f integer,
years_s integer,
CONSTRAINT "FinancialProfessionals_pkey" PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX last_name_idx
ON public."FinancialProfessionals"
USING btree
(last_name COLLATE pg_catalog."default");
CREATE INDEX name_idx
ON public."FinancialProfessionals"
USING gin
(tsv);
CREATE INDEX crd_idx
ON public."FinancialProfessionals"
USING btree
(crd);
CREATE TABLE public."ProfessionalToCompanies"
(
address_street character varying(255),
address_city character varying(255),
address_state character varying(255),
address_zip character varying(255),
"createdAt" timestamp with time zone NOT NULL,
"updatedAt" timestamp with time zone NOT NULL,
"FinancialProfessionalId" integer NOT NULL,
"FinancialCompanyId" integer NOT NULL,
CONSTRAINT "ProfessionalToCompanies_pkey" PRIMARY KEY ("FinancialProfessionalId", "FinancialCompanyId"),
CONSTRAINT "ProfessionalToCompanies_FinancialCompanyId_fkey" FOREIGN KEY ("FinancialCompanyId")
REFERENCES public."FinancialCompanies" (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT "ProfessionalToCompanies_FinancialProfessionalId_fkey" FOREIGN KEY ("FinancialProfessionalId")
REFERENCES public."FinancialProfessionals" (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
CREATE INDEX zip_idx
ON public."ProfessionalToCompanies"
USING btree
(address_zip COLLATE pg_catalog."default");
CREATE TABLE public."FinancialCompanies"
(
id integer NOT NULL DEFAULT nextval('"FinancialCompanies_id_seq"'::regclass),
name character varying(255),
address_street character varying(255),
address_city character varying(255),
address_state character varying(255),
address_zip character varying(255),
crd integer,
"createdAt" timestamp with time zone NOT NULL,
"updatedAt" timestamp with time zone NOT NULL,
company_name_tsv tsvector,
years_f integer,
CONSTRAINT "FinancialCompanies_pkey" PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX company_name_idx
ON public."FinancialCompanies"
USING gin
(company_name_tsv);
CREATE TABLE public."Disclosures"
(
id integer NOT NULL DEFAULT nextval('"Disclosures_id_seq"'::regclass),
info text,
"createdAt" timestamp with time zone NOT NULL,
"updatedAt" timestamp with time zone NOT NULL,
"FinancialProfessionalId" integer,
"RegulatoryAgencyId" integer,
CONSTRAINT "Disclosures_pkey" PRIMARY KEY (id),
CONSTRAINT "Disclosures_FinancialProfessionalId_fkey" FOREIGN KEY ("FinancialProfessionalId")
REFERENCES public."FinancialProfessionals" (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE SET NULL,
CONSTRAINT "Disclosures_RegulatoryAgencyId_fkey" FOREIGN KEY ("RegulatoryAgencyId")
REFERENCES public."RegulatoryAgencies" (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE SET NULL
)
WITH (
OIDS=FALSE
);
CREATE INDEX fp_d_id
ON public."Disclosures"
USING btree
("FinancialProfessionalId");
CREATE INDEX fp_r_id
ON public."Disclosures"
USING btree
("RegulatoryAgencyId");
FWIW, the following query runs in about 64 ms:
SELECT fp.full_name, array_agg(ptc), array_agg(d)
FROM
"ProfessionalToCompanies" ptc
JOIN "FinancialCompanies" fc ON ptc."FinancialCompanyId" = fc.id
JOIN "FinancialProfessionals" fp ON fp.id = ptc."FinancialProfessionalId"
LEFT OUTER JOIN "Disclosures" d ON fp.id = d."FinancialProfessionalId"
WHERE ptc.address_zip = '94596'
GROUP BY fp.id
ORDER BY fp.last_name ASC
LIMIT 10;
Is there some sort of index I could add, or something else I could change, to make this query performant?
Upvotes: 1
Views: 184
Reputation: 26464
The obvious index candidate would be on your ordering criteria. With such an index, PostgreSQL can do a nested loop over the index, in order, until the LIMIT is satisfied. That would certainly help quite a bit.
However, be careful with that: such an index will likely perform quite a bit worse if there are many rows that have to be skipped over because of the other criteria.
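As a quick sanity check (a sketch against the schema above; last_name_idx already exists, so this only verifies the planner actually walks it in order and stops early):
-- Strip the query down to just the ordering; a good plan here is an index
-- scan on last_name_idx with no separate sort node, stopping after 10 rows.
EXPLAIN ANALYZE
SELECT id, last_name
FROM "FinancialProfessionals"
ORDER BY last_name ASC
LIMIT 10;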
Edit
On seeing the explain analyze portion, what struck me was that the nested loop for the subquery retrieves no results in most cases, yet runs 1.3 million times. That actually accounts for the bulk of the time you are reporting as grouping time. The actual sort is very quick because there are almost no rows left at that point. Maybe try an index on (last_name, id), in that order?
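A sketch of that composite index (hypothetical name, following the naming style of the existing indexes):
-- Lets the planner read rows already ordered by last_name while also having
-- the id at hand for the per-row zip check.
CREATE INDEX last_name_id_idx
  ON public."FinancialProfessionals"
  USING btree
  (last_name, id);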
At this point I am not entirely sure. Also check your GEQO settings.
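For reference (these are standard PostgreSQL settings; disabling GEQO for a session is a quick way to rule it out while testing):
SHOW geqo;
SHOW geqo_threshold;
-- Take the genetic query optimizer out of the picture for this session:
SET geqo = off;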
Edit 2
The problem, as I read the ANALYZE result, is that you are forced to aggregate inside a subquery that is then used in a WHERE clause. This would explain why using subQuery has such a negative performance impact.
Then you have the LIMIT, which makes PostgreSQL think "hey, I can do a nested loop here and it will probably be faster, since I can stop after I find 10 rows", but as it goes through the nested loop it almost never finds any rows, so that turns out to be a very bad plan.
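One rewrite that sidesteps the per-row subquery (a sketch, not something this Sequelize query will generate for you: it replaces the "(SELECT ... LIMIT 1) IS NOT NULL" filter with EXISTS, which PostgreSQL can plan as a semi-join):
SELECT fp.id, fp.last_name
FROM "FinancialProfessionals" fp
WHERE EXISTS
  (SELECT 1
   FROM "ProfessionalToCompanies" ptc
   -- The join to "FinancialCompanies" is dropped here: the NOT NULL foreign
   -- key guarantees a matching company row, so it cannot filter anything out.
   WHERE ptc."FinancialProfessionalId" = fp.id
     AND ptc.address_zip = '94596')
ORDER BY fp.last_name ASC
LIMIT 10;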
I don't see an easy way to optimize this through the ORM without putting some other layer in place.
Upvotes: 1