Ajeet

Reputation: 68

Query optimization to fetch data by joining tables, with operations and conditions on jsonb fields

I have distribution_order and distribution_order_item tables with the following schema.

CREATE TABLE public.distribution_order
(
    distributionorderid bigint NOT NULL DEFAULT nextval('distribution_order_distributionorderid_seq'::regclass),
    sourcelocationid character varying(20) COLLATE pg_catalog."default" NOT NULL,
    destinationlocationid character varying(20) COLLATE pg_catalog."default" NOT NULL,
    distributionorderheader jsonb NOT NULL,
    status character varying(25) COLLATE pg_catalog."default" NOT NULL,
    createdtimestamp timestamp without time zone NOT NULL,
    lastupdatedtimestamp timestamp without time zone,
    CONSTRAINT distribution_order_pkey PRIMARY KEY (distributionorderid)
)

---------------------------------------------------
CREATE TABLE public.distribution_order_item
(
    distributionorderid bigint NOT NULL,
    packid character varying(50) COLLATE pg_catalog."default" NOT NULL,
    status character varying(25) COLLATE pg_catalog."default" NOT NULL,
    itemdata jsonb NOT NULL,
    createdtimestamp timestamp without time zone NOT NULL,
    lastupdatedtimestamp timestamp without time zone,
    CONSTRAINT uq_distribution_order_item UNIQUE (distributionorderid, packid),
    CONSTRAINT fk_distributionorderid FOREIGN KEY (distributionorderid)
        REFERENCES public.distribution_order (distributionorderid) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)

distribution_order.distributionorderheader jsonb schema:

{
  "orderType": "stock",
  "destinationLocationId": "1008",
  "orderIds": [
    "e63b9007-dcaa-4c33-bd1d-e5fbcced0913"
  ],
  "sourceLocationId": "1002",
  "deliveryDate": "2018-07-13T23:00:00.000Z",
  "orderedDate": "2018-07-11T17:00:00.000Z"
}

distribution_order_item.itemdata jsonb schema:

{
  "packId": "100003366",
  "orderedQuantity": 0,
  "pickedInfo": [
    {
      "pickDate": "2019-04-17T10:05:36Z",
      "lotNumber": "20191218",
      "quantityType": "CS",
      "containerId": "10000000000000014290",
      "quantity": "4.000000"
    }
  ],
  "shipInfo": [
    {
      "shippedDate": "2019-04-17T10:05:36Z",
      "lotNumber": "20191219",
      "quantityType": "CS",
      "containerId": "33333",
      "shippedQuantity": "4.000000"
    }
  ]
}

itemdata in the distribution_order_item table is updated on each shipment received, with an additional jsonb block appended under itemdata.shipInfo. I have a requirement to calculate the shipped quantity by summing 'shippedQuantity' under itemdata.shipInfo, and the picked quantity by summing 'quantity' under itemdata.pickedInfo. The query I have written is below; it is long-running, and any help to optimize it will be highly appreciated.

WITH D_ORD AS (
   SELECT distributionorderid,
   destinationlocationid,
   distributionorderheader->>'orderedDate' as od,
   distributionorderheader->>'deliveryDate' as dd
   FROM distribution_order
   WHERE distributionorderheader->>'deliveryDate' > '2019-04-23 17:09:46.46' ),
D_ORD_ITEMS AS (    
    SELECT  
        d.distributionorderid,
        i.itemdata->>'itemId' AS itemid,
        d.od::TIMESTAMP AS ordereddate,
        d.destinationlocationid,
        i.packid AS unitid,
        i.itemdata->>'orderedQuantity' orderedquantity,
        CASE WHEN i.status='SHIPPED' OR i.status='COMPLETE' 
             THEN CASE WHEN (i.itemdata->>'orderedQuantity')::float - sum((s->>'shippedQuantity')::float) >= 0 
                       THEN (i.itemdata->>'orderedQuantity')::float - sum((s->>'shippedQuantity')::float) 
                       ELSE 0 
                  END 
             ELSE CASE WHEN i.status='PICKED'  
                       THEN CASE WHEN (i.itemdata->>'orderedQuantity')::float - sum((p->>'quantity')::float) >= 0 
                                 THEN (i.itemdata->>'orderedQuantity')::float - sum((p->>'quantity')::float) 
                                 ELSE 0 
                            END 
                       ELSE 0 
                  END 
        END AS cancelorderquantity, 
        CASE WHEN (CASE WHEN i.status='SHIPPED' OR i.status='COMPLETE' 
                        THEN (i.itemdata->>'orderedQuantity')::float - sum((s->>'shippedQuantity')::float) 
                        ELSE 0 
                   END >0) 
             THEN CURRENT_DATE::TIMESTAMP::text 
             ELSE CASE WHEN (CASE WHEN i.status='PICKED'  
                                  THEN (i.itemdata->>'orderedQuantity')::float - sum((p->>'quantity')::float) 
                                  ELSE 0 
                             END >0) 
                       THEN CURRENT_DATE::TIMESTAMP::text 
                       ELSE '' 
                  END 
        END AS cancelleddate, 
        CASE WHEN (sum((s->>'shippedQuantity')::float) > 0 OR sum((p->>'quantity')::float) >0) 
             THEN CURRENT_DATE::TIMESTAMP::text 
             ELSE '' 
        END AS arrivedate , 
        CASE WHEN extract(HOUR FROM (d.dd)::TIMESTAMP) =23 
             THEN ((d.dd::TIMESTAMP + interval '1 DAY')::date + interval '6 hour') 
             ELSE d.dd::TIMESTAMP::date + interval '6 hour' 
        END AS exp_av, 
        CASE WHEN sum((s->>'shippedQuantity')::float) >0 
             THEN sum((s->>'shippedQuantity')::float) 
             ELSE CASE WHEN sum((p->>'quantity')::float)>0 
                       THEN sum((p->>'quantity')::float) 
                       ELSE 0 
                  END 
        END AS receivedquantity 
    FROM D_ORD d
    JOIN distribution_order_item i ON i.distributionorderid = d.distributionorderid
    LEFT JOIN jsonb_array_elements(i.itemdata->'shipInfo') s ON TRUE
    LEFT JOIN jsonb_array_elements(i.itemdata->'pickedInfo') p ON TRUE
    GROUP BY d.distributionorderid,d.destinationlocationid, i.itemdata->>'itemId', d.od::TIMESTAMP, i.packid, 
                i.itemdata->>'orderedQuantity', 
                i.itemdata->>'packSize', 
                i.status, 
                d.dd::TIMESTAMP)

SELECT * FROM D_ORD_ITEMS;

Upvotes: 0

Views: 48

Answers (1)

Michael Buen

Reputation: 39393

If the JSON's schema is well-defined, don't use JSON. Refactor your table to use normal fields instead of JSON, then create indexes on the performance-critical fields. JSON fields are good for a loosely-defined schema, like patients' records.
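
For example, the shipInfo and pickedInfo arrays from the question could be promoted to child tables, one row per shipment or pick event. A rough sketch, with made-up table, column, and index names, assuming the structure shown in the question:

CREATE TABLE public.distribution_order_item_shipment
(
    distributionorderid bigint NOT NULL,
    packid character varying(50) NOT NULL,
    shippeddate timestamp without time zone NOT NULL,
    lotnumber character varying(25),
    containerid character varying(50),
    shippedquantity numeric NOT NULL,
    CONSTRAINT fk_item_shipment FOREIGN KEY (distributionorderid, packid)
        REFERENCES public.distribution_order_item (distributionorderid, packid)
);

CREATE INDEX ix_item_shipment
    ON public.distribution_order_item_shipment (distributionorderid, packid);

-- shipped quantity per item then becomes an ordinary, indexable aggregate
SELECT distributionorderid, packid, sum(shippedquantity) AS shippedquantity
FROM public.distribution_order_item_shipment
GROUP BY distributionorderid, packid;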

When deciding between normal fields and JSON fields, step away from the database first. Think of which application you would use to save your data if there were no RDBMS.

If you would reach for Excel to save your data, that means your data model tends to be tabular..

action      product     qty
received    mouse       3
sold        mouse       2
received    keyboard    26

..then just use normal fields; don't go fancy with JSON fields.

If you would tend to reach for Word (or even OneNote or Notepad) instead of Excel to save your data (e.g., patients' records), that's a good indicator that your data model is loosely defined, which is exactly what JSON facilitates, so by all means use JSON. Since not all patients have the same kinds of details to keep on record, it would be hard, if not impossible, to add new fields to the table every time there is a new detail to record; so yes, use JSON instead.

Patient: John Doe

Intake Criteria: 
    Dm Dx Date: 2012/03/12
    Initial Hgb A1c: 6.5
    Co Morbid: Hypertension, CAD 

Labs:
    LDL Cholestrol:
        LDL Level: 122.5,
        LDL Result Date: 2012/07/06
    Serum Creatinine:
        CreatinineLevel: 1.4
        Creatinine Result Date: 12/07/08

------


Patient: Jane Doe

Blood Type: DB

CareLocation:
    FacilityName: East Side Clinic
    ContactEmail: [email protected]

Weight: 60kg

If your data tend to be document-based (think OneNote, Word, etc), use JSON. If your data tend to be sheet-based (that is, tabular, think Excel), don't use JSON, use normal fields instead. Normal fields are natural for RDBMS to access, and natural for RDBMS to create an index on.

If refactoring the database to convert your JSON properties to normal fields can't be accommodated in your current development schedule, you can still make those JSON properties behave like "normal fields" by creating an index on them.

Create an index on the well-defined data, that is, the JSON properties that are regularly accessed and/or always present in your JSON documents; most of them show up in your GROUP BY and WHERE clauses.

It's fortunate that Postgres can create an index on an expression, which means you can also create an index on a JSON expression (e.g., itemdata->>'itemId'). If you had used another RDBMS that can't index an expression, you would have painted yourself into a corner, as execution plans for queries that use JSON would always resolve to a sequential scan of the table instead of an index scan.
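
Applied to the query in the question, that means indexing the JSON expression used in the WHERE clause. A minimal sketch (the index name is only illustrative, and it assumes the deliveryDate value keeps being compared as text, as the query already does):

-- supports: WHERE distributionorderheader->>'deliveryDate' > '...'
CREATE INDEX ix_distribution_order__deliverydate
    ON public.distribution_order ((distributionorderheader->>'deliveryDate'));

-- the join itself is already covered: the primary key on distribution_order and
-- the unique constraint (distributionorderid, packid) on distribution_order_item
-- both back btree indexes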

It's still better design to use normal fields instead of JSON properties for data that is well-defined and regularly used. Factor the database refactoring into your development schedule, and convert those JSON properties to normal fields later.

To cut to the chase, here's a proof-of-concept showing that a query involving JSON can still get the performance associated with normal fields by creating an index on regularly-used JSON properties.

create table tbl
(
  id int generated by default as identity primary key,
  -- on older version of Postgres, use this instead:
  --   id serial primary key

  document jsonb not null
);


create index ix_tbl__document_code on tbl(((document->>'code')::int));

insert into tbl(document) 
select 
    jsonb_build_object
    ('name', 'Slim', 
     'nick', 'Shady', 
     'category', 'Musician', 
     'ack', x.i, 
     'code', x.i
     )
from generate_series(1,500000) as x(i);
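-- note: 'ack' and 'code' hold the same value in every row, so the two plans
-- below compare an indexed lookup against an unindexed one over the same data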

Here's the execution plan when there's an index on (document->>'code')::int expression:

explain analyze
select * from tbl
where (document->>'code')::int = 42;

Output: (execution plan screenshot)

Here's the execution plan if there is no index created on expression:

explain analyze
select * from tbl
where (document->>'ack')::int = 42;

Output: (execution plan screenshot)

0.018 milliseconds (with the expression index) vs 52.335 milliseconds (without it).

Upvotes: 1
