How can I optimise this complex PostgreSQL query

Question

I have been given a complex PostgreSQL query, and been asked to optimise it. I've made some progress in establishing where the problem might be, however I'm running out of ideas for how to approach optimising it. I'm hoping for some pointers/suggestions.

This query is part of an "AI Chatbot" system, it returns the number of unique conversations had with the chatbot, with a few filters. The data returned looks like:

conversationId	userId	userName	channel	eventCount	operator	firstMessageDate	lastMessageDate	totalUniqueUsers
Id of the conversation	Id of the user	Name of the user	Number of events that occured during conversation	If true, the conversation was with an "admin" or inhouse person	Date of the first message in the conversation	Date of the last message in the conversatrion	Total number of unique user conversations

The problem with the query seems to be related to a specific 'inner join' clause, joining on a large (10540420 rows) table.

This is the query with the inner join commented out, this way it will run in around 2 seconds.

select 
  "conversation_table"."conversationid" as "conversationId", 
  "du"."userid" as "userId", 
  "du"."username" as "userName", 
  "du"."channel" as "channel", 
  COALESCE(user_events.count, 0):: integer as "eventCount", 
  BOOL_OR(dc.operator_messages > 0) as "operator", 
  DATE_PART(
    'epoch', dc.first_message_timestamp
  )* 1000 as "firstMessageDate", 
  DATE_PART(
    'epoch', dc.last_message_timestamp
  )* 1000 as "lastMessageDate", 
  COUNT(*) OVER():: integer as "totalUniqueUsers" 
from 
  "fact_conversation_data" as "conversation_table" 
  inner join "dim_user" as "du" on "conversation_table"."dim_user_user_id" = "du"."user_id" 
  inner join "dim_time" as "dt" on "conversation_table"."dim_time_time_id" = "dt"."time_id" 
  inner join "dim_conversation" as "dc" on "conversation_table"."conversationid" = "dc"."conversationid" 
--   inner join (
--     select 
--       "conversationid" 
--     from 
--       "fact_milestone_event" as fme
--     where 
--       fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac')
--       group by "conversationid"
--   ) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid" 
  left join (
    select 
      "fme"."conversationid", 
      count("fme"."event_id") 
    from 
      "fact_milestone_event" as "fme" 
    where 
      "fme"."timestamp" >= '2022-12-18 11:00:00.000' 
      and "fme"."timestamp" <= '2023-01-18 10:59:59.999' 
      and "fme"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
    group by 
      "fme"."conversationid"
  ) as "user_events" on "conversation_table"."conversationid" = "user_events"."conversationid" 
where 
  "conversation_table"."timestamp" >= '2022-12-18 11:00:00.000' 
  and "conversation_table"."timestamp" <= '2023-01-18 10:59:59.999' 
  and "dt"."bot_zone" = 'Pacific/Auckland' 
  and "du"."is_platform_user" <> true 
  and "conversation_table"."conversationid" in (
    select 
      "notEmptyConservation_table"."conversationid" 
    from 
      (
        select 
          sum("user_messages") as "total_user_message_count", 
          "conversationid" 
        from 
          "fact_conversation_data" 
        where 
          "fact_conversation_data"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
          and "fact_conversation_data"."timestamp" >= '2022-12-18 11:00:00.000' 
          and "fact_conversation_data"."timestamp" <= '2023-01-18 10:59:59.999' 
        group by 
          "conversationid"
      ) as "notEmptyConservation_table" 
    where 
      "notEmptyConservation_table"."total_user_message_count" > 1
  ) 
  and "conversation_table"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
  and (
    "conversation_table"."user_messages" > 0
  ) 
group by 
  "user_events"."count", 
  "conversation_table"."conversationid", 
  "du"."userid", 
  "du"."username", 
  "du"."channel", 
  "dc"."first_message_timestamp", 
  "dc"."last_message_timestamp" 
order by 
  "lastMessageDate" asc 
limit 
  20

However, if I uncomment the inner join, I get a runtime of about 9 minutes.

Here's a link to the SLOW query and execution plan in explain.depesz.com.

Here's a link for the faster version of the query, without the problematic inner join.

Here are the indexes for the two main tables involved, fact_milestone_event and fact_conversation_data.

Table Name	Index Name	Index Definition
fact_conversation_data	fact_conversation_data_conversationid_idx	CREATE INDEX fact_conversation_data_conversationid_idx ON public.fact_conversation_data USING btree (conversationid)
fact_conversation_data	fact_conversation_data_tenant_id_idx	CREATE INDEX fact_conversation_data_tenant_id_idx ON public.fact_conversation_data USING btree (dim_tenant_tenant_id)
fact_conversation_data	fact_conversation_data_timestamp_idx	CREATE INDEX fact_conversation_data_timestamp_idx ON public.fact_conversation_data USING btree (timestamp)
fact_conversation_data	conversationid_timeid_unique_idx	CREATE UNIQUE INDEX conversationid_timeid_unique_idx ON public.fact_conversation_data USING btree (conversationid, dim_time_time_id)
fact_conversation_data	fact_conversation_data_pk	CREATE UNIQUE INDEX fact_conversation_data_pk ON public.fact_conversation_data USING btree (conversation_data_id)
fact_milestone_event	fact_milestone_event_dim_segment_segment_id_idx	CREATE INDEX fact_milestone_event_dim_segment_segment_id_idx ON public.fact_milestone_event USING btree (dim_segment_segment_id)
fact_milestone_event	fact_milestone_event_time_id_idx	CREATE INDEX fact_milestone_event_time_id_idx ON public.fact_milestone_event USING btree (dim_time_time_id)
fact_milestone_event	fact_milestone_event_dim_milestone_milestone_id_idx	CREATE INDEX fact_milestone_event_dim_milestone_milestone_id_idx ON public.fact_milestone_event USING btree (dim_milestone_milestone_id)
fact_milestone_event	fact_milestone_event_conversationid_idx	CREATE INDEX fact_milestone_event_conversationid_idx ON public.fact_milestone_event USING btree (conversationid)
fact_milestone_event	fact_milestone_event_timestamp_idx	CREATE INDEX fact_milestone_event_timestamp_idx ON public.fact_milestone_event USING btree (timestamp)
fact_milestone_event	fact_milestone_event_dim_tenant_tenant_id_idx	CREATE INDEX fact_milestone_event_dim_tenant_tenant_id_idx ON public.fact_milestone_event USING btree (dim_tenant_tenant_id)
fact_milestone_event	fact_milestone_event_dim_user_user_id_idx	CREATE INDEX fact_milestone_event_dim_user_user_id_idx ON public.fact_milestone_event USING btree (dim_user_user_id)
fact_milestone_event	fact_milestone_event_pk	CREATE UNIQUE INDEX fact_milestone_event_pk ON public.fact_milestone_event USING btree (event_id)

I've tried replacing the inner join with a WHERE clause with a nested select. I've tried running EXPLAIN against the query, but had trouble understanding the output. I've tried adding more conditions to the inner join (as below).

inner join (
  select "conversationid" 
  from "fact_milestone_event" as fme
  where 
  fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac') AND
  fme.timestamp >= '2022-12-18 11:00:00.000' AND
  fme.timestamp <= '2023-01-18 10:59:59.999' AND
  fme.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
  group by "conversationid"
) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid"

How can I optimise this complex PostgreSQL query

Answers (1)

Related Questions