Reputation: 173
I have been given a complex PostgreSQL query, and been asked to optimise it. I've made some progress in establishing where the problem might be, however I'm running out of ideas for how to approach optimising it. I'm hoping for some pointers/suggestions.
This query is part of an "AI Chatbot" system, it returns the number of unique conversations had with the chatbot, with a few filters. The data returned looks like:
conversationId | userId | userName | channel | eventCount | operator | firstMessageDate | lastMessageDate | totalUniqueUsers |
---|---|---|---|---|---|---|---|---|
Id of the conversation | Id of the user | Name of the user | Number of events that occured during conversation | If true, the conversation was with an "admin" or inhouse person | Date of the first message in the conversation | Date of the last message in the conversatrion | Total number of unique user conversations |
The problem with the query seems to be related to a specific 'inner join' clause, joining on a large (10540420 rows) table.
This is the query with the inner join commented out, this way it will run in around 2 seconds.
select
"conversation_table"."conversationid" as "conversationId",
"du"."userid" as "userId",
"du"."username" as "userName",
"du"."channel" as "channel",
COALESCE(user_events.count, 0):: integer as "eventCount",
BOOL_OR(dc.operator_messages > 0) as "operator",
DATE_PART(
'epoch', dc.first_message_timestamp
)* 1000 as "firstMessageDate",
DATE_PART(
'epoch', dc.last_message_timestamp
)* 1000 as "lastMessageDate",
COUNT(*) OVER():: integer as "totalUniqueUsers"
from
"fact_conversation_data" as "conversation_table"
inner join "dim_user" as "du" on "conversation_table"."dim_user_user_id" = "du"."user_id"
inner join "dim_time" as "dt" on "conversation_table"."dim_time_time_id" = "dt"."time_id"
inner join "dim_conversation" as "dc" on "conversation_table"."conversationid" = "dc"."conversationid"
-- inner join (
-- select
-- "conversationid"
-- from
-- "fact_milestone_event" as fme
-- where
-- fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac')
-- group by "conversationid"
-- ) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid"
left join (
select
"fme"."conversationid",
count("fme"."event_id")
from
"fact_milestone_event" as "fme"
where
"fme"."timestamp" >= '2022-12-18 11:00:00.000'
and "fme"."timestamp" <= '2023-01-18 10:59:59.999'
and "fme"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
group by
"fme"."conversationid"
) as "user_events" on "conversation_table"."conversationid" = "user_events"."conversationid"
where
"conversation_table"."timestamp" >= '2022-12-18 11:00:00.000'
and "conversation_table"."timestamp" <= '2023-01-18 10:59:59.999'
and "dt"."bot_zone" = 'Pacific/Auckland'
and "du"."is_platform_user" <> true
and "conversation_table"."conversationid" in (
select
"notEmptyConservation_table"."conversationid"
from
(
select
sum("user_messages") as "total_user_message_count",
"conversationid"
from
"fact_conversation_data"
where
"fact_conversation_data"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
and "fact_conversation_data"."timestamp" >= '2022-12-18 11:00:00.000'
and "fact_conversation_data"."timestamp" <= '2023-01-18 10:59:59.999'
group by
"conversationid"
) as "notEmptyConservation_table"
where
"notEmptyConservation_table"."total_user_message_count" > 1
)
and "conversation_table"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
and (
"conversation_table"."user_messages" > 0
)
group by
"user_events"."count",
"conversation_table"."conversationid",
"du"."userid",
"du"."username",
"du"."channel",
"dc"."first_message_timestamp",
"dc"."last_message_timestamp"
order by
"lastMessageDate" asc
limit
20
However, if I uncomment the inner join, I get a runtime of about 9 minutes.
Here's a link to the SLOW query and execution plan in explain.depesz.com.
Here's a link for the faster version of the query, without the problematic inner join.
Here are the indexes for the two main tables involved, fact_milestone_event
and fact_conversation_data
.
Table Name | Index Name | Index Definition |
---|---|---|
fact_conversation_data | fact_conversation_data_conversationid_idx | CREATE INDEX fact_conversation_data_conversationid_idx ON public.fact_conversation_data USING btree (conversationid) |
fact_conversation_data | fact_conversation_data_tenant_id_idx | CREATE INDEX fact_conversation_data_tenant_id_idx ON public.fact_conversation_data USING btree (dim_tenant_tenant_id) |
fact_conversation_data | fact_conversation_data_timestamp_idx | CREATE INDEX fact_conversation_data_timestamp_idx ON public.fact_conversation_data USING btree (timestamp) |
fact_conversation_data | conversationid_timeid_unique_idx | CREATE UNIQUE INDEX conversationid_timeid_unique_idx ON public.fact_conversation_data USING btree (conversationid, dim_time_time_id) |
fact_conversation_data | fact_conversation_data_pk | CREATE UNIQUE INDEX fact_conversation_data_pk ON public.fact_conversation_data USING btree (conversation_data_id) |
fact_milestone_event | fact_milestone_event_dim_segment_segment_id_idx | CREATE INDEX fact_milestone_event_dim_segment_segment_id_idx ON public.fact_milestone_event USING btree (dim_segment_segment_id) |
fact_milestone_event | fact_milestone_event_time_id_idx | CREATE INDEX fact_milestone_event_time_id_idx ON public.fact_milestone_event USING btree (dim_time_time_id) |
fact_milestone_event | fact_milestone_event_dim_milestone_milestone_id_idx | CREATE INDEX fact_milestone_event_dim_milestone_milestone_id_idx ON public.fact_milestone_event USING btree (dim_milestone_milestone_id) |
fact_milestone_event | fact_milestone_event_conversationid_idx | CREATE INDEX fact_milestone_event_conversationid_idx ON public.fact_milestone_event USING btree (conversationid) |
fact_milestone_event | fact_milestone_event_timestamp_idx | CREATE INDEX fact_milestone_event_timestamp_idx ON public.fact_milestone_event USING btree (timestamp) |
fact_milestone_event | fact_milestone_event_dim_tenant_tenant_id_idx | CREATE INDEX fact_milestone_event_dim_tenant_tenant_id_idx ON public.fact_milestone_event USING btree (dim_tenant_tenant_id) |
fact_milestone_event | fact_milestone_event_dim_user_user_id_idx | CREATE INDEX fact_milestone_event_dim_user_user_id_idx ON public.fact_milestone_event USING btree (dim_user_user_id) |
fact_milestone_event | fact_milestone_event_pk | CREATE UNIQUE INDEX fact_milestone_event_pk ON public.fact_milestone_event USING btree (event_id) |
I've tried replacing the inner join with a WHERE clause with a nested select. I've tried running EXPLAIN
against the query, but had trouble understanding the output. I've tried adding more conditions to the inner join (as below).
inner join (
select "conversationid"
from "fact_milestone_event" as fme
where
fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac') AND
fme.timestamp >= '2022-12-18 11:00:00.000' AND
fme.timestamp <= '2023-01-18 10:59:59.999' AND
fme.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
group by "conversationid"
) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid"
Upvotes: 0
Views: 155
Reputation: 164639
I think the sub-selects can be eliminated from the joins and their logic moved into the larger query. Rather than repeating the constraints, check for equality.
inner join fact_milestone_event as fme on
fme.conversationid = conversation_table.conversationid
and fme.dim_tenant_tenant_id = conversation_table.dim_tenant_tenant_id
and fme.timestamp = conversation_table.timestamp
where
conversation_table.timestamp >= '2022-12-18 11:00:00.000'
and conversation_table.timestamp <= '2023-01-18 10:59:59.999'
and conversation_table.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
The logic in notEmptyConservation_table can use the existing fme join.
Upvotes: 0