Reputation: 1953
I am using an AWS Redshift table the holds information about invocations of functions. Each row has a date (of timestamp type), a UID (varchar), and several fields such as duration, error code. The size of the table is ~25 million rows of ~1000 different functions (each with a different UID).
My problem is that simple queries as a count of invocations of several functions in a time window take much time - usually 5-30 seconds.
I have tried different combinations of sort keys and dist key, but the performance seems to remain quite similar:
Setting the function UID as dist key
Setting a compound sort key of the date, the function UID and a combination of both in any order.
I have run VACUUM and ANALYZE on the table. I also tried to add/remove columns compression.
I am using only a single dc2.large node.
EDIT:
The table DDL is:
create table public."invocations_metrics_$mig"(
"function_uid" varchar(256) NOT NULL encode RAW DISTKEY
,"date" timestamp encode zstd
,"duration" double precision encode zstd
,"used_memory" integer encode zstd
,"error" smallint encode zstd
,"has_early_exit" boolean encode zstd
,"request_id" varchar(256) encode zstd
)
SORTKEY(date,function_uid);
An example of a row:
"aca500c9-27cc-47f8-a98f-ef71cbc7c0ef","2018-08-15 13:43:28.718",0.17,27,0,false,"30ee84e1-a091-11e8-ba47-b110721c41bc"
The query:
SELECT
count(invocations_metrics_backup.function_uid) AS invocations,
max(invocations_metrics_backup.date) AS last_invocation,
invocations_metrics_backup.function_uid AS uid
FROM
invocations_metrics_backup
WHERE
function_uid IN (
<10 UIDs>
)
AND DATE >= '2018-08-20T10:55:20.222812'::TIMESTAMP
GROUP BY
function_uid
Total time is 5 seconds. The count in each query is ~5000. For the same query with a ~1M count it takes 30 seconds.
Upvotes: 1
Views: 1021
Reputation: 14035
First, you need to use at least 2 nodes. A single node has to do double duty as leader and compute. With 2 or more nodes you get a free leader node.
Then, change your DDL as follows, removing compression on the sort key:
CREATE TABLE public."invocations_metrics_$mig" (
"function_uid" varchar(256) NOT NULL ENCODE ZSTD,
"date" timestamp ENCODE RAW,
"duration" double precision ENCODE ZSTD,
"used_memory" integer ENCODE ZSTD,
"error" smallint ENCODE ZSTD,
"has_early_exit" boolean ENCODE ZSTD,
"request_id" varchar(256) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY( function_uid )
SORTKEY ( date )
;
You may also improve performance by mapping unique UIDs to an integer ID value and using that in your queries. UID values are quite inefficient to work with. The values occur randomly and are relatively wide with very high entropy. They are expensive during sorts, hash aggregations, and hash joins.
Upvotes: 1