Redshift performance for simple time series data

Question

I am using an AWS Redshift table the holds information about invocations of functions. Each row has a date (of timestamp type), a UID (varchar), and several fields such as duration, error code. The size of the table is ~25 million rows of ~1000 different functions (each with a different UID).

My problem is that simple queries as a count of invocations of several functions in a time window take much time - usually 5-30 seconds.

I have tried different combinations of sort keys and dist key, but the performance seems to remain quite similar:

Setting the function UID as dist key
Setting a compound sort key of the date, the function UID and a combination of both in any order.

I have run VACUUM and ANALYZE on the table. I also tried to add/remove columns compression.

I am using only a single dc2.large node.

EDIT:

The table DDL is:

create table public."invocations_metrics_$mig"(
"function_uid" varchar(256)  NOT NULL encode RAW DISTKEY
,"date" timestamp    encode zstd 
,"duration" double precision   encode zstd 
,"used_memory" integer   encode zstd 
,"error" smallint   encode zstd 
,"has_early_exit" boolean   encode zstd 
,"request_id" varchar(256)   encode zstd 
)
 SORTKEY(date,function_uid);

An example of a row:

"aca500c9-27cc-47f8-a98f-ef71cbc7c0ef","2018-08-15 13:43:28.718",0.17,27,0,false,"30ee84e1-a091-11e8-ba47-b110721c41bc"

The query:

SELECT
    count(invocations_metrics_backup.function_uid) AS invocations,
    max(invocations_metrics_backup.date) AS last_invocation,
    invocations_metrics_backup.function_uid AS uid
FROM
    invocations_metrics_backup
WHERE
    function_uid IN (
        <10 UIDs>
    )
    AND DATE >= '2018-08-20T10:55:20.222812'::TIMESTAMP
GROUP BY
    function_uid

Total time is 5 seconds. The count in each query is ~5000. For the same query with a ~1M count it takes 30 seconds.

Redshift performance for simple time series data

Answers (1)

Related Questions