Ronyis

Reputation: 1953

Redshift performance for simple time series data

I am using an AWS Redshift table that holds information about invocations of functions. Each row has a date (of timestamp type), a UID (varchar), and several fields such as duration and error code. The table holds ~25 million rows covering ~1000 different functions (each with a distinct UID).

My problem is that simple queries, such as a count of invocations of several functions within a time window, take a long time - usually 5-30 seconds.

I have tried different combinations of sort keys and dist keys, but the performance remains quite similar.

I have run VACUUM and ANALYZE on the table. I have also tried adding/removing column compression.

I am using only a single dc2.large node.

EDIT:

The table DDL is:

create table public."invocations_metrics_$mig"(
"function_uid" varchar(256)  NOT NULL encode RAW DISTKEY
,"date" timestamp    encode zstd 
,"duration" double precision   encode zstd 
,"used_memory" integer   encode zstd 
,"error" smallint   encode zstd 
,"has_early_exit" boolean   encode zstd 
,"request_id" varchar(256)   encode zstd 
)
 SORTKEY(date,function_uid);

An example of a row:

"aca500c9-27cc-47f8-a98f-ef71cbc7c0ef","2018-08-15 13:43:28.718",0.17,27,0,false,"30ee84e1-a091-11e8-ba47-b110721c41bc"

The query:

SELECT
    count(invocations_metrics_backup.function_uid) AS invocations,
    max(invocations_metrics_backup.date) AS last_invocation,
    invocations_metrics_backup.function_uid AS uid
FROM
    invocations_metrics_backup
WHERE
    function_uid IN (
        <10 UIDs>
    )
    AND DATE >= '2018-08-20T10:55:20.222812'::TIMESTAMP
GROUP BY
    function_uid

The total time is 5 seconds, and the count in each query is ~5000 rows. The same query with a ~1M count takes 30 seconds.

Upvotes: 1

Views: 1021

Answers (1)

Joe Harris

Reputation: 14035

First, you need to use at least 2 nodes. A single node has to do double duty as leader and compute. With 2 or more nodes you get a free leader node.

Then, change your DDL as follows, removing compression on the sort key:

CREATE TABLE public."invocations_metrics_$mig" (
    "function_uid"   varchar(256) NOT NULL ENCODE ZSTD,
    "date"           timestamp             ENCODE RAW,
    "duration"       double precision      ENCODE ZSTD,
    "used_memory"    integer               ENCODE ZSTD,
    "error"          smallint              ENCODE ZSTD,
    "has_early_exit" boolean               ENCODE ZSTD,
    "request_id"     varchar(256)          ENCODE ZSTD
) 
DISTSTYLE KEY 
DISTKEY( function_uid )
SORTKEY ( date )
;

You may also improve performance by mapping each unique UID to an integer ID value and using that in your queries. UID values are quite inefficient to work with: they occur randomly, are relatively wide, and have very high entropy, which makes them expensive during sorts, hash aggregations, and hash joins.
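A sketch of that mapping might look like the following. Note that the dimension table, the `invocations_metrics_int` fact table, and the literal integer IDs are hypothetical names introduced for illustration, not part of your original schema:

```sql
-- Hypothetical dimension table mapping each wide UID to a compact integer ID.
CREATE TABLE public.function_dim (
    function_id  integer IDENTITY(1,1),
    function_uid varchar(256) NOT NULL
)
DISTSTYLE ALL;  -- ~1000 rows, so replicating it to every node is cheap

-- Populate it once from the distinct UIDs in the existing metrics table.
INSERT INTO public.function_dim (function_uid)
SELECT DISTINCT function_uid
FROM invocations_metrics_backup;

-- The fact table then stores the 4-byte function_id instead of the
-- 36-character UID, and the query filters and groups on the integer:
SELECT
    f.function_uid        AS uid,
    COUNT(*)              AS invocations,
    MAX(m.date)           AS last_invocation
FROM invocations_metrics_int m
JOIN public.function_dim f
  ON m.function_id = f.function_id
WHERE m.function_id IN (1, 2, 3)  -- integer IDs looked up for your 10 UIDs
  AND m.date >= '2018-08-20T10:55:20.222812'::timestamp
GROUP BY f.function_uid;
```

The aggregation and the IN-list comparison then operate on 4-byte integers rather than wide varchar values, and the UID only reappears via the small replicated dimension table.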

Upvotes: 1
