Redshift Copy and auto-increment does not work

Question

I am using the COPY command from redshift to copy json data from S3. The table definition is as follows:

CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;

The command for copy i am using is as follows:

COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ;

I am expecting that any new id inserted will always be > select max(id) from my_raw . In fact it's clearly not the case.

If I issue the above copy command twice, the first time the ids start from 1 to N although that file is creating 114 records(that's a known issue with redshift when it has multiple shards). The second time the ids are also between 1 and N but it took free numbers that were not used in the first copy.

See below for a demo:

usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO:  Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# 
usagedb=# select max(id) from my_raw;
 max  
------
 4556
(1 row)

usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO:  Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
 max  
------
 4556
(1 row)

Thx in advance

isaac.hazan · Accepted Answer

The only solution i found to make sure have sequential Ids based on the insertion is to maintain a pair of tables. The first one is the stage table in which the items are inserted by the COPY command. The stage table will actually not have an ID column.

Then I have another table that is the exact replica of the stage table except that it has an additional column for the Ids. Then there is a job that takes care of filling the master table from the stage using the ROW_NUMBER() function.

In practice, this means executing the following statement after each Redshift COPY is performed:

  insert into master
      (id,result_code,ct_timestamp,...)
  select
      #{startIncrement}+row_number() over(order by ct_timestamp) as id,
      result_code,...
  from stage;

Then the Ids are guaranteed to be sequential/consecutives in the master table.

Redshift Copy and auto-increment does not work

Answers (2)

Related Questions