isaac.hazan
isaac.hazan

Reputation: 3864

Redshift Copy and auto-increment does not work

I am using the COPY command from redshift to copy json data from S3. The table definition is as follows:

CREATE TABLE my_raw
(
id BIGINT IDENTITY(1,1),
...
...
) diststyle even;

The command for copy i am using is as follows:

COPY my_raw FROM 's3://dev-usage/my/2015-01-22/my-usage-1421928858909-15499f6cc977435b96e610298919db26' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXX' json 's3://bemole-usage/schemas/json_schema' ; 

I am expecting that any new id inserted will always be > select max(id) from my_raw . In fact it's clearly not the case.

If I issue the above copy command twice, the first time the ids start from 1 to N although that file is creating 114 records(that's a known issue with redshift when it has multiple shards). The second time the ids are also between 1 and N but it took free numbers that were not used in the first copy.

See below for a demo:

usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/json_schema' COMPUPDATE OFF;
INFO:  Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# 
usagedb=# select max(id) from my_raw;
 max  
------
 4556
(1 row)

usagedb=# COPY my_raw FROM 's3://bemole-usage/my/2015-01-22/my-usage-1421930213881-b8afbe07ab34401592841af5f7ddb31c' credentials 'aws_access_key_id=XXXX;aws_secret_access_key=XXXX' json 's3://bemole-usage/schemas/my_json_schema' COMPUPDATE OFF;
INFO:  Load into table 'my_raw' completed, 114 record(s) loaded successfully.
COPY
usagedb=# select max(id) from my_raw;
 max  
------
 4556
(1 row)

Thx in advance

Upvotes: 4

Views: 5123

Answers (2)

linqu
linqu

Reputation: 11970

I can't reproduce your problem, however it is interesting how you have identity columns set correctly in conjunction with copy. Here a small summary:

Be aware that you can specify the columns (and their order) for a copy command.

COPY my_table (col1, col2, col3) FROM s3://...

So if:

  • EXPLICIT_IDS flag is NOT set
  • no columns listed like shown above
  • and you csv does not contain data for the IDENTITY column

then the identity values in the table will be set automatically in monotonously as we all want it.

doc:

If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified, with IDENTITY columns omitted if EXPLICIT_IDS was also not specified.

Upvotes: 1

isaac.hazan
isaac.hazan

Reputation: 3864

The only solution i found to make sure have sequential Ids based on the insertion is to maintain a pair of tables. The first one is the stage table in which the items are inserted by the COPY command. The stage table will actually not have an ID column.

Then I have another table that is the exact replica of the stage table except that it has an additional column for the Ids. Then there is a job that takes care of filling the master table from the stage using the ROW_NUMBER() function.

In practice, this means executing the following statement after each Redshift COPY is performed:

  insert into master
      (id,result_code,ct_timestamp,...)
  select
      #{startIncrement}+row_number() over(order by ct_timestamp) as id,
      result_code,...
  from stage;

Then the Ids are guaranteed to be sequential/consecutives in the master table.

Upvotes: 1

Related Questions