Dan Ciborowski - MSFT
Dan Ciborowski - MSFT

Reputation: 7207

Amazon RedShift: Unique Column not being honored

I use the following query to create my table.

create table t1 (url varchar(250) unique);

Then I insert about 500 urls, twice. I am expecting that the second time I had the URLs that no new entries show up in my table, but instead my count value doubles for:

select count(*) from t1;

What I want is that when I try and add a url that is already in my table, it is skipped. Have I declared something in my table deceleration incorrect?

I am using RedShift from AWS.

Sample

urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
          url          | wascrawled | source | date_crawled 
-----------------------+------------+--------+--------------
 http://www.google.com |          0 |      1 | 
(1 row)

urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
          url          | wascrawled | source | date_crawled 
-----------------------+------------+--------+--------------
 http://www.google.com |          0 |      1 | 
 http://www.google.com |          0 |      1 | 
(2 rows)

Output of \d seed

urlenrich=# \d seed

                  Table "public.seed"
    Column    |            Type             | Modifiers 
--------------+-----------------------------+-----------
 url          | character varying(250)      | 
 wascrawled   | integer                     | default 0
 source       | integer                     | not null
 date_crawled | timestamp without time zone | 
Indexes:
    "seed_url_key" UNIQUE, btree (url)

Upvotes: 5

Views: 13328

Answers (5)

Ricardo Edo
Ricardo Edo

Reputation: 61

In redshift, constraints are recommended but doesn't take effect, constraints will just help to the query planner to select better ways to perform the query.

Usually, columnar databases do not manage indexes or constraints.

Upvotes: 4

Dave Poole
Dave Poole

Reputation: 1068

Primary and unique key enforcement in distributed systems, never mind column store systems, is difficult. Both RedShift (Paracel) and Vertica face the same problems.

The challenge with a column store is that the question that is being asked is "does this table row have a relevant entry in another table row" but column stores are not designed for row operations.

In HP Vertica there is an explicit command to report on constraint violations. In Redshift it appears that you have to roll your own.

    SELECT COUNT(*) AS TotalRecords, COUNT(DISTINCT {your PK_Column}) AS UniqueRecords
    FROM {Your table}
    HAVING COUNT(*)> COUNT(DISTINCT {your PK_Column}) 

Obviously, if you have a multi-column PK you have to do something more heavyweight.

SELECT COUNT(*)
FROM (
    SELECT {PkColumns}
    FROM {Your Table}
    GROUP BY {PKColumns}
    HAVING COUNT(*)>1

) AS DT

If the above returns a value greater than zero then you have a primary key violation.

Upvotes: 1

William King
William King

Reputation: 1268

For anyone who:

  • Needs to use redshift
  • Wants unique inserts in a single query
  • Doesn't care too much about query performance
  • Only really cares about inserting a single unique value at a time

Here's an easy way to get it done

INSERT INTO MY_TABLE (MY_COLUMNS) 
    SELECT MY_UNIQUE_VALUE WHERE MY_UNIQUE_VALUE NOT IN (
        SELECT MY_UNIQUE_VALUE FROM MY_TABLE 
        WHERE MY_UNIQUE_COLUMN = MY_UNIQUE_VALUE
    )

Upvotes: -1

Dan Ciborowski - MSFT
Dan Ciborowski - MSFT

Reputation: 7207

Figured out the problem

Amazon RedShift does not enforce constraints...

As explained here http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html

They said they may get around to changing it at some point.

NEW 11/21/2013 RDS has added support for PostGres, if you need unique and such an postgres rds instance is now the best way to go.

Upvotes: 12

Masashi M
Masashi M

Reputation: 2757

Although Amazon Redshift doesn't support unique constraints, there are some ways to delete duplicated records that can be helpful. See the following link for the details.

copy data from Amazon s3 to Red Shift and avoid duplicate rows

Upvotes: 1

Related Questions