How to create collision-less unique key in Data Warehouse?

We are integrating tables from a lot of different systems in a relational warehouse, before making it dimensional in a Data Mart. But right now, we are really focusing on the relational warehouse step. We are working with Snowflake Cloud Platform.

Data is comming from different tables and we decided to create a table for localization (translations) in our warehouse. All necessary translations will be in this table. If we only use the given ID for each entities, we might get id collision and non-unique ID.

As an example, we might have Countries and Companies in the same localization table sharing the same id which were unique in their respective tables, but not anymore in this joint table :

origin_table	language_id	name	id_in_origin_table
Country	100	Germany	101
Country	200	Allemagne	101
...	...	...	...
Company	100	Mecanic Shop	101
Company	200	Garage Mécanique	101

Obviously we can't use id_in_origin_table as a unique key, neither can we CONCAT(language_id, id_in_origin_table).

So we thought of adding a column which would try to be unique by hashing the origin table name + the value.

Simplified example : hash('Country', country.id)

We do it both in the origin table and localization table to be able to have a relationship. But as you may know, most hash functions produces alphanumeric values which are less efficient to join than only numeric values. We tried swaping hex value by their decimal counterpart, then doing an arithmetic operation to shrink the 128 bits CHAR value to 64 bits BIGINT. But we figure there would we quite high risks of collisions.

Really important for us : We want the keys to be numerical values only.

So the questions are:

Is there an efficient way to do generate numerical unique keys?
If it implies hashing, which algorithm should be favored?
Is there natively a function in Snowflake that you know of, that could help us in this scenario?
Does anyone have a similar problem, or are we overthinking / missing the whole picture somehow?

Thanks a lot! I appreciate you feedback!

Upvotes: 0

Answers (2)

David Clausen

Reputation: 21

Take a look at using sequences. If you use the same sequence across multiple tables you will have a unique ID. Depending on the details, this may eliminate the need to create a table to manage your keys.

https://docs.snowflake.com/en/user-guide/querying-sequences.html

Upvotes: 2

Marmite Bomber

Reputation: 21075

What you describe is not really a surrogate key. A surrogate key should remain unchanged if you alter a dimension attribute (e.g. name) which is not happening in your case.

A classical approach to maintain the surrogate keys is a mapping table that contain the natural key (table_name, lang Id and Id in your case) and the surrogate_id. The table is used for the transformation natural key -> surrogate key via lookup and you insert a record for each new natural key with a sequence assigned surrogate key.

This should be your benchmark, use an alternative only if you find a solution better than the above one.

Natural key hashing (which BTW does not produce a surrogate key, as per definition a surrogate key must be able to remain unchanged if the natural key changes) was propagated by Data Vault. But if you opt for Data Vault you should not care for such details as the hash format.

The argument, that you get an independence between the key definition in the dimension and the usage in a fact table is valid, but you have to pay a price for it.

Upvotes: 1

How to create collision-less unique key in Data Warehouse?

Answers (2)

Related Questions