How to handle a dimension table with duplicate ids containing slightly different values in data warehouse?

I'm building out the data warehouse at my company and I've encountered a situation where I am pulling in data with slight variations in name but tied to the same ID. This is obviously a problem because my dimension table should only have one record per ID

for example:

+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAA              |
+------+-------------------+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXA              |
+------+-------------------+
|  197 | XXXB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

As you can see, the ID field should be tied to one unique value but there are strings that have slight variations but tied to the same ID. One thought was to normalize the strings but we would lose some of the metadata. Additionally, I should note that we are using Redshift which is why the unique id constraint is not being enforced. What would be the best solution to this issue?

Upvotes: 0

Answers (3)

user4632558

Reputation:

How about adding a new column in your dimension table and filling it with UUID value? The UUID column acts as a primary key.

In addition, that is the way I keep track of historical data. I think your problem is that someone had modified the records in the source table through time. By using UUID as pkey, we don't need to override the record, so we can keep versioning through time.

Upvotes: 0

0xPeter

Reputation: 596

I don't know if this is the most optimal solution, but it is the solution we chose to accept for our situation. Essentially, I perform a self join on the dimension table and add a column which selects the shortest string per ID. Since typically, the variations between each value is due to data that is appended to the base string, the shortest string should return the base string which is the most important part of the field we are looking for.

Here is the sql code I wrote to perform this:

create table tmp_dim_offers as (
  -- create subquery
  with normalized_dim_offers as (
      select
        t1.id_offer,
        t1.dim_offer_name,
        min(t2.dim_offer_name) as normalized_offer_name
      from dim_offers as t1
        join dim_offers as t2 on t1.id_offer = t2.id_offer
      group by 1, 2
      order by t1.id_offer
  )
  -- select distinct ids and normalized offer name
  select distinct
    normalized_dim_offers.id_offer              as id_offer,
    normalized_dim_offers.normalized_offer_name as dim_offer_name
  from normalized_dim_offers
  order by normalized_dim_offers.id_offer
);

-- drop existing dim_offers table and replace with new normalized table
begin;
alter table dim_offers rename to dim_offers_to_delete;
alter table tmp_dim_offers rename to dim_offers;
drop table dim_offers_to_delete cascade;
commit;

Upvotes: 0

demircioglu

Reputation: 3465

Keep the latest name in the dimension table and create a secondary table for "history" just in case you need the other names in the future. I had a similar situation with a user dimension and implemented the way I described it. You can choose a rule to decide which one is going to be in the dimension table

With your example, the two tables will look like this

dim table
+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

dim_hist table

+======+========+=================+
|  id  | dim_id |    name         |
+======+======+===================+
|  101 | 185  | AAAA              |
+------+------+-------------------+
|  102 | 197  | XXXA              |
+------+------+-------------------+
|  103 | 197  | XXXB              |
+------+------+-------------------+

Using id from dim table you can join two tables and access other names

Upvotes: 2

How to handle a dimension table with duplicate ids containing slightly different values in data warehouse?

Answers (3)

Related Questions