Reputation: 596
I'm building out the data warehouse at my company and I've encountered a situation where I am pulling in data with slight variations in name but tied to the same ID. This is obviously a problem because my dimension table should only have one record per ID
for example:
+======+===================+
| id | name |
+======+===================+
| 185 | AAAA |
+------+-------------------+
| 185 | AAAB |
+------+-------------------+
| 197 | XXXA |
+------+-------------------+
| 197 | XXXB |
+------+-------------------+
| 197 | XXXC |
+------+-------------------+
As you can see, the ID field should be tied to one unique value but there are strings that have slight variations but tied to the same ID. One thought was to normalize the strings but we would lose some of the metadata. Additionally, I should note that we are using Redshift which is why the unique id constraint is not being enforced. What would be the best solution to this issue?
Upvotes: 0
Views: 2410
Reputation:
How about adding a new column in your dimension table and filling it with UUID value? The UUID column acts as a primary key.
In addition, that is the way I keep track of historical data. I think your problem is that someone had modified the records in the source table through time. By using UUID as pkey, we don't need to override the record, so we can keep versioning through time.
Upvotes: 0
Reputation: 596
I don't know if this is the most optimal solution, but it is the solution we chose to accept for our situation. Essentially, I perform a self join on the dimension table and add a column which selects the shortest string per ID. Since typically, the variations between each value is due to data that is appended to the base string, the shortest string should return the base string which is the most important part of the field we are looking for.
Here is the sql code I wrote to perform this:
create table tmp_dim_offers as (
-- create subquery
with normalized_dim_offers as (
select
t1.id_offer,
t1.dim_offer_name,
min(t2.dim_offer_name) as normalized_offer_name
from dim_offers as t1
join dim_offers as t2 on t1.id_offer = t2.id_offer
group by 1, 2
order by t1.id_offer
)
-- select distinct ids and normalized offer name
select distinct
normalized_dim_offers.id_offer as id_offer,
normalized_dim_offers.normalized_offer_name as dim_offer_name
from normalized_dim_offers
order by normalized_dim_offers.id_offer
);
-- drop existing dim_offers table and replace with new normalized table
begin;
alter table dim_offers rename to dim_offers_to_delete;
alter table tmp_dim_offers rename to dim_offers;
drop table dim_offers_to_delete cascade;
commit;
Upvotes: 0
Reputation: 3465
Keep the latest name in the dimension table and create a secondary table for "history" just in case you need the other names in the future. I had a similar situation with a user dimension and implemented the way I described it. You can choose a rule to decide which one is going to be in the dimension table
With your example, the two tables will look like this
dim table
+======+===================+
| id | name |
+======+===================+
| 185 | AAAB |
+------+-------------------+
| 197 | XXXC |
+------+-------------------+
dim_hist table
+======+========+=================+
| id | dim_id | name |
+======+======+===================+
| 101 | 185 | AAAA |
+------+------+-------------------+
| 102 | 197 | XXXA |
+------+------+-------------------+
| 103 | 197 | XXXB |
+------+------+-------------------+
Using id from dim table you can join two tables and access other names
Upvotes: 2