Chicago1988
Chicago1988

Reputation: 684

DW acceptable left join?

In a report I have the next join from a FACT table:

Join…

LEFT JOIN DimState AS s 
ON s.StateCode = l.Province  AND l.Locale LIKE (s.CountryCode + '%') 

More information: Fact table has 59,567,773 rows

L.Province can match a StateCode in DimState: 42,346,471 rows 71%

L.Province can’t match a StateCode in DimState: 13,742,966 rows 23% (most of them are a blank value in L.Province).

L.Province is NULL in 3,500,000 rows (6%)

4 questions: -The correct thing to do, would be to replace L.Province Nulls and blanks for “other”… And have an entry in DimState, with StateCode “other”, right?

-Is it acceptable to LEFT JOIN to a dimension? Or it should always be INNER JOIN?

-Is it correct to join to a dimension on 2 columns?

-To do a l.Locale = s.CountryCode… Should I modify the values in l.Locale or in s.CountryCode?

Upvotes: 0

Views: 2120

Answers (3)

Wes H
Wes H

Reputation: 4439

To build on what mallan1121 said:

1:There are generally three different meanings for null/blank in data warehousing.

A. I don't know the value

B. The value is known and it is blank

C. The value does not apply.

Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.

2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and if you may use those views for non-cube reporting. The reason is that an inner join is a filtering join and the result set is filtered by all inner joined tables even if only a single column is returned.

SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA 
JOIN DimB --filters result
JOIN DimC --filters result

If you use a left join and you only want columns from the some of the tables, the other joins are ignored and those tables are never accessed.

SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA
LEFT JOIN DimB --ignored
LEFT JOIN DimC --ignored

This can speed up reporting querys run directly against the SQL database. However, you must make sure your ETL process enforces the integrity and that the results returned are identical whether inner or left joins are used.

4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.

Upvotes: 2

Marmite Bomber
Marmite Bomber

Reputation: 21085

  1. Do not do this from one simple reason. You will get 13M records with the key L.Province = 'Other' in you dimension table - each record from the fact table with s.StateCode = 'Other' will be joined with those 13M dimension records, leading to massive duplication of the measures.

The proper answer is enforce the primary key on your dimension. Typically a dimnsion have one record with the key other (meaning the key is not known) and possible one other recrod NA (the dimension has no meaning in for this fact record).

  1. The problem is not in the OUTER join- what should be enforced by design is that all foreign key in the fact table are defined in the dimension table.

    One step to achieve this is the definition of NA and Other as decribed in 1.

The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.

  1. Again each dimension should have defined a PRIMARY KEY - if the PK consist of two columns - the join on those columns is fine. (Typical scenario in DWh though is a single column numeric PK).

What should be avioded is join on LIKEor SUBSTR - this points that the dimension PK is not well defined.

  1. If your dimension has a two column PK Locale + province the fact table must be updated to contain this two column as a FK.

Upvotes: 0

mallan1121
mallan1121

Reputation: 489

In order of your four questions:

  1. Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.

  2. Therefore, building off 1, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of anything of the top of my head. You should not have to with properly designed fact and dimension tables.

  3. Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.

  4. Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.

Upvotes: 2

Related Questions