Reputation: 27
For a university group project, we are building a data warehouse using SQL Server and Visual Studio. We are currently in the staging area and want to fill the tables with data.
However, we noticed that in our fact sales table the sum of the price + the sum of the freight value is different from the payment_value as well as the values differ from the csv file. That is why we tried to additionally check the box “Remove rows with duplicate sort values” in the Sort 1 (see files). Without checking that option we end up with
117.216 rows in the end and with checking it, we end up with 102.727 rows. Yet, there is still a variation between all of the values. For a better overview and understanding, please refer to the following image:
We found out that the duplications are due that there could be more than one payment per order, which means there is is a model design issue. Payment should actually be another fact table, separate from sales.
We were trying to create a fact table for payment, however we are not sure how to do it, since we thought it is not possible to have two fact tables connected to each other. Could you maybe help us out here, because Payments should be connected to Sales, however Sales is a fact table, so we are not sure how to proceed.
Another problem, we our facing is that if Payment will be a fact table, we are losing a dimension. In our guidelines we are required to have 5 dimensions and we would not know what other dimension to create (could also be made up). If you have any suggestions on how to solve that correlating problem, we would be more than happy.
For a better understanding and insight, we are providing you with our SQL scripts, flat files as well as visual studio data flow: Download files
Any help is much appreciated! Thank you very much in advance! :)
Upvotes: 0
Views: 201
Reputation: 10875
You can still have payments as a dimension to stay within the constraints of the project. One thing you can do to deal with the many to many relationship between orders and payments is what is called a table, so your schema would be like this:
CREATE DATABASE [OLIST_STORE_STG]
GO
USE [OLIST_STORE_STG]
GO
/* the dimension table for Customers: stg_dim_customer */
CREATE TABLE Stg_Dim_Customer (
BK_Customer_unique NVARCHAR (50) PRIMARY KEY,
Customer_zip_code_prefix INT NOT NULL,
Customer_city NVARCHAR(50) NOT NULL,
Customer_state NVARCHAR(50) NOT NULL,
);
/* Object: The dimension table for products: Stg_Dim_Product */
CREATE TABLE Stg_Dim_Product (
BK_Product NVARCHAR(50) PRIMARY KEY,
Product_category_name_english NVARCHAR(50) NOT NULL,
Product_photos_qty INT NULL,
Product_name_lenght INT NULL,
Product_description_lenght INT NULL,
Product_weight_g INT NULL,
Product_length_cm INT NULL,
Product_height_cm INT NULL,
Product_width_cm INT NULL,
);
/* Object: The dimension table for sellers: Stg_Dim_Sellers */
CREATE TABLE Stg_Dim_Sellers(
BK_Sellers NVARCHAR (50) PRIMARY KEY,
Sellers_zip_code_prefix INT NOT NULL,
Sellers_city NVARCHAR (50) NOT NULL,
Sellers_state NVARCHAR (50) NOT NULL,
);
/* The dimension table for sellers: Stg_Dim_Payment */
CREATE TABLE Stg_Dim_Payment(
BK_Payment INT PRIMARY KEY,
Payment_sequential INT NOT NULL,
Payment_type NVARCHAR (50) NOT NULL,
Payment_installments INT NOT NULL
);
/* The dimension table for Date: Stg_Dim_Date */
CREATE TABLE Stg_Dim_Date(
SK_Date INT PRIMARY KEY,
Full_date DATE NOT NULL,
Day_number INT NOT NULL,
Day_name NVARCHAR (20) NOT NULL,
Month_number INT NOT NULL,
Month_name NVARCHAR (50) NOT NULL,
Trimester_number INT NOT NULL,
Trimester_name NVARCHAR (50) NOT NULL,
Year INT NOT NULL
);
/* The Facts Table for Sales: Stg_Fact_Sales */
CREATE TABLE Stg_Fact_Sales (
FK_Date DATE ,
FK_Product NVARCHAR(50) ,
FK_Sellers NVARCHAR(50),
FK_Customer_unique NVARCHAR(50) ,
Delays INT NULL,
Price DECIMAL(18, 2) NOT NULL,
Freight_value DECIMAL(18, 2) NOT NULL,
Payment_value DECIMAL (18,2) NOT NULL,
Order_status NVARCHAR(50) NOT NULL,
Order_reference NVARCHAR (50) NOT NULL,
Order_item_reference INT NOT NULL,
CONSTRAINT pk_Fact_Sales PRIMARY KEY (
Order_reference ASC,
Order_item_reference ASC
)
);
/* The Facts Table for Reviews: Stg_Fact_Reviews */
CREATE TABLE Stg_Fact_Reviews (
FK_Date DATE NOT NULL,
FK_Customer_unique NVARCHAR(50) NOT NULL,
Average_review_score DECIMAL(18,2) NOT NULL,
CONSTRAINT pk_Fact_Reviews PRIMARY KEY (
FK_Date ASC,
FK_Customer_unique ASC
)
);
/* Bridge table to relate orders to payments */
CREATE TABLE Stg_Order_Payments (
FK_Order_reference NVARCHAR (50),
FK_Payment NVARCHAR (50)
);
Of course, this will need to be reflected in the SSIS package: remove the merge between payments and orders when generating fact_sales and add a new data flow from payments to stg_order_payments
Other Design Considerations:
Payments are most commonly held in a fact table. If you need more dimensions for the constraints of the project, you can for example make a Zip Codes table, so you don't have to repeat city and state in Customers and Sellers dimensions, but just a foreign key to the table Zip codes. Also Categories could be another dimension, suppose that in the future, there also is translation to French or any other language; it would be better to have the Product category in Portuguese and a category table with the Portuguese name and all the translations. Even better, you can create a numeric Category Code as a FK in product to the table Category.
Upvotes: 1