Reputation: 23
We have a MIS where stores all the information about Customers, Accounts, Transactions and etc. We are building a data warehouse with BigQuery.
I am pretty new on this topic, Should we
1. everyday extract ALL the customer's latest information and append them to a BigQuery table with timestamp,
2. or we only extract the updated customer's information on that day?
First solution uses a lot of storage and takes time to upload data, and got lots of duplicates. But it's very clear for me to run query. For 2nd solution, given a specific date how can I get the latest record for that day?
Similar for Account data, an example of simplified Account table, only 4 fields here.
AccountId, CustomerId, AccountBalance, Date
If I need to build a report or graphic of a group of customers' AccountBalance everyday, I need to know the balance of each account on every specific date. So should I extract each account record everyday, even it's the same as last day, or I can only extract the account when the balance changed?
What is the best solution or your suggestion? I prefer the 2nd one because there are no duplicates, but how can I construct the query in BigQuery, will performance be an issue?
What else should I consider? Any recommendation for me to read?
Upvotes: 2
Views: 1421
Reputation: 3172
When designing DWH you need to start from business questions, translate them to KPIs, measures, dimensions, etc. When you have those in place... you chose technology based on some of the following questions (and many more): who are your users? in what frequency and what resolutions they consume the data? what are your data sources? are they structured? what are the data volumes? what is your data quality? how often your data structure changes? etc.
when choosing technology you need to think of the following: ETL, DB, Scheduling, Backup, UI, Permissions management, etc.
after you have all those defined... data schema design is pretty straight forward and is derived from "The purpose of the DWH" and your technology limits.
You have pointed out some of the points to consider, but the answer is based of your needs... and is not related to specific DB technology.
I am afraid your question is too general to be answered without deep understanding of your needs.
Referring to your comment bellow: How reliable is your source data? Are you interested in the analyzing trends or just snapshots? Does your source system allow "Select all" operations? what are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)? If you just need a daily snapshot of the current balance, and there are no limits by your source system,
it would be much simpler to run a daily snapshot.
this way you don't need to manage "increments", handle data integrity issues and systems discrepancies etc. however, this approach might have undesired impact on your source system, and your network costs... If you do have resources limits, and you chose the incremental ETL approach, you can either
create a "Changes log" table and query it, you can use row_number() in order to find latest record per account.
each approach has its own aspect of simplicity, costs, and resource consumption... Hope this helps
Upvotes: 1