Reputation: 41
I’m not sure what would be the best way to arrange my data in a star schema structure with Postgres.
Here are my tables:
1) pages - holds data on all page visits (every page a visitor clicked on during his visit to the website) 2) sessions - holds data on sessions (e.g each time visitor has visited the website) 3) countries - list of the counties 4) session_users - list of the users and attributes 5) daily aggregated table based on the tables above
thanks
pages user_id session_id visitor_id session_country_id page website_type visitor_type seniority page_number end_date
sessions user_id session_id visitor_id visit_country_id operating_system_id days_since_first_session session_start_date session_end_date is_returning
session_user user_id signup_date account_user_id purchase_date
countries country_id country_name
session_user user_id signup_date account_user_id purchase_date
daily aggregated table date num_of_visitor num_of_users num_of_sessions avg_num_of_pages advanced pricing
Upvotes: 2
Views: 7236
Reputation: 3118
The best way depends on many things that we do not know, but you hopefully do (or should find out!) including the amount of data, how often it gets updated, what systems or people interact with the database, how you and your users will be using whatever your solution is and above all what you're trying to achieve with your data warehouse.
Here are a couple of places to start research:
https://en.wikipedia.org/wiki/Star_schema
Very briefly, a star schema has one or more fact tables with lots of records, and more dimension tables with fewer records. The fact tables contain the things you're measuring or counting, and the dimensions contain characteristics that you want to use to summarise the data. The primary key in each dimension table is related to a key in the fact table - each combination of dimension keys in the fact table is unique.
So, to start with, what are the facts you're measuring? Revenue? Hits? or what? Then, what are your dimensions and what characteristics do you have per dimension? A good place to get answers to that is by thinking about the business problem you're solving. Once you have those answers the table design follows. It's well worth reading a couple of books on data warehouse design to get some hints on e.g. modelling date and time dimensions.
In terms of Postgresql, your table creation queries could look something like this:
create table dimension1 (
dimension1_key serial primary key,
attribute1 text not null default 'unknown',
attribute2 text not null default 'unknown');
create table dimension2 (
dimension2_key serial primary key,
attribute1 text not null default 'unknown',
attribute2 text not null default 'unknown');
create table fact1 (
dimension1_key integer references dimension1,
dimension2_key integer references dimension2,
fact integer,
constraint pk primary key (dimension1_key,dimension2_key)
)
You'll usually put indexes on the columns you'll summarise:
create index dimension2_attribute1 on dimension2(attribute1);
I don't think it's possible or even desirable to give you a more complete answer than this. I think it's important that you understand warehouse design for yourself, for which you're going to have to do some reading.
Upvotes: 5