Star Schema Design for Social Media

Question

I am new to dimensional modeling and have read a lot of material (star-schema, dimension/fact tables, SCD, Ralph Kimball's - The Data Warehouse Toolkit book, etc). So I have a good conceptual understanding of dimensional modeling constructs but finding it hard to apply to a usecase due to lack of experience and need some guidance.

Consider Twitter for example, I want to design a dimensional model to calculate -

DAU (Daily active users) = number of users who logged in and accessed twitter via website or mobile app on a given day
MAU (Monthly active users) = number of users who logged in and accessed twitter via website or mobile app in last 30 days including measurement date
User engagement = total(clicks + favorites + replies + retweets) on a tweet

These metrics over a period (like month) is the summation of these metrics on each day in that period.

I want to write SQLs to calculate these metrics for every quarter by region (eg: US and rest of world) and calculate year-over-year growth (or decline) in these metrics.
Eg:

Here are some details that I thought about -

Factless (transaction) fact table for user login activity with grain of 1 row per user per login : user_login_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key)

Factless (transaction) fact table for user activity with grain of 1 row per user per activity : user_activity_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key, post_key, activity_type_key)

Does this sounds correct? How should my model look like? What other dimensions/facts can I add here?

Wonder if I should collapse these 2 tables into 1 and have activity_type for logins as 'login', but there can be a huge number of logins without any activity so this will skew the data. Am I missing anything else?

jmng · Accepted Answer

Your model seems correct, it answers the questions on the graph you posted.

It could make sense to aggregate those two fact tables into one fact table joined with a "UserAction" dimension, mostly because a login can be interpreted as just another user action.

However, having separate fact tables focused on one metric (or process) may be preferable because it enables you to introduce measures/metrics into the tables, i.e. when your fact tables stop being factless. It also spares you a join with another dimension (UserAction) but that is becoming a bit less relevant these days, where storage and DB processing power are just getting cheaper.

Star Schema Design for Social Media

Answers (2)

Related Questions