Reputation: 38682
I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. The plan is to use Postgres, but the question applies equally to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall then has statistics calculated for it through a background job system. For instance, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)

These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure which storage pattern for this type of calculated data will be the most efficient. I'm not looking to fully normalize everything in the database, though. So far, I have come up with the following options:
Option 1: I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value is serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:

statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)

I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics by phone call and statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient when handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
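(For illustration, a minimal sketch of what SQL-level computation could look like under this layout, assuming the text column is cast to jsonb and a scalar statistic is serialized as a bare JSON number; table and column names are the ones from the layout above.)

-- Hypothetical query against the Option 1 layout: average call duration.
-- #>> '{}' extracts the top-level JSON scalar as text, which is then cast.
select avg((statistic_value::jsonb #>> '{}')::float) as avg_call_duration
from statistics
where statistic_name = 'call_duration'
  and statistic_version = 1;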
Option 2: I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:

id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly

This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easy, since the names are baked into the schema.
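(A minimal sketch of this layout; the column types are assumptions based on the statistic descriptions above.)

-- Option 2 as a single wide table: one typed column per statistic.
create table phone_calls (
    id                   bigint primary key,
    call_duration        float,
    setup_time           float,
    dropouts             float[],  -- dropout offsets in seconds
    hung_up_unexpectedly boolean
);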
Option 3: This would probably be the most complex. I store only a reference to the statistic type, and the value column to read is looked up according to that:

statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string, etc.)
statistic_version (integer)
created_at (datetime)

This would mean that the table is going to be very sparse, as only one of the statistic_value_* columns would be populated per row. Could that lead to performance issues?
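(For reference, a sketch of this layout in Postgres. As far as I understand, Postgres tracks NULLs in a per-row null bitmap, so the unused value columns should cost very little on disk, but I am unsure about the query-performance side.)

-- A sketch of the Option 3 table; names are from the list above.
-- Only one statistic_value_* column is non-NULL per row.
create table statistics (
    statistic_id            bigserial primary key,
    phone_call_id           bigint not null,
    statistic_name          text not null,
    statistic_value_bool    boolean,
    statistic_value_string  text,
    statistic_value_float   float,
    statistic_value_complex jsonb,
    statistic_value_type    text not null,  -- 'bool', 'string', 'float', 'complex'
    statistic_version       integer not null,
    created_at              timestamp not null default now()
);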
Option 4: Trying to normalize option 3, I would create several tables:

statistics
    id (PK)
    version
    created_at

statistic_mapping
    phone_call_id (FK)
    statistic_id (FK)

statistic_type_mapping
    statistic_id (FK)
    type (string, indicates bool, string, etc.)

statistic_values_boolean
    statistic_id (FK)
    value (bool)

But this isn't going anywhere, since I can't dynamically join to another table name, can I? Or should I then just join to all statistic_values_* tables based on the statistic ID? My application would then have to make sure that no duplicate entries exist.
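(To illustrate the join-to-all variant: a sketch that left-joins every value table and coalesces the results. The float and string value tables are hypothetical here, following the naming of statistic_values_boolean above.)

-- Joining all value tables at once; at most one LEFT JOIN matches per
-- statistic, which the application has to guarantee.
select
    s.id,
    s.version,
    tm.type,
    coalesce(vb.value::text, vf.value::text, vs.value) as value_as_text
from statistics s
join statistic_type_mapping tm on tm.statistic_id = s.id
left join statistic_values_boolean vb on vb.statistic_id = s.id
left join statistic_values_float   vf on vf.statistic_id = s.id
left join statistic_values_string  vs on vs.statistic_id = s.id;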
To summarize: given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when the requirements are that statistic types may be added or changed, that several versions may exist at the same time, and that querying the values should be reasonably efficient?
Upvotes: 2
Views: 206
Reputation: 30329
IMO you can use the following simple database structure to solve your problem.
A very simple table, holding just the name and description of each stat. type:
create table stat_types (
    type text not null constraint stat_types_pkey primary key,
    description text
);
(You can replace it with enum if you have a finite number of elements)
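(For instance, a minimal sketch of the enum alternative; the labels are placeholders following the statistics from the question.)

-- Enum alternative to the lookup table, for a fixed set of stat types.
create type stat_type as enum ('CALL_DURATION', 'SETUP_TIME', 'DROPOUTS', 'UNEXPECTED_HANGUP');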
The main table contains an FK to the object, an FK to the stat. type (or just the enum) and, this is important, a jsonb field with arbitrary stat. data related to its type. For example, such a table for phone calls:
create table phone_calls_statistics (
    phone_call_id uuid not null references phone_calls,
    stat_type text not null references stat_types,
    data jsonb,
    constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that the phone_calls table has a uuid PK:
create table phone_calls (
    id uuid not null constraint phone_calls_pkey primary key
    -- ...
);
The data field has a different structure depending on its stat. type. Example for call duration:
{
    "call_duration": 120.0
}
or for dropouts:
{
    "dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
    ('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
    ('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
    ('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
    ('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
    ('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
    ('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
    ('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
    ('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
    avg((pcs.data ->> 'call_duration')::float) as avg,
    min((pcs.data ->> 'call_duration')::float) as min,
    max((pcs.data ->> 'call_duration')::float) as max
from
    phone_calls_statistics pcs
where
    pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hang-ups:
select
    sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
    phone_calls_statistics pcs
where
    pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, has good performance potential, and scales well. The main table has a simple primary-key index, and all of the queries above run against that one table. You can always extend the set of stat. types and their calculations.
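(One possible refinement, sketched here as a suggestion beyond the schema itself: if an aggregate like the call-duration query becomes hot, Postgres supports partial expression indexes on the extracted value.)

-- A sketch: a partial expression index for numeric aggregates over one
-- stat type; Postgres can use it for the CALL_DURATION query above.
create index phone_calls_statistics_call_duration_idx
    on phone_calls_statistics (((data ->> 'call_duration')::float))
    where stat_type = 'CALL_DURATION';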
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0
Upvotes: 2