Mike Neumegen

Reputation: 2486

What's the best way to store event data in Redshift?

I'm new to Redshift and am looking at the best way to store event data. The data consists of an identifier, a timestamp, and JSON metadata about the current state.

I'm considering three approaches:

  1. Create a table for each event type with a column for each piece of data.
  2. Create a single table for events and store metadata as a JSON field.
  3. Create a single table with a column for every possible piece of data I might want to store.

The advantage of #1 is that I can filter on any data field and the solution is more flexible. The disadvantage is that every time I want to add a new event type I have to create a new table.

The advantage of #2 is that I can put all types of events into a single table. The disadvantage is that filtering on anything in the metadata requires running a JSON function on every row.

The advantage of #3 is that I can access all the fields without calling a function and don't have to create a new table for each event type. The disadvantage is that whoever uses the data needs to remember which columns to ignore.

Is one of these ways better than the others, or am I missing something entirely?

Upvotes: 2

Views: 1713

Answers (1)

AlexYes

Reputation: 4208

This is a classic dilemma. After weighing the options for a while, my company ended up keeping the common properties of events in separate columns and the unique properties in a JSON field. Examples of common properties:

  • event type, timestamp (every event has them)
  • URL (missing for backend and mobile app events but present for all frontend events, and worth keeping in a separate column)
  • client properties: device, browser, OS (missing for backend events but present for mobile app and frontend events)

Examples of unique properties (properties that appear in no other event type):

  • test name and variant in an A/B test event
  • product name or ID in a purchase event
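To make the split concrete, here is a minimal Python sketch of how events could be separated into common columns and a JSON payload before loading into Redshift. The column names and the `split_event` helper are illustrative assumptions, not part of the original answer; adapt them to your own schema.

```python
import json

# Assumed common columns for illustration; adjust to your schema.
COMMON_KEYS = ("event_type", "timestamp", "url", "device", "browser", "os")

def split_event(event):
    """Split a raw event dict into (common columns, JSON metadata).

    Common properties become regular columns (None when absent);
    everything else goes into a compact JSON string for the metadata
    column.
    """
    columns = {k: event.get(k) for k in COMMON_KEYS}
    metadata = {k: v for k, v in event.items() if k not in COMMON_KEYS}
    return columns, json.dumps(metadata, separators=(",", ":"))

cols, payload = split_event({
    "event_type": "purchase",
    "timestamp": "2021-01-01T00:00:00Z",
    "product_id": 42,
})
# cols has every common column (url, device, ... are None here),
# payload holds only the purchase-specific property as JSON.
```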

Where to draw the line between common and unique properties is a judgement call, based on how many events share a property and how often it will be used in analytics queries to filter or group data. If a property is just "nice to have" and is not involved in regular analysis use cases (yes, we all love to store anything trackable just in case), the burden of maintaining a separate column is overkill.

Also, if you have some unique property that you use extensively in queries, there is a hacky way to optimize. You can place this property at the beginning of your JSON column (a Python dict doesn't guarantee key order in every version, but in Redshift the JSON is just a string, so you can fix the order of keys when you serialize) and use LIKE with a wildcard only at the end of the pattern:

select * 
from event_table
where event_type='Start experiment'
and event_json like '{"test_name":"my_awesome_test"%'  -- instead of below
-- and json_extract_path_text(event_json,'test_name')='my_awesome_test'

LIKE used this way is much faster than a JSON lookup (2-3x faster in our case) because for every row it doesn't have to decode the JSON, find the key, and check the value; it just checks whether the string starts with a given prefix, which is a much cheaper operation.
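The prefix trick only works if the producer writes the JSON with a fixed key order. A minimal Python sketch, assuming the events are serialized in Python before loading (the `serialize_event_json` helper is hypothetical); note the compact separators, so the stored string matches a LIKE pattern with no spaces:

```python
import json

def serialize_event_json(test_name, variant):
    # Put the hot key first: since Python 3.7, dicts preserve insertion
    # order and json.dumps keeps it. Compact separators make the stored
    # string match a pattern like '{"test_name":"my_awesome_test"%'.
    payload = {"test_name": test_name, "variant": variant}
    return json.dumps(payload, separators=(",", ":"))

s = serialize_event_json("my_awesome_test", "B")
# s == '{"test_name":"my_awesome_test","variant":"B"}'
```

Because `test_name` is always first and there is no whitespace, the `event_json like '{"test_name":"my_awesome_test"%'` predicate in the query above matches exactly the rows where a JSON extraction would.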

Upvotes: 5
