OverflowStack
OverflowStack

Reputation: 919

Storing semi-structured, key-value based data

I have IoT system where devices send data that doesn't strictly adhere to any specific schema, but there is always "telemetry" object.

{
    "deviceId": "SimulatedDevice",
    "telemetry": 
    {
        "temperature": 23.34,
        "status": 
        {
            "health": "ok"
        }
    }
}

Telemetry data gets flattened into simple key-value dictionary.

{
    "temperature": 23.34,
    "status.health": "ok"
}

The key is always string. The value is always string, numeric, boolean or null.

Question

What would be the best approach to store this sort of data?

Option 1

Use JSONB/JSON for value. You can store anything in value as long as it is valid JSON.

CREATE TABLE telemetry 
(
    time            TIMESTAMPTZ    NOT NULL,
    deviceId        TEXT           NOT NULL,
    key             TEXT           NOT NULL,
    value           JSONB
)

Option 2

Create field for every possible value type. If value is Numeric, I insert it into value_numeric column and make other columns NULL. If the value is NULL, then all columns are NULL.

CREATE TABLE telemetry 
(
    time            TIMESTAMPTZ    NOT NULL,
    deviceId        TEXT           NOT NULL,
    key             TEXT           NOT NULL,
    value_string    TEXT,
    value_numeric   NUMERIC,
    value_boolean   BOOLEAN
)

IoT data can grow very large, potentially hundreds of millions of rows. Which option, in long run, would yield the best performance?

-- Option 1
SELECT * FROM telemetry 
WHERE time >= '...' AND time < '...' AND deviceId = '...' AND value > '75'

-- Option 2
SELECT * FROM telemetry 
WHERE time >= '...' AND time < '...' AND deviceId = '...' AND value_numeric > 75

I imagine using JSONB/JSON column (option 1) adds more overhead, but maybe it is very negligible? Are there any general guidelines related to what I am trying to achieve?

Upvotes: 1

Views: 153

Answers (1)

noctarius
noctarius

Reputation: 6104

I wrote a best practices blog post about it a while ago. It explains the two formats you presented (narrow and medium, as well as wide), and gives pros and cons accordingly.

I think you might find that interesting: https://www.timescale.com/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/

Pure JSONB may bite you eventually due to the additional overhead of parsing / extracting the value, and when you add compression into the game, you need to make sure that all deciding factors (when selecting) are indexed, so you only decompress / read the JSONB column when really necessary.

I hope this helps :)

Upvotes: 2

Related Questions