I am trying to design a system that has time-series data. Here are the requirements of the system: Data comes into the system anytime of the day from multiple cities and departments. These individual records should be available for life-time. (optional) Most of the queries are aggregations. However, they happen on multiple columns. Example queries: Sum of all measurements on a particular day in a particular city Sum of all measurements in a (last) month in a department Sum of all measurements in a (last) week in a city for a department Sum of all measurements on a particular day in a state (State will also be part of the record) All queries should be computed with a very low latency (< 300-500 ms) Here are the constraints I have: I went through multiple time-series databases. From my understanding all of them require expensive hardware. I would like to know if there is a way we can run it on commodity hardware. I am not sure how much data will be available at the beginning. Ideally it would be around 100 MB at peak(per day). So, I am not willing to spend hundreds of dollars at the beginning. Another idea I have been thinking about is to have an Amazon RDS instance and have multiple indexes on various columns and aggregate them as needed. I am not sure if this is a good idea though.

databasetime-seriesamazon-rdsdruidriak-ts

emkay

Reputation: 169

Timeseries database with multi-column aggregation

I am trying to design a system that has time-series data. Here are the requirements of the system:

Data comes into the system anytime of the day from multiple cities and departments.
These individual records should be available for life-time. (optional)
Most of the queries are aggregations. However, they happen on multiple columns. Example queries:
- Sum of all measurements on a particular day in a particular city
- Sum of all measurements in a (last) month in a department
- Sum of all measurements in a (last) week in a city for a department
- Sum of all measurements on a particular day in a state (State will also be part of the record)
All queries should be computed with a very low latency (< 300-500 ms)

Here are the constraints I have:

I went through multiple time-series databases. From my understanding all of them require expensive hardware. I would like to know if there is a way we can run it on commodity hardware.
I am not sure how much data will be available at the beginning. Ideally it would be around 100 MB at peak(per day). So, I am not willing to spend hundreds of dollars at the beginning.
Another idea I have been thinking about is to have an Amazon RDS instance and have multiple indexes on various columns and aggregate them as needed. I am not sure if this is a good idea though.

Upvotes: -1

Answers (2)

Sergei Rodionov

Reputation: 4549

The aggregations you outlined are rather standard. It would be hard to find a database that wouldn't support them.

You could try Axibase TSD. It can efficiently calculate calendar aggregates (e.g. daily totals) with multiple dimensions in the user-defined timezone. This would be useful if the metrics you're collecting are driven by end-user activity:

SELECT date_format(time, "yyyy-MMM-dd", "US/Eastern"), 
  entity AS 'city', SUM(value)
FROM "email.active_sessions"
  WHERE datetime >= current_year
  AND entity.tags.state = 'PA'
GROUP BY entity, PERIOD(1 DAY, "US/Eastern")

The intervals can be conveniently specified with extended keywords/functions. Here's how '(last) week' condition looks like:

WHERE datetime >= previous_week AND datetime < current_week

Refer to SQL docs for additional examples.

ATSD runs on most Linux distributions and can be executed in both stand-along and distributed modes.

Disclosure: I work for Axibase.

Upvotes: 0

Evgeny Lazin

Reputation: 9423

You can try Akumuli (warning: I'm the author). Akumuli can perform aggregations pretty quickly (sub millisecond) because it's a column-oriented database and it precomputes some aggregates for you. And it can perform all the queries that you need, e.g. if your series looks like this:

${measurement_name} city=${city_name} dept=${department_name} state=${state_name}

And the actual data looks like this (e.g this is a power consumption data measured in watts):

W city=Jersey_City state=New-Jersey dept=1
W city=Jersey_City state=New-Jersey dept=2
W city=Paris state=Iowa dept=1
W city=Paris state=Iowa dept=2
W city=Texas state=Texas dept=1
W city=Texas state=Texas dept=3
W city=Paris state=Arkansas dept=1
W city=Paris state=Arkansas dept=2

You can aggregate all data with dept=1 and dept=2 tags using this query:

{
    "aggregate": { "W": "sum" },
    "range": { "from": "20170501T000000",
               "to": "20170502T000000" },
    "group-by": [ "dept" ],
    "where": { "dept": [1, 2] }
}

You will get the sum for each dept but only departments 1 and 2 will be included.

You can split by the city instead:

{
    "aggregate": { "W": "sum" },
    "range": { "from": "20170501T000000",
               "to": "20170502T000000" },
    "group-by": [ "city" ],
}

You will get sum for every city name (all departments in one city will be joined). Of cause, you can filter by tag using where clause.

If you have several cities with the same name (e.g. Paris in Iowa and Arkansas) you can group values by city and state:

{
    "aggregate": { "W": "sum" },
    "range": { "from": "20170501T000000",
               "to": "20170502T000000" },
    "group-by": [ "city", "state" ],
}

You will get one value for Paris Arkansas and another for Paris Iowa.

Upvotes: 1

Timeseries database with multi-column aggregation

Answers (2)

Related Questions