user12384512

Reputation: 3401

Simple approach for storing time series data

I have about 1 billion events daily. I need to store these events in the database for the last 30 days, so it's about 30 billion rows.

Let's say it is an athletes database; each row has only 4 columns (athlete name, athlete's discipline, athlete rank, date). I need to retrieve data only by athlete name and date, for example to build a graph of the last 30 days for a particular athlete.

  1. Initially I was using Google BigQuery. It is a great tool, extremely cheap, with daily sharding out of the box and linear scalability, but it has a few drawbacks. Querying a 3-billion-row table takes about 5 seconds, which is too much for my case. Also, when data is inserted it first appears in the "streaming buffer" and can't be queried for some time (about 5-10 minutes).

  2. Another approach is to use Postgres and store all the data in one table with proper indexes (a rough sketch of this single-table variant is shown below). I could also use daily sharding (create a new table automatically at the beginning of each day), but I have concerns about whether Postgres can handle billions of rows. Also, if I want to get historical data for the last 30 days, I have to make 30 SELECT queries when the data is sharded that way.
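
For reference, a minimal sketch of what the single-table variant of approach 2 could look like, accessed via psycopg2. The table name, column types, example athlete name and connection settings are placeholders of my own, not something fixed by the question:

import psycopg2

# Connection parameters are placeholders; adjust to your environment.
conn = psycopg2.connect(host="localhost", dbname="athletes", user="postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # One table for everything; the composite (name, ts) index covers
    # the "retrieve by athlete name and date" access path.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS athlete_rank (
            name       text        NOT NULL,
            discipline text        NOT NULL,
            rank       integer     NOT NULL,
            ts         timestamptz NOT NULL
        )
    """)
    cur.execute("""
        CREATE INDEX IF NOT EXISTS athlete_rank_name_ts_idx
            ON athlete_rank (name, ts)
    """)

    # Last 30 days for one athlete: a single index range scan,
    # rather than 30 separate queries against daily tables.
    cur.execute("""
        SELECT ts, rank
        FROM athlete_rank
        WHERE name = %s
          AND ts >= now() - interval '30 days'
        ORDER BY ts
    """, ("Usain Bolt",))
    rows = cur.fetchall()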

I don't want to bother with over-complicated solutions like Cassandra (though I have never tried it). I also don't think I will get any benefit from a column-oriented database, because I have only 4 columns.

I'm looking for something similar to BigQuery but without the drawbacks mentioned above. I think the data can be stored on a single node.

Upvotes: 1

Views: 927

Answers (1)

Evgeny Lazin

Reputation: 9413

The data can be stored using only one node. Actually, 1 billion rows per day is not much, only about 12K writes per second on average. For comparison, Akumuli can handle about 1.5 million inserts per second on an m4.xlarge AWS instance with SSD (almost half of that with an EBS volume with default settings, but you can provision more IOPS). To store 30B data points you will need less than 200GB of disk space (it depends on your data, but it's safe to assume a data point will take less than 5 bytes on disk).
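
A quick back-of-envelope check of those numbers (the 5 bytes per data point figure is just the estimate given above):

# Rough sizing for 1 billion events/day kept for 30 days.
events_per_day = 1_000_000_000
seconds_per_day = 24 * 60 * 60

writes_per_second = events_per_day / seconds_per_day  # ~11.6K writes/s on average
bytes_per_point = 5                                    # estimated on-disk size per point
disk_bytes = 30 * events_per_day * bytes_per_point     # ~150 GB for the 30-day window

print(f"{writes_per_second:,.0f} writes/s, {disk_bytes / 1e9:,.0f} GB on disk")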

The data model is simple in your case. The series name would look like this:

athlete_rank name=<Name> discipline=<Discipline>
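
Writes go to akumulid over TCP. Below is a minimal sketch assuming the default ingestion port (8282) and the RESP-style "series name, timestamp, value" wire format described in the Akumuli docs; the tag values are placeholders, and both the port and the exact format should be double-checked against the current documentation:

import socket

# Assumed default: akumulid accepting writes on TCP port 8282.
HOST, PORT = "127.0.0.1", 8282

# One data point: series name, timestamp and value, each sent as a
# RESP "+string" line terminated by \r\n.
point = (
    b"+athlete_rank name=Bolt discipline=sprint\r\n"
    b"+20170530T120000\r\n"
    b"+1\r\n"
)

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(point)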

You will be able to query the data by name:

{
  "select": "athlete_rank",
  "range": { "from": "20170501T000000",
             "to": "20170530T000000" },
  "where": { "name": <Name> }
}
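
That query would be POSTed to akumulid's HTTP API. A rough sketch, assuming the default query port (8181), the /api/query path and a placeholder athlete name (verify the endpoint against the docs):

import json
import requests

QUERY_URL = "http://127.0.0.1:8181/api/query"  # assumed default endpoint

query = {
    "select": "athlete_rank",
    "range": {"from": "20170501T000000", "to": "20170530T000000"},
    "where": {"name": "Bolt"},                 # placeholder name
}

# Print the raw response; results come back as plain text, line by line.
resp = requests.post(QUERY_URL, data=json.dumps(query))
for line in resp.text.splitlines():
    print(line)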

You shouldn't choose Akumuli if you have high cardinality (many unique series). It consumes about 12KB of RAM per series, so to handle a database with 1 million series you will need a server with at least 16GB of RAM (the actual number depends on series size). This will be improved eventually, but at the moment this is what we've got.

Disclaimer: I'm the author of Akumuli so I'm a bit biased. But I'll be happy to get any feedback, good or bad.

Upvotes: 2
