Piar

Reputation: 93

Hadoop/Cassandra - how to store and analyse data from thousands of sensors?

I am quite new to "Big Data" technologies, especially Cassandra, so I need your advice for the task I have to do. I have been looking at the Datastax examples on handling time series, and at various discussions here on this topic, but if you think I might have missed something, feel free to tell me. Here is my problem.

I need to store and analyze data coming from about 100 sensor stations that we are testing. Each sensor station contains several thousand sensors. For each station, we run several tests (about 10, each lasting about 2h30), during which the sensors record a value every millisecond (boolean, integer or float). The records of each test are kept on the station while the test runs, then sent to me once the test is completed. That means about 10 GB per test (each parameter amounts to about 1 MB of data).

Here is a schema to illustrate the hierarchy: [image: hierarchy description]

Right now, I have access to a small Hadoop cluster with Spark and Cassandra for testing. I may be able to install other tools, but I would really prefer to keep working with Spark/Cassandra.

My question is: what could be the best data model for storing then analyzing the information coming from these sensors?

By “analyzing”, I mean mainly running aggregations over the recorded values and computing correlations between parameters.

I was thinking of putting all the information in a Cassandra Table with the following data model:

CREATE TABLE data_stations (
    station   text,       // station ID
    test      int,        // test ID
    parameter text,       // name of recorded parameter/sensor
    tps       timestamp,  // timestamp of the sample
    val       float,      // measured value
    PRIMARY KEY ((station, test, parameter), tps)
);

However, I don’t know if one table would be able to handle all the data: a quick calculation gives about 10^14 rows under the preceding data model (100 stations × 10 tests × 10,000 parameters × 9,000,000 ms (2h30 in milliseconds) ≈ 10^14), even if each partition holds “only” 9,000,000 rows.

Other ideas were to split the data across several tables (e.g. one table per station, or one table per test per station, etc.). I don’t know which option to choose, or how, so any advice is welcome!
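For instance, a variant I sketched (the one-minute bucket size is only a guess that would need tuning) keeps a single table but adds a time bucket to the partition key, so each partition stays bounded. At one sample per millisecond, a one-minute bucket caps each partition at 60,000 rows:

CREATE TABLE data_stations (
    station   text,       // station ID
    test      int,        // test ID
    parameter text,       // name of recorded parameter/sensor
    bucket    timestamp,  // start of the one-minute time bucket (assumed granularity)
    tps       timestamp,  // exact timestamp of the sample
    val       float,      // measured value
    PRIMARY KEY ((station, test, parameter, bucket), tps)
);

Reading a time range would then touch a sequence of small, contiguous partitions instead of one 9,000,000-row partition, while the total number of rows stays the same.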

Thank you very much for your time and help; if you need more information or details, I would be glad to tell you more.

Piar

Upvotes: 2

Views: 352

Answers (1)

Ani Menon

Reputation: 28209

You are on the right track; Cassandra can handle such data. You can store all the data you want in column families and use Apache Spark over Cassandra to do the required aggregations.

I feel Apache Spark is a good fit for your use case, as it can be used both for aggregations and for calculating correlations.
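As a minimal sketch (assuming the table above lives in a keyspace named sensors, and using hypothetical station/test IDs and parameter names), the spark-cassandra-connector lets you read the table as a DataFrame and run both kinds of computation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SensorAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sensor-analysis")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
      .getOrCreate()

    // Read the Cassandra table as a DataFrame ("sensors" keyspace is an assumption)
    val readings = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sensors", "table" -> "data_stations"))
      .load()

    // Aggregations: per-parameter statistics for one test (hypothetical IDs)
    readings
      .filter(col("station") === "station_001" && col("test") === 1)
      .groupBy("parameter")
      .agg(min("val"), max("val"), avg("val"), stddev("val"))
      .show()

    // Correlation between two parameters: pivot to one column per parameter,
    // aligned on the timestamp, then use the built-in Pearson correlation
    val wide = readings
      .filter(col("station") === "station_001" && col("test") === 1)
      .groupBy("tps")
      .pivot("parameter", Seq("param_a", "param_b")) // hypothetical parameter names
      .agg(first("val"))
    println(wide.stat.corr("param_a", "param_b"))

    spark.stop()
  }
}

You would submit this with the spark-cassandra-connector package on the classpath; the connector pushes supported filters down to Cassandra where it can, reducing the data each Spark task has to read.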

You may also check out Apache Hive, as it can query data in HDFS directly (through external tables).

Check these:

Cassandra - Max. size of wide rows?

Limitations of Cassandra

Upvotes: 0
