user393144

Reputation: 1635

Hive/HDFS for real-time storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements:

  1. 1 million endpoints, each sending 100 bytes of data every minute (as a time series).
  2. This amounts to millions of small writes to storage. The data is write-once; it is never updated.
  3. Access requirements:
    a. The full data for a user needs to be accessed periodically (less frequently).
    b. Partial data for a user needs to be accessed periodically (more frequently). For example, I need the sensor data collected over the last hour/day/week/month for analysis and reporting.
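
For a sense of scale, here is a rough back-of-the-envelope calculation of the load implied by the numbers above. It counts raw payload only; replication, row keys, indexing overhead, and compression are ignored, and the function and constant names are just illustrative.

```python
# Back-of-the-envelope sizing for the workload described above.
# Raw payload only: replication, keys, and index overhead are NOT included.
ENDPOINTS = 1_000_000        # number of endpoints (from the requirements)
PAYLOAD_BYTES = 100          # bytes per report (from the requirements)
REPORTS_PER_MINUTE = 1       # each endpoint reports once a minute


def workload(endpoints=ENDPOINTS, payload=PAYLOAD_BYTES, per_minute=REPORTS_PER_MINUTE):
    """Return write rate and raw storage growth for the given parameters."""
    writes_per_minute = endpoints * per_minute
    bytes_per_day = writes_per_minute * payload * 60 * 24
    return {
        "writes_per_second": writes_per_minute / 60,
        "raw_bytes_per_day": bytes_per_day,
        "raw_bytes_per_year": bytes_per_day * 365,
    }


if __name__ == "__main__":
    w = workload()
    print(f"{w['writes_per_second']:.0f} writes/s")
    print(f"{w['raw_bytes_per_day'] / 1e9:.0f} GB/day raw")
    print(f"{w['raw_bytes_per_year'] / 1e12:.2f} TB/year raw")
```

So the sustained rate is on the order of 17,000 small writes per second, with raw data growing by roughly 144 GB per day before any storage overhead.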

I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive to such a use case? I am concerned that, while it would meet the distributed storage needs, it seems better suited to data warehousing applications than to real-time data collection and storage.

Do HBase/Cassandra make more sense in this scenario?

Upvotes: 3

Views: 2584

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25909

I think HBase can be a good option for you. In fact, there is already an open-source implementation built on HBase that solves a similar problem and that you might want to use: take a look at OpenTSDB. Here's a short excerpt from their blurb:

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable. Thanks to HBase's scalability, OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundreds of thousands of time series and collects over 600 million data points per day in their main production datacenter.

Upvotes: 6

Tyler Hobbs

Reputation: 6932

There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.

Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute (roughly 17,000 per second).

Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply run several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then a second query would fetch the data for all of those sensors (although you might break up this second query if the number of sensors is high).
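
To make the access pattern concrete, here is a minimal in-memory sketch of that layout in plain Python. It is a toy stand-in, not Cassandra client code: the class and method names (`TimeSeriesStore`, `read_slice`, `read_user`) are hypothetical, and it only illustrates the idea of one row per sensor with columns kept sorted by timestamp, plus a user-to-sensor map.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Toy model of a wide-row time-series layout (NOT a Cassandra client).

    One "row" per sensor; each row is a list of (timestamp, value) pairs
    kept sorted by timestamp, so a time range is a contiguous slice.
    """

    def __init__(self):
        self._rows = defaultdict(list)          # sensor_id -> sorted [(ts, value)]
        self._user_sensors = defaultdict(set)   # user_id -> {sensor_id}

    def register(self, user_id, sensor_id):
        """Record that a sensor belongs to a user."""
        self._user_sensors[user_id].add(sensor_id)

    def write(self, sensor_id, ts, value):
        """Insert one reading, keeping the row ordered by timestamp."""
        bisect.insort(self._rows[sensor_id], (ts, value))

    def read_slice(self, sensor_id, start_ts, end_ts):
        """Return readings with start_ts <= ts < end_ts for one sensor."""
        row = self._rows.get(sensor_id, [])
        # (start_ts,) sorts before any (start_ts, value), so this finds
        # the first reading at or after start_ts.
        lo = bisect.bisect_left(row, (start_ts,))
        hi = bisect.bisect_left(row, (end_ts,))
        return row[lo:hi]

    def read_user(self, user_id, start_ts, end_ts):
        """Look up the user's sensors, then slice each sensor's row."""
        sensors = sorted(self._user_sensors.get(user_id, ()))
        return {s: self.read_slice(s, start_ts, end_ts) for s in sensors}


if __name__ == "__main__":
    store = TimeSeriesStore()
    store.register("user-1", "sensor-a")
    for ts in (60, 120, 180):
        store.write("sensor-a", ts, float(ts))
    print(store.read_user("user-1", 100, 200))
```

In a real cluster the per-sensor slices in `read_user` would be issued in parallel rather than in a loop, and a very long-lived sensor's row would probably be split into time buckets; both are left out here to keep the sketch short.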

Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.

Upvotes: 4
