Reputation: 2078
For research purposes I'm studying an architecture for real-time (and also offline) data analytics and semantic annotation. I've attached a basic schema: I have some sensors linked to a Raspberry Pi 3. I suppose I can handle this link with an MQTT broker like Mosquitto. However, I want to collect the data on the Raspberry Pi, do something with it, and forward it to a cluster of commodity hardware that performs real-time reasoning with Spark or Storm (any hint about which?).

These data then have to be stored in a NoSQL DB (probably Cassandra or HBase) accessible to a Hadoop cluster, which runs batch reasoning and semantic data enrichment on them and re-stores the results in the same DB. Clients can then query the system to extract useful information.
Which technology should I use in the red block? My idea is MQTT, but maybe Kafka would fit my purposes better?
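For concreteness, this is roughly what I have in mind for the sensor-to-broker link, as a minimal sketch using the Eclipse Paho Java client (the broker URL, topic, and payload are just placeholders):

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class SensorPublisher {
    public static void main(String[] args) throws MqttException {
        // Mosquitto running locally on the Raspberry Pi
        MqttClient client = new MqttClient("tcp://localhost:1883",
                MqttClient.generateClientId());
        client.connect();

        // topic and payload are placeholders for real sensor readings
        MqttMessage message = new MqttMessage("23.5".getBytes());
        message.setQos(1); // at-least-once delivery
        client.publish("sensors/pi3/temperature", message);

        client.disconnect();
    }
}
```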
Upvotes: 4
Views: 413
Reputation: 661
You can evaluate Apache Apex for your use case, as most of your requirements could be satisfied with it. Apache Apex also comes with the Apache Malhar project, which serves as an operator library for Apex. Since you are deciding to use the MQTT protocol, Malhar has prebuilt AbstractMQTTInputOperator/AbstractMQTTOutputOperator classes which you can extend, and they can serve as the input broker. Malhar also comes with various operators which can connect to different NoSQL DBs as well as dump to HDFS. With Apex you may not need Kafka in your proposed architecture. As you want to push data to Hadoop, Apex, being Hadoop-native, can actually reduce your deployment effort significantly. A minimal application skeleton is sketched below.
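To give a feel for the programming model, here is a minimal Apex application skeleton. The two toy operators are illustrative stand-ins for your concrete subclasses of Malhar's abstract MQTT input operator and one of its NoSQL/HDFS output operators:

```java
import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;
import org.apache.hadoop.conf.Configuration;

// Toy source standing in for an AbstractMQTTInputOperator subclass.
class SensorInput extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples() {
        out.emit("23.5"); // a real operator would emit MQTT payloads here
    }
}

// Toy sink standing in for one of Malhar's NoSQL/HDFS output operators.
class SensorSink extends BaseOperator {
    public final transient DefaultInputPort<String> in = new DefaultInputPort<String>() {
        @Override
        public void process(String reading) {
            System.out.println("got reading: " + reading);
        }
    };
}

public class SensorPipeline implements StreamingApplication {
    @Override
    public void populateDAG(DAG dag, Configuration conf) {
        SensorInput input = dag.addOperator("sensor-in", new SensorInput());
        SensorSink sink = dag.addOperator("sensor-out", new SensorSink());
        // wire the source's output port to the sink's input port
        dag.addStream("readings", input.out, sink.in);
    }
}
```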
Another interesting project I have come across is Apache Edgent, which can help you perform some real-time analytics on the edge devices themselves.
PS: I am a contributor to Apache Apex/Malhar project.
Upvotes: 1
Reputation: 538
What about using Apache NiFi?
There is an article describing a use case very similar to yours. To output your data to HDFS you can use PutHDFS or PutHiveQL, then use Hive LLAP to give your clients access to the data (a minimal JDBC sketch follows at the end of this answer).
Using Apache NiFi you can deliver a working prototype very fast, with zero (or almost zero) development. You will probably spend more time on performance tuning, deployment, and customization during the productization step of your system, but that part is mandatory with any open source tool.
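Once the data lands in Hive, clients can query it over JDBC. A minimal sketch, assuming HiveServer2/LLAP listens on the default port and a sensor_readings table exists (hostname, credentials, and table are all assumptions; you need the hive-jdbc driver on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryClient {
    public static void main(String[] args) throws Exception {
        // older drivers need an explicit load; newer ones register themselves
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-llap-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT sensor_id, AVG(value) FROM sensor_readings GROUP BY sensor_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}
```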
Upvotes: 0
Reputation: 601
Spark vs Storm
Spark is the clear winner right now between Spark and Storm. At least one reason is that Spark is much more capable of handling large data volumes in a performant way, while Storm struggles to process large volumes of data at high velocity. For the most part the big data community has embraced Spark, at least for now. Other technologies like Apex and Kafka Streams are also making waves in the stream-processing space. A minimal Spark sketch that reads from Kafka follows below.
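For illustration, here is a minimal Structured Streaming sketch in Java that reads a Kafka topic. The broker address and topic name are assumptions, and it just echoes raw key/value pairs to the console (you'd need the spark-sql-kafka package on the classpath):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SensorStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("sensor-streaming")
                .getOrCreate();

        // read the sensor topic from Kafka as a streaming DataFrame
        Dataset<Row> readings = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka-broker:9092")
                .option("subscribe", "sensor-readings")
                .load();

        // for the sketch, just dump the raw key/value pairs to the console
        readings.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                .writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}
```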
Producing to Kafka from the Raspberry Pi
If you choose the Kafka path, keep in mind that the Java client for Kafka is, in my experience, by far the most reliable implementation. However, I would do a proof of concept to ensure that there won't be any memory issues, since the Raspberry Pi doesn't have a lot of RAM on it.
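A minimal producer sketch for the Pi side; the broker address and topic are assumptions, and shrinking buffer.memory below its 32 MB default is just one knob to try when probing the memory footprint:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PiKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // shrink the send buffer from the 32 MB default to be gentle on the Pi's RAM
        props.put("buffer.memory", "8388608");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sensor-readings", "pi3", "23.5"));
        }
    }
}
```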
Kafka At the Heart
Keeping Kafka in your red box will give you a very flexible architecture going forward, because any process (Storm, Spark, Apex, Kafka Streams, a plain Kafka consumer) can connect to Kafka and quickly read the data. Having Kafka at the heart of your architecture gives you a "distribution" point for all your data: it is very fast, but it also allows data to be retained there durably. Keep in mind that you can't query Kafka, so using it means simply reading the messages as fast as you can to populate other datastores or to perform streaming calculations, as in the consumer loop sketched below.
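The "populate other datastores" part boils down to a poll loop like this minimal sketch (broker address, group id, topic, and the Cassandra/HBase write are all placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorSinkLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("group.id", "cassandra-sink");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // hypothetical sink: write each reading into Cassandra/HBase here
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        }
    }
}
```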
Upvotes: 5