Reputation: 31
I want to send data from Kafka (after doing some MapReduce job on it) to Hive.
Is Spark Streaming suitable for this?
Or is there a better way?
Upvotes: 3
Views: 5761
Reputation: 4154
There's already a Hive-Kafka ETL integration documented in Hive.
Users can create an external table that acts as a view over a Kafka topic.
For more info: https://github.com/apache/hive/tree/master/kafka-handler
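As a rough sketch of what that looks like, the kafka-handler DDL can be issued over Hive JDBC; here it's driven from Scala, assuming HiveServer2 on localhost:10000 and a placeholder topic, table name, and schema:

```scala
import java.sql.DriverManager

object CreateKafkaBackedTable {
  def main(args: Array[String]): Unit = {
    // hive-jdbc must be on the classpath; HiveServer2 assumed at localhost:10000
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    // External table that reads the Kafka topic directly via the Hive Kafka storage handler
    stmt.execute(
      """CREATE EXTERNAL TABLE kafka_events (`event_time` timestamp, `payload` string)
        |STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
        |TBLPROPERTIES (
        |  "kafka.topic" = "my-topic",
        |  "kafka.bootstrap.servers" = "localhost:9092"
        |)""".stripMargin)
    stmt.close()
    conn.close()
  }
}
```

Once the table exists, the topic can be queried with ordinary HiveQL, so no separate ingestion job is needed for simple cases.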
Upvotes: 1
Reputation: 191983
From a streaming perspective, Hive tables built ahead of time and written into by Spark Streaming or Flink will work fine, for the most part. But what if the schema of the Hive output changes in the Spark job? That's where you might want something like StreamSets, the Kafka Connect HDFS connector, or Apache Gobblin.
Also, keep in mind that HDFS doesn't deal well with tiny files, so setting up a large batch size before writing to HDFS will help later Hive consumption.
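A minimal sketch of that approach (not the asker's actual job): Spark Structured Streaming reads from Kafka, writes Parquet batches to an HDFS path on a long trigger interval to keep files reasonably sized, and a Hive external table is defined over the same path. The topic, paths, and interval are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Stream from Kafka; keep key/value as strings for this sketch
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // A long trigger interval batches more records per file,
    // which mitigates the small-files problem mentioned above
    val query = kafka.writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .start()

    // Hive sees the data through an external table over the same location
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (key STRING, value STRING)
        |STORED AS PARQUET
        |LOCATION 'hdfs:///warehouse/events'""".stripMargin)

    query.awaitTermination()
  }
}
```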
Upvotes: 1
Reputation: 32130
You can use Kafka Connect and the HDFS connector to do this. This streams data from Kafka to HDFS, and defines the Hive table on top automatically. It's available standalone or as part of Confluent Platform.
Disclaimer: I work for Confluent.
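For illustration, a connector like this is registered through the Kafka Connect REST API (assumed here at localhost:8083), with Hive integration turned on so the table is created for you. The connector name, topic, HDFS URL, and metastore URI below are placeholders; the full option list is in the connector documentation:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RegisterHdfsSink {
  def main(args: Array[String]): Unit = {
    // HDFS sink configuration as JSON, posted to the Connect REST API
    val config =
      """{
        |  "name": "hdfs-sink",
        |  "config": {
        |    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        |    "tasks.max": "1",
        |    "topics": "my-topic",
        |    "hdfs.url": "hdfs://namenode:8020",
        |    "flush.size": "10000",
        |    "hive.integration": "true",
        |    "hive.metastore.uris": "thrift://metastore:9083",
        |    "schema.compatibility": "BACKWARD"
        |  }
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8083/connectors"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(config))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()} ${response.body()}")
  }
}
```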
Upvotes: 3