Grigory Skvortsov

Reputation: 463

NiFi: Proper way to consume Kafka and store data into Hive

I have a task to create a Kafka consumer that should extract messages from Kafka, transform them, and store them into a Hive table.

So, in the Kafka topic there are a lot of messages as JSON objects.

I'd like to add some fields and insert them into Hive.

I created a flow with the following NiFi processors:

  1. ConsumeKafka_2_0
  2. JoltTransformJSON - to transform the JSON
  3. ConvertRecord - to turn the JSON into an INSERT query for Hive
  4. PutHiveQL

The topic will be under significant load, handling about 5 GB of data per day.

So, are there any ways to optimize my flow (I think it's a bad idea to send a huge number of INSERT queries to Hive)? Maybe it would be better to use an external table and the PutHDFS processor (in that case, how should I handle partitioning and merging the input JSON into one file?)

Upvotes: 1

Views: 1441

Answers (1)

mattyb

Reputation: 12093

As you suspect, using PutHiveQL to perform a large number of individual INSERTs is not very performant. Using your external table approach will likely be much better. If the table is in ORC format, you could use ConvertAvroToORC (for Hive 1.2) or PutORC (for Hive 3) which both generate Hive DDL to help create the external table.
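For illustration, the external table over the ORC files landed in HDFS would look roughly like this; the table name, columns, and HDFS location below are just placeholders, and the actual DDL depends on your schema:

    -- External table over ORC files written to HDFS by the NiFi flow.
    -- Table name, columns, and LOCATION are placeholders for your schema.
    CREATE EXTERNAL TABLE IF NOT EXISTS my_events (
      event_id   BIGINT,
      event_time STRING,
      payload    STRING
    )
    STORED AS ORC
    LOCATION '/data/my_events';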

There are also Hive streaming processors, but if you are using Hive 1.2 PutHiveStreaming is not very performant either (but should still be better than PutHiveQL with INSERTs). For Hive 3, PutHive3Streaming should be much more performant and is my recommended solution.
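If you go the streaming route, keep in mind that Hive streaming can only write to transactional (ACID) tables stored as ORC, and for Hive 1.2 the table must also be bucketed. A sketch of such a target table, with placeholder names and bucket count:

    -- Transactional (ACID) table that the Hive streaming processors can write to.
    -- Names and bucket count are placeholders; Hive 1.2 additionally requires bucketing.
    CREATE TABLE my_events_stream (
      event_id   BIGINT,
      event_time STRING,
      payload    STRING
    )
    CLUSTERED BY (event_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');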

Upvotes: 2
