Reputation: 2472
I am currently setting up a simple NiFi flow that reads from a RDBMS source and writes to a Hive sink. The flow works as expected until the PuHiveSql processor, which is running extremely slow. It inserts one record every minute approximately.
Currently is setup as a standalone instance running on one node.
The logs showing the insert every 1 minute approx:
(INSERT INTO customer (id, name, address) VALUES (x, x, x)
)
Any ideas about why this may be? Improvements to try?
Thanks in advance
Upvotes: 1
Views: 1372
Reputation: 31490
Inserting one record at a time into Hive will result extreme slowness.
As your doing regular insert into hive table:
Change your flow:
QueryDatabaseTable
PutHDFS
Then create Hive avro
table on top of HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC //incase if you need to store data in orc format
PutHDFS
Then create Hive orc
table on top of HDFS directory where you have stored the data.
Upvotes: 2
Reputation:
Are you poshing one record at time? if so may use the merge record process to create batches before pushing into HiveQL,
It is recommended to batch into 100 records : See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord process and set the number of records or/and timeout, it should speed-up considerably
Upvotes: 0