Reputation: 53
I'm ingesting flowfiles containing Avro records with NiFi and need to insert them into HBase. The flowfiles vary in size, but some contain more than 10,000,000 records. I use SplitAvro twice (first to split into chunks of 10,000 records, then to split each chunk into single records), then an ExecuteScript processor to pull out the HBase row key and add it as a flowfile attribute. Finally, I use PutHBaseCell (with a batch size of 10,000) to write to HBase using the row key attribute.
The processor that splits the Avro to 1 rec is very slow (Concurrent tasks is set to 5). Is there a way to speed that up? And is there a better way to load this Avro data into HBase?
(I am using a 2-node NiFi v1.2 cluster built from VMs; each node has 16 CPUs and 16 GB RAM.)
Upvotes: 1
Views: 539
Reputation: 18630
There is a new PutHBaseRecord processor that will be part of the next release (there is a 1.4.0 release being voted upon right now).
With this processor you never need to split your flowfiles: you send a flowfile with millions of Avro records straight to PutHBaseRecord, which is configured with an Avro reader.
You should get significantly better performance with this approach.
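To illustrate why the record-based approach avoids the splitting cost, here is a minimal Python sketch of the general idea: stream records from one source, take the row key from a field of each record, and group the results into batches of puts. This is not NiFi's or PutHBaseRecord's actual implementation. The function name `batched_puts`, the dict-shaped records, and the choice to drop the row key field from the column values are all assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical representation of one HBase put: (row key, {column: value}).
Put = Tuple[str, Dict[str, Any]]


def batched_puts(
    records: Iterable[Dict[str, Any]],
    row_key_field: str,
    batch_size: int,
) -> Iterator[List[Put]]:
    """Stream records and yield them as batches of puts.

    Records are consumed lazily, so only one batch is held in memory at a
    time and no per-record intermediate objects (the equivalent of
    single-record flowfiles) are ever created.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Assumption for this sketch: the row key field is not also
        # written as a column.
        yield [
            (
                str(record[row_key_field]),
                {k: v for k, v in record.items() if k != row_key_field},
            )
            for record in chunk
        ]


if __name__ == "__main__":
    # Stand-in for records decoded from a large Avro flowfile.
    sample = ({"id": i, "value": i * 2} for i in range(25))
    for batch in batched_puts(sample, row_key_field="id", batch_size=10):
        print(len(batch), "puts, first row key:", batch[0][0])
```

In the real processor the row key would presumably be taken from a record field chosen in the processor configuration, which would also make the ExecuteScript step in the question unnecessary; check the PutHBaseRecord documentation for the exact property names once the release is out.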
Upvotes: 1