Josh Harrison

Reputation: 434

MergeContent with nifi - inconsistent length

I am attempting to write a file to disk with the MergeContent processor, but I'm getting widely varying file sizes, anywhere from one line to 806 lines. I've repeated the process many times while trying to figure out the newline demarcator, as addressed in "Apache NIFi MergeContent processor - set demarcator as new line", and the resulting files are sized seemingly at random.

What parameters do I need to set to adhere to the following logic?

  1. Establish a single bin
  2. Route all flowfiles into bin
  3. If len(bin)>X or the age of the bin is greater than Max Bin Age, release the bin
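
To make the intended release rule concrete, here is a minimal sketch in Python. It is not NiFi code: the `Bin` class, `should_release`, and the parameter names are hypothetical and only restate the three steps above.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    """Hypothetical stand-in for the single bin described above."""
    entries: int      # number of flowfiles currently in the bin
    age_sec: float    # seconds since the first flowfile entered the bin


def should_release(b: Bin, x: int, max_bin_age_sec: float) -> bool:
    """Release when len(bin) > X or the bin is older than Max Bin Age."""
    return b.entries > x or b.age_sec > max_bin_age_sec


# X = 5000 entries and a 10 second Max Bin Age, as in the question.
print(should_release(Bin(entries=5001, age_sec=2.0), 5000, 10.0))   # True
print(should_release(Bin(entries=120, age_sec=11.0), 5000, 10.0))   # True
print(should_release(Bin(entries=120, age_sec=2.0), 5000, 10.0))    # False
```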

To fully document, I currently have the following attributes defined (screenshot: Merge Content Processor settings).

As you can see, I've set "Max Bin Age" to "10 sec" following the syntax in https://github.com/apache/nifi/blob/31fba6b3332978ca2f6a1d693f6053d719fb9daa/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestMergeContent.java#L219 (this is the only place I've managed to find an example of this value; the documentation seems incomplete on this parameter).

I've set "Maximum Number of Entries" to 5000, and "Maximum number of Bins" to 1

What do I need to do to aggregate my records following the logic above? I also tried using the "Correlation Attribute Name" parameter with an attribute guaranteed to be identical on all documents reaching this point, and saw the same behavior.

Upvotes: 7

Views: 6083

Answers (2)

Ryan Shirley

Reputation: 339

In case anyone is having this exact issue, the cause may be not setting the schedule on the MergeContent processor. After a lot of troubleshooting, I realized that this is one of those processors where "0 sec" is not an appropriate schedule. I had already set Min Entries to some high number, and Max Entries as well. Max Bin Age was set to 5 min. It was the schedule that was causing the processor to keep grabbing flowfiles and bundling them up in random sizes.

Upvotes: 0

apiri

Reputation: 1633

The most important setting here is actually Minimum Number of Entries. The binning algorithm takes a lenient approach to the number of items: a bin becomes eligible for merging as soon as it satisfies the minimum, so a low minimum lets small bins be released.

For your specific logic, you would want to leave things as they stand and:

  • Set Minimum Number of Entries to 5000
  • Optionally, increase the maximum number of entries. Leaving it as configured will generate bins of exactly 5000 entries, except when Max Bin Age has been exceeded.
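
The effect of the minimum can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not NiFi's actual binning code: `simulate_merges` and the `arrivals` batch sizes are hypothetical, a single bin is modelled, Max Bin Age is ignored, and the low-minimum case assumes the minimum was left at 1.

```python
def simulate_merges(arrivals_per_trigger, min_entries, max_entries):
    """Toy model of one bin. Returns (merged bin sizes, leftover entries).

    arrivals_per_trigger: flowfiles waiting in the queue each time the
    processor is triggered (hypothetical numbers, for illustration only).
    """
    merged = []
    current = 0
    for arrived in arrivals_per_trigger:
        remaining = arrived
        while remaining > 0:
            # Fill the bin, but never past the maximum.
            take = min(remaining, max_entries - current)
            current += take
            remaining -= take
            if current >= max_entries:
                merged.append(current)
                current = 0
        # At the end of a trigger, a bin that meets the minimum is merged.
        if current > 0 and current >= min_entries:
            merged.append(current)
            current = 0
    return merged, current


arrivals = [1, 806, 4193, 5000, 10000]  # 20000 flowfiles in total

# Low minimum: bin sizes follow whatever happened to be queued.
print(simulate_merges(arrivals, min_entries=1, max_entries=5000))
# ([1, 806, 4193, 5000, 5000, 5000], 0)

# Minimum equal to maximum: every merged bin has exactly 5000 entries.
print(simulate_merges(arrivals, min_entries=5000, max_entries=5000))
# ([5000, 5000, 5000, 5000], 0)
```

In this model, raising the minimum to match the maximum is what removes the dependence on how many flowfiles are queued at each trigger.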

Below is an image of the configuration above, where min and max bin size are both 5000 and only 1 bin is handled at a time. In this case you'll see that exactly 20000 files have been merged into 4 files.

Sample execution for a min and max bin size of 5000

Upvotes: 7
