Reputation: 824
In NiFi we can design a flow in two ways:
Content-based flow (sample):
ListFile -> FetchFile -> ValidateRecord (sanity) -> UpdateContent -> CSVtoAvro -> AvrotoORC -> PutHDFS
Attribute-based flow (rough):
ListFile -> FetchFile -> ExtractText -> UpdateAttribute -> AttributeToCSV -> CSVtoAvro -> AvrotoORC -> MergeContent -> PutHDFS
In terms of performance we are getting much better results in the first case; in the second case many of the processors are slow, especially ExtractText and MergeContent. Having said that, I have also tried adjusting concurrent threads and backpressure levels, but still could not achieve better performance.
I want to understand why the attribute approach is less performant and whether I am doing something wrong. Please suggest.
We have a file with 200 columns, all of which are treated as attributes for modification. The machine has 32 GB of RAM (16 GB allocated to NiFi), a quad-core Intel Core i7-4771, and a 500 GB HDD.
Upvotes: 1
Views: 1552
Reputation: 45712
It doesn't look to me like you're working with a purely attribute-based modification. MergeContent still operates on content, so you need to drop the FlowFile content after UpdateAttribute and before MergeContent.
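One way to do that (a sketch, assuming the standard ModifyBytes processor is available in your NiFi version) is to insert it somewhere between UpdateAttribute and MergeContent with its Remove All Content property set to true, which keeps the attributes but clears the FlowFile content:
... -> UpdateAttribute -> ModifyBytes (Remove All Content = true) -> ... -> MergeContent -> ...
That way the downstream processors stop dragging the full content through the Content Repository on every hop.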
Alternatively, you may also check the volume of attributes. If you have too many attributes, the in-memory map will be spilled to disk and you will lose the benefit of working in memory. But I think the first point is the issue.
If you think that's not the case, update your question with the number of flow files, the volume of text extracted into attributes, the machine characteristics, and maybe details about the content-based approach so I will be able to compare...
Your content-based flow:
(1) ListFile -> (2) FetchFile -> (3) ValidateRecord (sanity) -> (4) UpdateContent -> (5) CSVtoAvro -> (6) AvrotoORC -> (7) PutHDFS
Here, at steps 3, 4, 5 and 6 you're doing copy-on-write: each processor reads the content from the Content Repository (on the local file system) for each FlowFile, modifies it, and writes it back to the Content Repository. So you're doing 4 read-write passes.
Your attribute-based flow:
(1) ListFile -> (2) FetchFile -> (3) ExtractText -> (4) UpdateAttribute -> (5) AttributeToCSV -> (6) CSVtoAvro -> (7) AvrotoORC -> (8) MergeContent -> (9) PutHDFS
Here, at steps 6 and 7 you are still doing 2 read-write iterations. Moreover, MergeContent is another bottleneck that is absent from the first option: it reads all the input data from disk, merges it (in memory, I think), and copies the result back to disk. So steps 6, 7 and 8 are already slow enough to give you performance as bad as the content-based flow. In addition, step 3 copies content into memory (another read from disk), and you may experience disk swaps.
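To make the MergeContent cost concrete (the property names below belong to the standard processor; the values are only illustrative assumptions), a typical bin-packing setup looks like:
MergeContent
  Merge Strategy            : Bin-Packing Algorithm
  Minimum Number of Entries : 1000
  Maximum Number of Entries : 10000
  Max Bin Age               : 5 min
The processor has to accumulate a full bin of FlowFiles, read every one of their contents from the Content Repository, and write the merged result back as a new content claim, so the whole data set passes through the disk one more time before PutHDFS.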
So with the attribute-based flow it looks like you have almost the same volume of disk read/write transactions. At the same time you may also have contention for RAM (JVM heap), because all your content is stored in memory multiple times.
Another point is that the answer depends on how you are doing the transformations.
Also, what processors are you using in the first approach? Are you aware of the QueryRecord processor?
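For reference, a minimal QueryRecord sketch (the processor and its Record Reader/Writer properties are standard; the controller services and the query itself, including the "id" column and the "valid" relationship name, are just illustrative assumptions): configure a Record Reader and a Record Writer, then add a dynamic property whose name becomes an outgoing relationship and whose value is a SQL query over the FlowFile content:
QueryRecord
  Record Reader : CSVReader
  Record Writer : AvroRecordSetWriter
  valid         : SELECT * FROM FLOWFILE WHERE id IS NOT NULL
This filters/validates and converts CSV to Avro in a single pass over the content, instead of separate ExtractText / AttributeToCSV / conversion steps.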
Upvotes: 4