Aviral Kumar

Reputation: 824

NiFi Content vs. Attribute Modification Techniques

In NiFi we can design a flow in two ways:

  1. Content-based modification (UpdateContent) - In this approach we directly modify the content of the flowfiles. With this, at each stage the flowfile content gets persisted to the content repository.

Sample Flow :

ListFile -> FetchFile -> ValidateRecord (sanity) -> UpdateContent -> CSVtoAvro -> AvrotoORC -> PutHDFS
  2. Attribute-based modification (UpdateAttribute) - In this approach we store the contents of the flowfiles in memory as attributes and modify them directly. Once the updates are done we write the attributes back to the flowfile content and then merge the flowfiles using MergeContent.

In terms of performance we get much better results in the first case; in the second case many of the processors are slow, like ExtractText and especially MergeContent. Having said that, I have also tuned concurrent threads and backpressure levels, but still could not achieve better performance.

ListFile -> FetchFile -> ExtractText -> UpdateAttribute -> AttributesToCSV -> CSVtoAvro -> AvrotoORC -> MergeContent -> PutHDFS (rough flow)
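Roughly, the extraction step is configured like this (the property names and regexes are illustrative; ExtractText puts each regex capture group into an attribute):

    ExtractText (one dynamic property per column, 200 in total):
        col.1    ^([^,]*)                  -> attribute col.1
        col.2    ^(?:[^,]*,){1}([^,]*)     -> attribute col.2
        ... and so on for all 200 columns ...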

I want to understand why the attribute approach is less performant and whether I am doing something wrong. Please suggest.

We have a 200-column file, with all columns treated as attributes for modification. The machine has 32 GB of RAM (16 GB for NiFi), a quad-core Intel Core i7-4771 and a 500 GB HDD.

Upvotes: 1

Views: 1552

Answers (1)

VB_

Reputation: 45712

A little bit of theory

  1. Content-based modification - is based on the Content Repository. It's just multiple binary append-only files on NiFi's local disk that are linked to Flow Files by file path and offset (here you can find more).
  2. Attribute-based modification - attributes are just a map inside the JVM heap, backed by a Write-Ahead Log (here you can find more). So attribute-based modification works with in-memory data and is faster, as the sketch below illustrates.
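To make the difference concrete, here is a minimal custom-processor sketch (a hypothetical class using the standard nifi-api; the attribute name and value are made up) showing where each kind of modification actually lands:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;

    // Hypothetical demo processor -- not a processor from the question's flow.
    public class ContentVsAttributeDemo extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }

            // Content-based modification: copy-on-write against the Content
            // Repository. The old content claim stays where it is; the modified
            // bytes are appended as a new claim (file path + offset) on disk.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream out) throws IOException {
                    out.write(in.readAllBytes()); // read old claim, write new claim
                }
            });

            // Attribute-based modification: a put into the flow file's in-heap
            // attribute map, persisted via the FlowFile Repository's
            // write-ahead log -- no Content Repository I/O involved.
            flowFile = session.putAttribute(flowFile, "col.1", "sanitized-value");

            session.transfer(flowFile, REL_SUCCESS);
        }
    }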

Two possible issues

  1. It doesn't look to me like you're working with attribute-based modification. MergeContent still works on content, so you need to drop the Flow File content after UpdateAttribute and before MergeContent (see the ModifyBytes sketch after this list).

  2. Alternatively, you may also check the volume of attributes. If you have too many attributes, the in-memory map will be spilled to disk and you will lose the benefit of working in memory (see the swap-threshold note below). But I think the first point is the issue.
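On the first point, a minimal way to drop the content, assuming a stock NiFi install, is the built-in ModifyBytes processor placed right after UpdateAttribute (the offsets shown are just its defaults):

    ModifyBytes:
        Start Offset         0 B
        End Offset           0 B
        Remove All Content   true    <- clears the content claim entirely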
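On the second point, the relevant knob is the per-connection swap threshold in nifi.properties (default value shown); once a queue grows past it, Flow Files are swapped out to disk, attributes included:

    # nifi.properties
    # Queues longer than this are swapped to disk, attributes and all,
    # which silently turns the "in-memory" approach back into disk I/O.
    nifi.queue.swap.threshold=20000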

P.S.

If you think that's not the case, update your question with the number of flow files, the volume of text extracted to attributes, machine characteristics, and maybe details about the content-based approach so I will be able to compare...

UPD after question update

Your content-based flow:

(1) ListFile -> (2) FetchFile -> (3) ValidateRecord (sanity) -> (4) UpdateContent -> (5) CSVtoAvro -> (6) AvrotoORC -> (7) PutHDFS

Here, at steps 3, 4, 5 and 6 you're doing copy-on-write: you read from the Content Repository (local file system) for each Flow File, modify it, and append the result back to the Content Repository. So you're doing 4 read-write iterations.

Your attribute-based flow:

(1) ListFile -> (2) FetchFile -> (3) ExtractText -> (4) UpdateAttribute -> (5) AttributesToCSV -> (6) CSVtoAvro -> (7) AvrotoORC -> (8) MergeContent -> (9) PutHDFS

Here, at steps 6 and 7 you are still doing 2 read-write iterations. Moreover, MergeContent is another bottleneck that is absent in the first option. MergeContent reads all input data from disk, merges it (in memory, I think) and copies the result back to disk. So steps 6, 7 and 8 are already slow enough to give you performance as bad as the content-based flow. Moreover, step 3 copies content to memory (another read from disk), and you may experience disk swaps. A typical MergeContent configuration is sketched below for reference.
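A bin-packing MergeContent setup usually looks like this (the values are illustrative, not a recommendation); every bin it fills is assembled from content read back off disk:

    MergeContent:
        Merge Strategy               Bin-Packing Algorithm
        Merge Format                 Binary Concatenation
        Minimum Number of Entries    1000
        Maximum Number of Entries    10000
        Max Bin Age                  5 min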

So with the attribute-based flow it looks like you have almost the same volume/amount of disk read/write transactions. At the same time you may also have contention for RAM (JVM heap), because all your content is stored in memory multiple times:

  • Each version (sanitized, updated, etc.) of an attribute is stored in memory
  • MergeContent may store another part of the data in memory

So maybe you have even more disk iterations because of disk swap (but this should be checked; it depends on the volume of files processed simultaneously).

Another point is that the answer depends on how you are doing the transformations.

Also, what processors are you using for the first approach? Are you aware of the QueryRecord processor? It can filter and transform records in a single streaming pass; a sketch follows.
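For illustration (the FLOWFILE table name is fixed by the processor; the column names are assumptions about your data, and the dynamic property name "sanitized" becomes an output relationship):

    QueryRecord:
        Record Reader    CSVReader
        Record Writer    AvroRecordSetWriter
        sanitized        SELECT UPPER(col1) AS col1, col2
                         FROM FLOWFILE
                         WHERE col1 IS NOT NULL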

Upvotes: 4
