
Reputation: 434

NiFi: Need clarification on the MergeContent processor

Because I don't think it works the way my supervisor thinks it works.

We're taking in a series of about 8 CSV files from an FTP server, and these files are rather small (under 1 MB). He's (rightfully, I think) concerned that HDFS storage is going to be wasted on files much smaller than the block size, the classic "small files" problem. So he wants to use the MergeContent processor to resolve this. He seems to believe that the MergeContent processor will 'collate' files with the same name into a bigger single file.

To clarify: The way he wants it to work is if today's "sales_report.csv" comes in and there's already a "sales_report.csv" existing in the directory, he wants the new data from today's "sales_report.csv" to be added as new rows to the existing file. I hope that makes sense.

Instead, I'm getting very different results. I have the flow set up so that it picks the files up from the FTP server, creates a directory on HDFS based on the folder name, and then a subfolder based on the year. When I leave the MergeContent processor out, this all works perfectly. When I put it in, I get three files: one keeps its original name and the other two have names that are long strings of random characters. We're using the default settings for the MergeContent processor.

Based on what I've described above, does it sound like the MergeContent processor is what we're looking for?

Upvotes: 2

Views: 1176

Answers (1)

Andy

Reputation: 14194

The MergeContent processor works by combining multiple FlowFiles into a single FlowFile within the flow. That is not the same as appending new data to a file that already exists in HDFS, which is what your supervisor wants. It also likely explains the output you're seeing: when the FlowFiles placed in a bin have different filenames, the merged FlowFile ends up with a new, UUID-based filename, which is why two of your three files have names that look like long strings of random characters.
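To make the distinction concrete, here is a tiny illustrative Python sketch, with plain byte strings standing in for FlowFile payloads and a local file standing in for the file on HDFS:

```python
# Illustration only: MergeContent concatenates the payloads of several
# in-flight FlowFiles into one new FlowFile...
incoming_payloads = [b"a,100\n", b"b,200\n", b"c,300\n"]
merged_payload = b"".join(incoming_payloads)  # one new, bigger payload

# ...whereas what your supervisor wants is an append to a file that
# already exists at rest (a local file stands in for HDFS here).
with open("sales_report.csv", "ab") as existing_file:
    existing_file.write(merged_payload)
```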

To accomplish this, you have a few options:

  1. Keep your current flow without the MergeContent processor; you will still have the "small files" problem on HDFS.
  2. Use a SQL-like interface on top of HDFS, such as Hive (optionally backed by HBase). You can then consume the new data (today's sales_report.csv), treat the rows in that file as NiFi records, and persist them to the appropriate Hive table, which effectively accomplishes an append operation (see the first sketch after this list).
  3. Retrieve the existing sales_report.csv from HDFS, combine its contents with the new content using MergeContent, and persist the merged content back to HDFS (see the second sketch after this list). This read-modify-write cycle is fairly wasteful and not recommended (see "Iterative Processing" in Alan Gates' Pig and Hive at Yahoo!).
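
To make option 2 concrete, here is a minimal Python sketch of the idea outside of NiFi; inside a flow you would typically use record-oriented processors (e.g. ConvertRecord plus PutHiveStreaming) instead. The host, database, paths, and table schema below are hypothetical placeholders.

```python
# Minimal sketch of option 2: expose the CSVs through a Hive table so
# that each new file becomes an append at the table level.
# Assumes a running HiveServer2 and the `pyhive` package
# (pip install 'pyhive[hive]'); all names here are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="nifi", database="default")
cursor = conn.cursor()

# One-time setup: a table whose storage format matches the CSVs.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_report (
        sale_date STRING,
        item STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Each day: move today's file (already landed on HDFS) into the
# table's storage. LOAD DATA ... INTO TABLE appends rather than
# overwrites, so queries see the old and new rows together.
cursor.execute(
    "LOAD DATA INPATH '/landing/2018/sales_report.csv' "
    "INTO TABLE sales_report"
)
```

Note that the data then lives in Hive's tabular storage rather than as a single named CSV, which is why the questions below about your actual requirements matter.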
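
And here is what option 3 amounts to, sketched in Python against WebHDFS; in NiFi the equivalent flow would be FetchHDFS -> MergeContent -> PutHDFS. The NameNode URL, user, and paths are hypothetical placeholders.

```python
# Minimal sketch of option 3: read-merge-rewrite a file on HDFS.
# Uses the `hdfs` WebHDFS client (pip install hdfs); the NameNode URL,
# user, and paths are placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="nifi")

hdfs_path = "/data/sales/2018/sales_report.csv"

# Read the file that already exists on HDFS.
with client.read(hdfs_path) as reader:
    existing = reader.read()

# Read today's newly arrived file (fetched from the FTP server).
# If the CSVs carry header rows, the duplicate header would also
# need to be stripped before concatenating.
with open("sales_report.csv", "rb") as f:
    new_rows = f.read()

# Rewrite the whole file with the combined contents. The entire
# existing payload crosses the network twice (down, then back up)
# on every run, which is why this option is the wasteful one.
client.write(hdfs_path, data=existing + new_rows, overwrite=True)
```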

Which option you pursue is dependent on your specific requirements:

  • Does the data need to be stored in the same file in HDFS, or just be accessible in the same directory?
  • Does the data need to be stored in the original CSV file format, or is tabular storage acceptable?
  • How large is the "existing" data stored in HDFS vs. the new incoming data?

Upvotes: 2
