Reputation: 434
Because I don't think it works the way my supervisor thinks it works.
We're taking in a series of about 8 csv files from an FTP, and these files are rather small (under 1MB). He's (rightfully, I think) concerned that HDFS block space is going to be wasted on them. So he wants to use the Merge Content processor to resolve this. He seems to believe that the Merge Content processor will 'collate' files with the same name, making a bigger single file.
To clarify: The way he wants it to work is if today's "sales_report.csv" comes in and there's already a "sales_report.csv" existing in the directory, he wants the new data from today's "sales_report.csv" to be added as new rows to the existing file. I hope that makes sense.
Instead, I'm getting very different results. I have the flow set up so that it picks the files up from the FTP, creates a directory on HDFS based on the folder, and then a subfolder based on the year. When I leave the MC processor out, this all works perfectly. When I put the MC processor in, I get three files - one of them has its original name and two of them have a long string of random characters. We're using the default settings for the Merge Content processor.
Based on what I've described above, does it sound like the MC processor is what we're looking for?
Upvotes: 2
Views: 1176
Reputation: 14194
The MergeContent processor works by combining multiple flowfiles into a single flowfile. This is not the same as appending new data to an existing file stored in HDFS (which is what your manager wants).
To accomplish this, you have a few options:

- Merge the incoming flowfiles together with the MergeContent processor before writing them to HDFS; you will still have the "small files" problem with HDFS.
- Retrieve the existing file from HDFS, merge it with the new content using MergeContent, and persist the new merged content back to HDFS. This is a fairly wasteful operation, and not recommended (see "Iterative Processing" in Alan Gates' Pig and Hive at Yahoo!).

Which option you pursue is dependent on your specific requirements.
Upvotes: 2