Alejandro A

Reputation: 1190

Best approach for this data pipeline?

I need to design a pipeline using NiFi, and I am weighing two approaches. I am unsure which processors to use, so maybe you can help me.

The scenario is the following: I need to ingest some .csv files into HDFS. They do not contain the date column I want to use to partition the Hive tables I will build on top of them, so I thought of two options:

  1. At some point during the .csv processing, have NiFi launch some kind of code snippet that modifies the .csv file, adding a column with the date.
  2. Create a temporary (internal?) table in Hive, alter that table to add the date column, and finally insert the rows into the table that is partitioned by date.
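For option 1, a minimal sketch of the kind of script NiFi could invoke (for example via ExecuteStreamCommand) might look like the following. The column name `load_date` and the stdin/stdout interface are assumptions for illustration, not part of the question:

```python
# Sketch for option 1: append a date column to a CSV stream.
# Reads CSV rows on stdin, adds a "load_date" column (name is an
# assumption) holding today's date, and writes the result to stdout,
# which is the shape ExecuteStreamCommand expects.
import csv
import sys
from datetime import date

def add_date_column(infile, outfile, col_name="load_date"):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    today = date.today().isoformat()
    # Extend the header row with the new column name.
    header = next(reader)
    writer.writerow(header + [col_name])
    # Append today's date to every data row.
    for row in reader:
        writer.writerow(row + [today])

if __name__ == "__main__":
    add_date_column(sys.stdin, sys.stdout)
```

As the accepted answer below shows, though, NiFi's record processors can do this without any external script.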

I am unsure which option is better (memory-wise, simplicity, resource management), whether both are even possible, or whether there is a better way entirely. I am also unsure which NiFi processors to use.

Any help is appreciated, thanks.

Upvotes: 0

Views: 48

Answers (1)

Bryan Bende

Reputation: 18630

You should be able to do #1 easily in NiFi without writing any code :)

The steps would be something like this:

  1. A source processor to get your CSV from somewhere, probably GetFile
  2. UpdateAttribute to add an attribute holding the current date
  3. UpdateRecord with a CSVReader and CSVRecordSetWriter, which adds a new date field populated with the value from #2
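For step 2, the date attribute can be set with a single dynamic property on UpdateAttribute using NiFi Expression Language; the property name `csv.date` here is just an example:

```
# UpdateAttribute dynamic property (name is an example):
csv.date = ${now():format('yyyy-MM-dd')}
```

In step 3, UpdateRecord can then reference that attribute, for example with a user-defined property such as `/date` set to `${csv.date}` under the "Literal Value" replacement strategy; see the linked template for the exact configuration.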

I've created an example of how to do this and posted the template here:

https://gist.githubusercontent.com/bbende/113f8fa44250c09a5282d04ee600cd09/raw/c6fe8b1b9f31bb106f9c816e4fd5ea90ebe19f80/CsvAddDate.xml

Save that XML file and use the palette on the left of the NiFi canvas to upload it as a template. Then instantiate it by dragging the template icon from the top toolbar onto the canvas.

Upvotes: 2
