Alejandro A

Reputation: 1190

Best approach for this data pipeline?

I need to design a pipeline using NiFi, and I am weighing two approaches. I am unsure which processors to use, so maybe you can help me.

The scenario is the following: I need to ingest some .csv files into HDFS. They do not contain the date column I want to use to partition the Hive tables I will build on top of them, so I thought of two options:

  1. At some point during the .csv processing, have NiFi launch some kind of code snippet that modifies the .csv file, adding a column with the date.
  2. Create a temporary (internal?) table in Hive, alter that table to add the date column, and finally insert the rows into the table that is partitioned by date.
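For option 1, a minimal sketch of the kind of script NiFi could invoke (for example via ExecuteStreamCommand) might look like the following. The column name `load_date` and the stdin/stdout interface are assumptions for illustration, not part of the question:

```python
# Sketch for option 1: append a date column to a CSV stream.
# Reads CSV rows on stdin, adds a "load_date" column (name is an
# assumption) holding today's date, and writes the result to stdout,
# which is the shape ExecuteStreamCommand expects.
import csv
import sys
from datetime import date

def add_date_column(infile, outfile, col_name="load_date"):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    today = date.today().isoformat()
    # Extend the header row with the new column name.
    header = next(reader)
    writer.writerow(header + [col_name])
    # Append today's date to every data row.
    for row in reader:
        writer.writerow(row + [today])

if __name__ == "__main__":
    add_date_column(sys.stdin, sys.stdout)
```

As the accepted answer below shows, though, NiFi's record processors can do this without any external script.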

I am unsure which option is better (memory-wise, simplicity, resource management), whether both are even possible, or whether there is a better way entirely. I am also unsure which NiFi processors to use.

Any help is appreciated, thanks.

Upvotes: 0

Views: 48

Answers (1)

Bryan Bende

Reputation: 18630

You should be able to do #1 easily in NiFi without writing any code :)

The steps would be something like this:

  1. A source processor to get your CSV from somewhere, probably GetFile
  2. UpdateAttribute to add an attribute holding the current date
  3. UpdateRecord with a CSVReader and CSVRecordSetWriter, which adds a new date field populated with the value from #2
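For step 2, the date attribute can be set with a single dynamic property on UpdateAttribute using NiFi Expression Language; the property name `csv.date` here is just an example:

```
# UpdateAttribute dynamic property (name is an example):
csv.date = ${now():format('yyyy-MM-dd')}
```

In step 3, UpdateRecord can then reference that attribute, for example with a user-defined property such as `/date` set to `${csv.date}` under the "Literal Value" replacement strategy; see the linked template for the exact configuration.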

I've created an example of how to do this and posted the template here:

https://gist.githubusercontent.com/bbende/113f8fa44250c09a5282d04ee600cd09/raw/c6fe8b1b9f31bb106f9c816e4fd5ea90ebe19f80/CsvAddDate.xml

Save that XML file and use the palette on the left of the NiFi canvas to upload it as a template. Then instantiate it by dragging the template icon from the top toolbar onto the canvas.

Upvotes: 2
