Petr Novák

Reputation: 115

Pentaho data integration, csv input, add filename column

I am using Pentaho Data Integration with a CSV file input. I want to read all columns from the CSV, add a column containing the filename without its extension (for abcd.csv, I want abcd), and insert the rows into a database table.

Any suggestions on how I can add a filename column to each row?

Upvotes: 1

Views: 7778

Answers (1)

Shastings

Reputation: 167

I know it sounds kind of backwards, but you will probably want to parse your CSV file with the Text File Input step rather than CSV Input. CSV Input offers only a subset of the Text File Input options, in exchange for some performance advantages on delimited files.

Text File Input gives you many more options for reading the file. You can set the Filetype to CSV and select your separator in the Content tab, then list the fields you want to grab in the Fields tab. This step would solve your problem because its Additional output fields tab lets you name a field in the stream to receive the filename, extension, file path, etc.
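Since the question asks for abcd rather than abcd.csv, the filename value may still need its directory and extension stripped in a later step, depending on which output field you pick and how the step reports it. Here is a minimal sketch of that string logic in plain Java. The class name FileBaseName and the method baseName are hypothetical, not part of Pentaho, and the code assumes the value arrives as an ordinary path or URI-like string.

```java
// Hypothetical helper, not part of Pentaho: derives "abcd" from a value
// such as "abcd.csv" or "/data/in/abcd.csv".
final class FileBaseName {
    private FileBaseName() {}

    static String baseName(String path) {
        if (path == null) {
            return null;
        }
        // Drop any directory part; accept both '/' and '\' as separators.
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(slash + 1);
        // Drop only the last extension. A name that starts with its only dot
        // (".hidden") is returned unchanged.
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }
}
```

Calling FileBaseName.baseName("/data/in/abcd.csv") returns "abcd". The same logic could be ported to whichever scripting or calculation step you prefer inside the transformation.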

The advantages you gain when using the CSV Input are:

  • NIO: Native system calls for reading the file mean faster performance, but this is currently limited to local files. No VFS support.
  • Parallel running: If you configure this step to run in multiple copies or in clustered mode, and you enable parallel running, each copy will read a separate block of a single file, allowing you to distribute the file reading to several threads or even several slave nodes in a clustered transformation.
  • Lazy conversion: If you will be reading many fields from the file and many of those fields will not be manipulated, but merely passed through the transformation to land in some other text file or a database, lazy conversion can prevent Kettle from performing unnecessary work on those fields, such as converting them into objects like strings, dates, or numbers.

If you need those advantages, then you'll have to get the file name another way, such as passing it in as a named parameter and adding it to the stream with the Get Variables step.

Upvotes: 5