Reputation: 3947
I'm still pretty new to Pentaho Spoon. I'd like to know whether this can be achieved.
In the past I had many bad experiences with SSIS, so at the time I decided to develop my own ETL in C#. In practice, .Net only does the extract step: data is inserted into the DB, and the rest of the ETL is done by the MSSQL query engine from plain SQL stored in txt files, which .Net reads and executes against MSSQL.
My idea is to move from .Net to Java and use Spoon's features. The advantage is that I'd have Spoon's components available, Table Output for example.
One issue I have is that some flat files arrive corrupted. For example, accented letters get replaced by the separator character, so I can't just tell the ETL tool to split columns on the separator; I first need to verify how many separators are present and handle the line if there are more than expected.
I also need to verify whether a file was already processed, whether it has finished being copied over the network, etc. And I don't want SQL code stored in Execute SQL Script components; I want it saved in plain txt files so that Subversion can track changes, with the ETL tool reading those files and sending them to MSSQL for execution.
So, my idea would be to build the ETL normally in Spoon's GUI, and then use Eclipse to develop against its SDK to customize the execution. For example, I'd use the standard Text File Input component in the GUI, but my jar would contain its own class extending the standard one, overriding the method responsible for receiving a line string and splitting it into fields, so that any issue can be handled there.
My jar would instantiate my class instead of Spoon's and hand its object to the engine.
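The kind of split-and-repair logic described above can be sketched in plain Java, independent of any Spoon class. The helper below is a hypothetical illustration (the class and parameter names are assumptions, not part of any Spoon API): it counts the fields after splitting and, when corruption has injected extra separators, merges the surplus pieces back into the one free-text column where the damage is assumed to occur.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Spoon/Kettle class: split a line on the
// separator and, if corruption injected extra separators, merge the
// surplus pieces back into the suspect free-text column.
public class LineSplitter {

    // expectedFields: number of columns the layout defines.
    // suspectIndex: the column assumed to contain the corrupted free text.
    public static String[] split(String line, char separator,
                                 int expectedFields, int suspectIndex) {
        String[] raw = line.split(Pattern.quote(String.valueOf(separator)), -1);
        if (raw.length <= expectedFields) {
            return raw;                           // nothing to repair
        }
        int surplus = raw.length - expectedFields;
        String[] fixed = new String[expectedFields];
        // fields before the suspect column are kept as-is
        for (int i = 0; i < suspectIndex; i++) {
            fixed[i] = raw[i];
        }
        // re-join the suspect column with the surplus pieces
        StringBuilder sb = new StringBuilder(raw[suspectIndex]);
        for (int i = 1; i <= surplus; i++) {
            sb.append(separator).append(raw[suspectIndex + i]);
        }
        fixed[suspectIndex] = sb.toString();
        // fields after the surplus shift back into place
        for (int i = suspectIndex + 1; i < expectedFields; i++) {
            fixed[i] = raw[i + surplus];
        }
        return fixed;
    }
}
```

Whether you put this in a class extending Text File Input or in a scripting step, the core decision (count separators, then repair) stays the same.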
Is it viable, or too complex to bother with?
Upvotes: 0
Views: 396
Reputation: 6356
It seems the right approach to me.
In the case of accents being replaced by separators, first check whether it is a file character-set issue. If it is, and the charset varies from file to file, put the charset in a variable or derive it with some logic based on the file name or directory.
If it is not a charset issue, then read each row of the file as one string and drop it into a JavaScript step to split it with a regexp. The JavaScript step is easier to handle than the Java or Regex steps, which are good choices but less flexible to start with.
To check whether a file has already been processed, follow the usual scheme: move processed files into a dedicated directory, then use the File exists step to decide whether to process each incoming file (adapt freely to your case).
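The move-then-check pattern above can be sketched in a few lines of plain Java (the class and directory names are assumptions for illustration, not Kettle API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the "processed directory" pattern: a file is
// considered processed once a file with the same name sits in processedDir.
public class ProcessedFileCheck {

    public static boolean alreadyProcessed(Path incoming, Path processedDir) {
        return Files.exists(processedDir.resolve(incoming.getFileName()));
    }

    // After a successful load, move the file so the next run skips it.
    public static void markProcessed(Path incoming, Path processedDir)
            throws IOException {
        Files.createDirectories(processedDir);
        Files.move(incoming, processedDir.resolve(incoming.getFileName()),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For the "still being copied over the network" concern, a common heuristic (again an assumption, not something the File exists step does) is to compare the file size at two points in time and only process the file once it stops growing.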
To execute SQL scripts contained in files, read their content into a field and feed it to the Dynamic SQL row step.
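Since the scripts live in Subversion-tracked txt files and target MSSQL, the reading side can be sketched like this (the class name and the convention of splitting batches on `GO` lines are assumptions for illustration; adapt to how your scripts are written):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read a version-controlled .sql file and split it
// into batches on lines containing only "GO" (the MSSQL batch separator).
public class SqlFileReader {

    public static List<String> readBatches(Path sqlFile) throws IOException {
        List<String> batches = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : Files.readAllLines(sqlFile, StandardCharsets.UTF_8)) {
            if (line.trim().equalsIgnoreCase("GO")) {
                if (current.length() > 0) {
                    batches.add(current.toString().trim());
                    current.setLength(0);
                }
            } else {
                current.append(line).append('\n');
            }
        }
        if (current.toString().trim().length() > 0) {
            batches.add(current.toString().trim());
        }
        return batches;
    }
}
```

Each batch string can then be handed to the Dynamic SQL row step (or to a JDBC statement, if you stay on the pure-Java path).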
You can put all that in a custom plugin, so each time you open Spoon you will have your own step. This is really easy to do, but to tell the truth, I have not done it that way in a long time. It is much easier to use the Transformation Executor step in another transformation or in a job.
And while we are on this kind of question, let me mention Chapter 22 of the Pentaho Kettle Solutions book, which explains (with examples) how to call Kettle from Java. That is great fun.
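For a taste of what calling Kettle from Java looks like, here is a minimal sketch using the PDI API (it needs the Kettle jars on the classpath; the .ktr path and the variable name are assumptions for illustration):

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransform {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                        // initialize the engine and plugins
        TransMeta meta = new TransMeta("my_transform.ktr"); // path is an assumption
        Trans trans = new Trans(meta);
        trans.setVariable("CHARSET", "ISO-8859-1");      // hypothetical variable used by the transform
        trans.execute(null);                             // run with no extra arguments
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```

This is the embedding route the book chapter walks through in detail, including how to pass parameters and inspect step results.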
-- If you need further support, accept this answer and post a new question. --
Upvotes: 1