Reputation: 434
We are looking to build a single pipeline within a code repository that cleans, harmonizes, and transforms data to features of interest. We would like to apply that single pipeline code on different inputs and then test how the outputs look.
For example, we would like to test the pipeline on synthetic data, version 1 of 'real' data that includes only retrospective data, and version 2 of 'real' data that includes retrospective and prospective data. The comparison of the outputs could be what percent of patients had diabetes in version 1 compared to version 2.
I saw that you could template code repositories in Foundry. Is this a viable option? Could you template your code repository and apply it to the three scenarios I have provided? Is there a better option?
Upvotes: 2
Views: 192
Reputation: 1747
If your data scale is reasonably small, I would recommend going down the test-driven path of development here instead of trying to compare and contrast results across a wide variety of datasets. Otherwise you will probably find that both the iteration time and the difficulty of comparing results exactly are quite high.
For this, you should follow the method I lay out here and create a representative dataset for each input you expect as a .csv file in your repo. You can then pass each of these schemas as its own input to your core code and inspect the outputs with ease.
This will let you 'tighten' your code much more easily and quickly, after which you can run the same logic on the real full-scale data and generate your outputs as you wish.
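A minimal sketch of what I mean, assuming the core logic is kept as plain functions separate from any dataset I/O. The column names (`patient_id`, `diagnosis`), the function names, and the inline fixture are all hypothetical; in a Foundry repo the same idea would apply to functions that take and return a PySpark DataFrame, with the fixture stored as a small .csv file next to the tests.

```python
import csv
import io

# Hypothetical fixture: in a repo this would be a small .csv file,
# one per input you expect (synthetic, real v1, real v2).
FIXTURE_V1 = """patient_id,diagnosis
1,Diabetes
2,hypertension
3, diabetes
4,
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts (stand-in for reading a dataset)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(rows):
    """Harmonize the diagnosis column: trim whitespace and lowercase."""
    return [
        {
            "patient_id": row["patient_id"],
            "diagnosis": (row.get("diagnosis") or "").strip().lower(),
        }
        for row in rows
    ]


def percent_with_diabetes(rows):
    """Percent of distinct patients that have a diabetes diagnosis."""
    patients = {r["patient_id"] for r in rows}
    if not patients:
        return 0.0
    diabetic = {r["patient_id"] for r in rows if r["diagnosis"] == "diabetes"}
    return 100.0 * len(diabetic) / len(patients)


def test_percent_with_diabetes():
    # Two of the four fixture patients have diabetes once cleaned.
    rows = clean(load_rows(FIXTURE_V1))
    assert percent_with_diabetes(rows) == 50.0
```

Because the fixture is tiny, a test like this runs in seconds, and adding a second fixture for the version 2 schema only needs another small file and another test function.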
Templating code is possible but should be adopted with great care. If what you're truly solving for is comparing and contrasting the execution of your code on arbitrary schemas, then you should use test-driven in-repo development. If what you're after is running a core set of logic across a wide variety of outputs after the code is working, then generated transforms are going to work well. If what you're really after is rolling out a large codebase of transformations across differently-permissioned projects, where each one needs to be completely independent of and configured separately from the others, then maybe you should consider templates. I would stick to test-driven development and generated transforms until you prove you need something more.
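To make the generated-transforms option more concrete, here is a rough sketch of the pattern in plain Python: a factory loops over the input names and builds one transform per input, all sharing the same core function. The dataset names, `core_logic`, and the in-memory `INPUTS` are hypothetical stand-ins. In Foundry I would expect each generated function to be decorated with `transform_df` from `transforms.api`, with its own `Input` and `Output` paths, rather than reading from a dictionary; treat that mapping as an assumption to check against the documentation.

```python
# Hypothetical in-memory stand-ins for the three input datasets.
INPUTS = {
    "synthetic": [{"patient_id": "1", "diagnosis": "diabetes"}],
    "real_v1": [
        {"patient_id": "1", "diagnosis": "diabetes"},
        {"patient_id": "2", "diagnosis": "asthma"},
    ],
    "real_v2": [
        {"patient_id": "1", "diagnosis": "diabetes"},
        {"patient_id": "2", "diagnosis": "asthma"},
        {"patient_id": "3", "diagnosis": "asthma"},
        {"patient_id": "4", "diagnosis": "asthma"},
    ],
}


def core_logic(rows):
    """The single shared pipeline step: percent of patients with diabetes."""
    patients = {r["patient_id"] for r in rows}
    if not patients:
        return 0.0
    diabetic = {r["patient_id"] for r in rows if r["diagnosis"] == "diabetes"}
    return 100.0 * len(diabetic) / len(patients)


def generate_transforms(input_names):
    """Build one transform per input name, all sharing core_logic."""
    transforms = {}
    for name in input_names:
        # Bind `name` as a default argument so each function keeps its own
        # input name instead of the loop's final value.
        def compute(source_name=name):
            return core_logic(INPUTS[source_name])

        transforms[name] = compute
    return transforms


TRANSFORMS = generate_transforms(["synthetic", "real_v1", "real_v2"])

# Run every generated transform and collect the comparable outputs.
results = {name: run() for name, run in TRANSFORMS.items()}
```

The point of the design is that the logic lives in one place (`core_logic`) while the list of inputs is just configuration, so adding a fourth dataset means adding one name rather than copying code.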
Upvotes: 0