Andy

Reputation: 23

Can I duplicate rows with kiba using a transform?

I'm currently using your gem to transform a CSV that was web-scraped from a personnel database that has no API.

From the scraping I ended up with a CSV. I can process it just fine using your gem; there's only one bit I'm wondering about.

Consider the following data:

====================================
| name  |  article_1   | article_2 |
------------------------------------
| Andy  |  foo         | bar       |
====================================

I can turn that into this:

======================
| name  |  article   |
----------------------
| Andy  |  foo       |
----------------------
| Andy  |  bar       |
======================

(I used this tutorial to do this: http://thibautbarrere.com/2015/06/25/how-to-explode-multivalued-attributes-with-kiba/)

I'm using the normalization logic on my source for this. The code looks like this:

source RowNormalizer, NormalizeArticles, CsvSource, 'RP00119.csv'
transform AddColumnEntiteit, :entiteit, "ocmw"


What I am wondering is: can I achieve the same using a transform? So that the code would look like this:

source CsvSource, 'RP00119.csv'
transform NormalizeArticles
transform AddColumnEntiteit, :entiteit, "ocmw"

So the question is: can I duplicate a row with a transform class?

Upvotes: 2

Views: 216

Answers (1)

Thibaut Barrère

Reputation: 8873

EDIT: Kiba 2 supports exactly what you need. Check out the release notes.
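To give a rough idea, here is a minimal sketch of what that looks like with Kiba 2's opt-in StreamingRunner (the class mirrors the NormalizeArticles from the question; the exact field names are assumptions):

# Sketch assuming Kiba 2: opt in to the StreamingRunner, which
# lets a transform's process method yield any number of rows.
extend Kiba::DSLExtensions::Config
config :kiba, runner: Kiba::StreamingRunner

class NormalizeArticles
  def process(row)
    # Emit one output row per article column, then return nil so
    # that no extra row is registered from the return value.
    base = row.reject { |k, _| [:article_1, :article_2].include?(k) }
    yield base.merge(article: row.fetch(:article_1))
    yield base.merge(article: row.fetch(:article_2))
    nil
  end
end

With that in place, the pipeline from the question (source CsvSource, then transform NormalizeArticles) works as written.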

In Kiba as currently released, a transform cannot yet yield more than one row - it's either one or zero.

The Kiba Pro offering I'm building includes a multithreaded runner which happens (as a side-effect rather than an actual goal) to allow transforms to yield an arbitrary number of rows, which is what you are looking for.

But that said, without Kiba Pro, here are a number of techniques which could help.

The first possibility is to split your ETL script in two. Essentially you would cut it at the step where you want to normalize the articles, and put a destination there instead. Then in your second ETL script, you would use a source able to explode one row into many. I think this is what I'd recommend in your case.
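For instance, the second script's source could look roughly like this (a sketch; the class name, file name and article columns are illustrative, not an existing Kiba API):

require 'csv'

# Sketch of an "exploding" source: yields one output row per
# article column found in each physical CSV row.
class ExplodingCsvSource
  def initialize(filename, *article_keys)
    @filename = filename
    @article_keys = article_keys
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |csv_row|
      row = csv_row.to_hash
      base = row.reject { |k, _| @article_keys.include?(k) }
      @article_keys.each do |key|
        yield base.merge(article: row.fetch(key))
      end
    end
  end
end

# In the second ETL script:
# source ExplodingCsvSource, 'normalized.csv', :article_1, :article_2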

If you do that, you can either use a simple Rake task to invoke the ETL scripts in sequence, or alternatively use post_process to invoke the next one if you prefer (I prefer the first approach because it makes it easier to run one or the other on its own).
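The Rake task could be as simple as this (a sketch; the script names are made up):

# Rakefile - run the two ETL scripts in sequence: the first one
# dumps the intermediate CSV, the second one explodes its rows.
task :etl do
  sh "bundle exec kiba dump_normalized.etl"
  sh "bundle exec kiba explode_articles.etl"
end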

Another approach (but too complicated for your current scenario) would be to declare the same source N times, but have each declaration only yield a given subset of the data, e.g.:

# Compute this at parse time: source declarations are evaluated when
# the script is parsed, so a pre_process block would run too late and
# its local variable would not be visible here anyway.
field_count = number_of_exploded_columns # extract from the CSV header?

(0...field_count).each do |shard|
  source MySource, shard: shard, shard_count: field_count
end

then inside MySource you would only conditionally yield, like this:

yield row if row_index % field_count == shard
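Fleshed out, MySource could look roughly like this (a sketch; the CSV handling details are assumptions):

require 'csv'

# Sketch - every shard instance reads the whole file, but only
# yields the rows whose index falls into its shard.
class MySource
  FILENAME = 'RP00119.csv' # hardcoded for brevity in this sketch

  def initialize(shard:, shard_count:)
    @shard = shard
    @shard_count = shard_count
  end

  def each
    CSV.foreach(FILENAME, headers: true, header_converters: :symbol).with_index do |row, row_index|
      yield row.to_hash if row_index % @shard_count == @shard
    end
  end
end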

Those are the 2 patterns I would think of!

I would definitely recommend the first one to get started with, though, as it's easier.

Upvotes: 2
