Viktor

Reputation: 3080

How to log "current status" of ETL job?

I'm running a Kiba ETL pipeline in a Rails background job. I'd like to show the user some status while the job is running. What would be the best way to achieve this?

Can I use some variable somehow?

Or should I save a status update to the database after every step (once in the source, once for every transform, once in the destination)? Once for every transform seems like a lot of additional database writes, and it also seems a bit "dirty" to talk to the database from a transform.

Thanks!

Upvotes: 1

Views: 323

Answers (1)

Thibaut Barrère

Reputation: 8873

To implement that type of use-case, you have to incorporate some form of progress tracking in your job.

It could report to a database record (which would model the job; recommended if your imports are fairly heavyweight and you want to be able to search them afterwards), but you can also report to some form of pub-sub system (Redis, Postgres, ActionCable...) if you want something more instant and more lightweight.

A transform is actually a great place to track progress, but this does not mean you have to report at every single row (because it would cause a SQL write at each row, which is usually too much!).

What I recommend is to report the progress only every N rows, using code like this:

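pre_process do
  # Row counter, shared with the transform block below
  @count ||= 0
end

transform do |r|
  @count += 1
  # Only report every 500 rows, to avoid one write per row
  if @count % 500 == 0
    # TODO here: notify the report system
  end
  # Return the row unchanged so the pipeline continues
  r
end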
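
If you would rather keep the counting logic out of the job definition, the same "every N rows" idea can be sketched as a small plain-Ruby helper. This is only an illustration: `ProgressReporter` and its `sink` block are hypothetical names, not part of Kiba, and the in-memory array stands in for whatever report system you pick.

```ruby
# Hypothetical helper (not part of Kiba): calls `sink` once every `every` rows.
class ProgressReporter
  attr_reader :count

  def initialize(every: 500, &sink)
    raise ArgumentError, "a sink block is required" unless sink
    @every = every
    @sink = sink
    @count = 0
  end

  # Call once per row; the sink is only notified on every Nth row.
  def tick
    @count += 1
    @sink.call(@count) if (@count % @every).zero?
    @count
  end
end

# Stand-in sink: collects the reported counts in memory. A real job could
# update a database record or publish to a pub-sub channel here instead.
reported = []
reporter = ProgressReporter.new(every: 500) { |n| reported << n }
1200.times { reporter.tick }
```

Inside the job, the transform block would then just call `reporter.tick` and return the row.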

You will want to think about what happens if an error occurs while you are notifying the report system: maybe you want to halt the whole job, or maybe you want to continue processing rows.
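
One possible way to make that choice explicit is to wrap the notification in a small helper with an error policy. The sketch below is an assumption on my side, not something Kiba provides: `notify_safely`, `on_error:` and `errors:` are hypothetical names.

```ruby
# Hypothetical wrapper: runs the notification block and applies an error policy.
#   on_error: :continue  records the failure and lets the job carry on
#   on_error: :halt      re-raises, so the error stops the whole job
def notify_safely(on_error: :continue, errors: [])
  yield
  true
rescue StandardError => e
  raise if on_error == :halt
  errors << e.message
  false
end

# Example: the report system is down, but the import should keep going.
errors = []
sent = notify_safely(errors: errors) { raise IOError, "report system unavailable" }
```

With `:continue`, a flaky report system only costs you a progress update; with `:halt`, the same failure aborts the import, which may be preferable if the status is something users rely on.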

Also make sure to track the beginning and the end of the job (success/error/completeness), so that you don't end up with stale jobs.
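
As a rough sketch of that start/end tracking, assuming a Hash-like `status` store in place of a real job record (`track_job` is a hypothetical name, and the block stands in for the actual Kiba run):

```ruby
# Hypothetical lifecycle wrapper. `status` is any Hash-like store standing in
# for a job record; a real job might update a database row instead.
def track_job(status)
  status[:state] = "running"
  status[:started_at] = Time.now
  yield
  status[:state] = "succeeded"
rescue StandardError => e
  status[:state] = "failed"
  status[:error] = "#{e.class}: #{e.message}"
  raise # re-raise so the background job framework sees the failure too
ensure
  status[:finished_at] = Time.now
end

status = {}
track_job(status) { :rows_processed } # stand-in for running the ETL job
```

Note that this only covers errors raised inside the block. If the process is killed outright, the record stays in "running", so `started_at` is what lets you detect and clean up stale jobs later.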

It seems a bit "dirty" to talk to the database, but only because we are mixing concerns a bit. If you do it every N rows & make sure not to pollute the main system, it's perfectly fine!

Upvotes: 2
