DarioB

Reputation: 1609

Join datasets with TFX TensorFlow Transform

I am trying to replicate in TensorFlow Transform some data preprocessing that I originally did in pandas. I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not clear to me how to reproduce the same data manipulation there.

Let's look at the two main operations: JOIN dataset a and dataset b to produce c, then group by col1 on dataset c. This is quite straightforward in pandas, but how would I do it in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?

Upvotes: 0

Views: 150

Answers (2)

Pritam Dodeja

Reputation: 326

In my opinion, TensorFlow Transform is likely not the right tool for this job. It would help to know whether the two datasets are independent streams that arrive at inference time, need to be merged somehow, and also need that merge to happen inside the tensor graph. That last requirement is why I say Transform is likely not the right tool: a group by is not a tensor operation. TensorFlow Transform encodes tensor operations into a tensor graph, much as the fit method in sklearn.preprocessing learns parameters from the data. It is meant for things that require a full pass over the training dataset (e.g. computing and storing the mean, variance, etc.). Hope this clarifies things; let me know if you need further help with this.
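To make the "full pass, then store" idea concrete, here is a minimal plain-Python sketch of the analyze/transform split. It does not use the TensorFlow Transform API; the function names `analyze` and `transform` and the sample values are made up for illustration only.

```python
from statistics import fmean, pstdev


def analyze(values):
    """Full pass over the training data: compute and store the statistics."""
    return {"mean": fmean(values), "std": pstdev(values)}


def transform(value, stats):
    """Per-example step: uses only the stored constants, never other rows."""
    return (value - stats["mean"]) / stats["std"]


# Hypothetical training column.
training_values = [2.0, 4.0, 6.0, 8.0]

stats = analyze(training_values)          # done once, at training time
scaled = [transform(v, stats) for v in training_values]
```

The per-example step only ever sees one value plus the stored constants, which is why it can be frozen into a graph. A join or group by needs to see many rows at once, so it does not fit this shape.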

Upvotes: 0

robertwb

Reputation: 5104

You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would have in pandas. You can then use to_pcollection to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.

For top-level functions (such as merge), you need to import the pandas wrapper:

from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd

and use operations beam_pd.func(...) in place of pd.func(...).
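As a rough sketch of the two operations from the question (join a and b into c, then group by col1), here is the plain pandas version. The column names `key` and `value` and the sample rows are assumptions made for illustration; only `col1` comes from the question. This block runs against ordinary pandas and has not been run against Beam here.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV-backed datasets.
a = pd.DataFrame({"key": [1, 2, 3], "col1": ["x", "y", "x"]})
b = pd.DataFrame({"key": [1, 2, 3], "value": [10.0, 20.0, 30.0]})


def join_and_aggregate(a, b):
    """Join a and b on `key` to produce c, then sum `value` per `col1`."""
    # In a Beam pipeline this call would become beam_pd.merge(...),
    # as described above.
    c = pd.merge(a, b, on="key", how="inner")
    return c.groupby("col1", as_index=False)["value"].sum()


result = join_and_aggregate(a, b)
```

In a Beam pipeline, `a` and `b` would instead be deferred DataFrames (for example, obtained with to_dataframe from apache_beam.dataframe.convert), and the result would be handed on with to_pcollection. Not every pandas operation is supported by Beam DataFrames, so it is worth checking each call you rely on.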

Upvotes: 0
