WJA

Reputation: 7004

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and storing the results in Google BigQuery tables. The structure on GCS looks something like this:

//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv

We can create a dataset with parameters as outlined on this page. This all works fine, and I have been able to import the files properly.

However, in the resulting BigQuery table (the output), I have no way to extract only the rows that match a given parameter value, for instance a particular month.

How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?

Upvotes: 3

Views: 622

Answers (2)

justbeez

Reputation: 1387

While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.

In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.

There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest. But in your scenario, you could easily bring it into a column and use it for any needed matching or dropping.
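For example, once $filepath has been brought into a column, any month-based matching or dropping can also be done downstream in BigQuery. A minimal sketch in Python, assuming the Dataprep output landed in a table `my_project.my_dataset.my_table` with the path stored in a column named `filepath` (both placeholder names, not from the original flow):

```python
# Minimal sketch: filter rows by the month embedded in the Cloud Storage path
# that Dataprep wrote into a (hypothetical) `filepath` column.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project

query = r"""
    SELECT
      *,
      REGEXP_EXTRACT(filepath, r'/(\d{4}-\d{2})-\d{2}-file\.csv$') AS file_month
    FROM `my_project.my_dataset.my_table`
    WHERE REGEXP_EXTRACT(filepath, r'/(\d{4}-\d{2})-\d{2}-file\.csv$') = '2017-01'
"""

for row in client.query(query).result():
    print(row)
```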

NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).

Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148

Upvotes: 2

JSDBroughton

Reputation: 4034

There is currently no access to the data source location or parameter match values within the flow; only the data in the dataset itself is available to you (with the exception of SOURCEROWNUMBER()).

Partial Solution

One method I have been using to mimic parameter insertion into the eventual table is to import multiple datasets, one per parameter value, and then union them before running your transformations into a final table.

For each known parameter value, give its dataset a recipe with a single-row formula that fills a column with that value, and then union the results of each of these.

Obviously, this is only so scalable: it works if you know the set of parameter values that will match, but once you get to the granularity of a timestamp in the source file name, it is no longer feasible.

Image of the flow described. In this example, only the year value is the filtered parameter.
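Outside of the Dataprep UI, the same union-with-a-constant-column idea can be expressed directly in BigQuery and driven from Python. A rough sketch, where the per-month source tables and the destination table are placeholder names rather than anything from the original flow:

```python
# Rough sketch of the "fill a column per dataset, then union" idea, expressed
# as a BigQuery query run from Python. The per-month tables
# (`my_dataset.file_2017_01`, ...) and the destination table are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")

months = ["2017-01", "2017-02", "2017-03"]

# Build one SELECT per month, each tagging its rows with a literal month value.
selects = [
    f"SELECT '{m}' AS month, * FROM `my_project.my_dataset.file_{m.replace('-', '_')}`"
    for m in months
]
query = "\nUNION ALL\n".join(selects)

job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.combined",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(query, job_config=job_config).result()
```

Each SELECT tags its rows with the literal month, so the combined table can be filtered by month afterwards.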


Longer Solution (An aside)

The alternative I eventually settled on was to define the Dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that launched the Dataflow job (not Dataprep) and amended the input AND output parameters via the API. A final BigQuery transformation job then handled the consolidating append.
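A hedged sketch of that orchestration step in Python, launching a Dataflow template that was exported from a Dataprep flow and overriding its input and output locations per run. The template path and the parameter keys used here (inputLocations / outputLocations) are assumptions; check the exported template's metadata for the actual keys your flow expects:

```python
# Hedged sketch of the orchestration step: launch a Dataflow template that was
# exported from a Dataprep flow, overriding input/output locations per run.
# The template path and parameter keys below are assumptions, not confirmed
# values from the original flow.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

project = "my_project"                                   # placeholder
template = "gs://my-bucket/templates/my-dataprep-flow"   # placeholder

body = {
    "jobName": "monthly-import-2017-01",
    "parameters": {
        "inputLocations": '{"location1": "gs://source/user/me/datasets/2017-01/2017-01-31-file.csv"}',
        "outputLocations": '{"location1": "my_project.my_dataset.combined_2017_01"}',
    },
}

response = (
    dataflow.projects()
    .templates()
    .launch(projectId=project, gcsPath=template, body=body)
    .execute()
)
print(response["job"]["id"])
```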

Worth it if the flow is pretty settled, but not for ad hoc work; it all depends on your scale.

Upvotes: 2
