Dataprep importing files with different number of columns into a dataset

Question

I am trying to create a parameterized dataset that imports files from GCS and stacks them on top of each other (a union). The import itself works fine (Import Data > Parameterize).

To give a bit of context: each day I store a .csv file whose name refers to that date.

Last month my provider added a new column to the files. As a result, files from before that date have 8 columns, whereas files from that date onward have 9.

However, when I parameterize, Dataprep only takes the matching columns into account (so 8 columns only). Ideally, rows coming from files that did not have the new column would get empty values in it.

How can this be achieved?

Hugues · Accepted Answer

Datasets with parameters only work with a fixed schema, as mentioned in the documentation:

Avoid creating datasets with parameters where individual files or tables have differing schemas.

This fixed schema is generated from one of the files found when the dataset with parameters is created.

If the schema has changed, then you can "refresh" it by editing the dataset with parameters and clicking save. If all the matching files contain 9 columns, you should now see 9 columns in the transformer.
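
If the older files still have only 8 columns, one possible workaround is to pad them to the new schema before Dataprep reads them. This is not a Dataprep feature and not something the documentation prescribes. It is a minimal sketch using only Python's standard csv module, assuming the files have a header row and that you know the full 9-column header. The function name pad_csv_columns and the column names in the usage note are hypothetical.

```python
import csv
import io
from typing import Sequence


def pad_csv_columns(text: str, header: Sequence[str]) -> str:
    """Rewrite CSV text so every row follows `header`.

    Columns missing from the input are filled with an empty string.
    Hypothetical helper for illustration; not part of Dataprep.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(header),
        restval="",             # value used for columns the old file lacks
        extrasaction="raise",   # fail loudly on columns not in `header`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, pad_csv_columns("a,b\n1,2\n", ["a", "b", "c"]) yields a file with header a,b,c and an empty value in column c for the existing row. You would run something like this over the older files, upload the results back to GCS, and then refresh the dataset with parameters as described above.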
