Reputation: 24356
I use expectations and Check to determine whether a column of decimal type can be transformed into int or long type. A column can be safely transformed if it contains integers, or decimals whose decimal part contains only zeros. I check this with the regex function rlike, as I couldn't find any other method using expectations.
The question is: can I do such a check for all columns of decimal type without explicitly listing the column names? df.columns is not yet available, as we are not yet inside my_compute_function.
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


@transform_df(
    Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
    inp=Input(
        "ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950",
        checks=Check(
            E.col('DSK').rlike(r'^(\d+(\.0+)?|0E-10)$'),
            'Decimal col DSK can be converted to int/long.',
            on_error='WARN',
        ),
    ),
)
def my_compute_function(inp):
    return inp
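For reference, here is a minimal PySpark sketch (independent of Foundry, with made-up sample values) showing what the pattern is meant to accept:

# Sanity-check the pattern against made-up sample values, outside Foundry.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sample = spark.createDataFrame(
    [("42",), ("42.000",), ("0E-10",), ("42.5",)], ["DSK"]
)
# "42", "42.000" and "0E-10" match; "42.5" does not.
sample.withColumn("castable", F.col("DSK").rlike(r"^(\d+(\.0+)?|0E-10)$")).show()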
Upvotes: 0
Views: 520
Reputation: 11
You are right that df.columns is not available before my_compute_function's scope is entered. There is also no way to add expectations at runtime, so with this method hard-coding the column names and generating the expectations from them is necessary.
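If you do hard-code the names, the repetition can at least be kept to a single list. A minimal sketch, assuming Input's checks parameter also accepts a list of Check objects (the column list is a placeholder, the regex is taken from your example):

# Generate one Check per hard-coded decimal column.
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

# Placeholder list of decimal columns to validate.
DECIMAL_COLS = ['DSK', 'OTHER_DECIMAL_COL']

decimal_checks = [
    Check(
        E.col(name).rlike(r'^(\d+(\.0+)?|0E-10)$'),
        'Decimal col {} can be converted to int/long.'.format(name),
        on_error='WARN',
    )
    for name in DECIMAL_COLS
]


@transform_df(
    Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
    inp=Input(
        "ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950",
        checks=decimal_checks,
    ),
)
def my_compute_function(inp):
    return inp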
To touch on the first part of your question: as an alternative approach, you could attempt the decimal -> int/long conversion in an upstream transform, store the result in a separate column, and then use E.col('col_a').equals_col('converted_col_a').
This way you simplify your Expectation condition while also implicitly handling the cases in which the conversion would under/overflow, since DecimalType can hold arbitrarily large/small values (https://spark.apache.org/docs/latest/sql-ref-datatypes.html).
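A minimal sketch of that approach (the output dataset RIDs and the converted column name are placeholders I made up):

import pyspark.sql.functions as F
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


# Upstream transform: attempt the conversion and keep it in a separate column.
@transform_df(
    Output("ri.foundry.main.dataset.<converted-rid>"),
    source=Input("ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950"),
)
def convert_decimals(source):
    # In the default (non-ANSI) mode, overflowing values become null here,
    # so the equality check below fails for them as well.
    return source.withColumn('converted_DSK', F.col('DSK').cast('long'))


# Downstream transform: the expectation compares the original column with
# the converted one.
@transform_df(
    Output("ri.foundry.main.dataset.<checked-rid>"),
    inp=Input(
        "ri.foundry.main.dataset.<converted-rid>",
        checks=Check(
            E.col('DSK').equals_col('converted_DSK'),
            'Decimal col DSK can be converted to int/long.',
            on_error='WARN',
        ),
    ),
)
def my_compute_function(inp):
    return inp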
Upvotes: 1