Reputation: 1245
Can someone explain why we need the transform and transform_df methods separately?
Upvotes: 11
Views: 2037
Reputation: 140
One addition to the answer of @Adil B.: @transform_df can handle only one output, whereas @transform can handle multiple, but you are in charge of writing each output yourself:
from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("some_foundry_id"),
    input_dataset=Input("another_foundry_id"),
)
def compute(input_dataset: DataFrame) -> DataFrame:
    # The decorator injects the input as a DataFrame and
    # writes the returned DataFrame to the output for you.
    return input_dataset
The DataFrame you return here is saved to the output dataset by Foundry.
from transforms.api import transform, Input, Output, TransformInput, TransformOutput

@transform(
    input_1=Input("..."),
    output_1=Output("..."),
    output_2=Output("..."),
)
def compute(input_1: TransformInput, output_1: TransformOutput, output_2: TransformOutput) -> None:
    # With @transform you receive TransformInput/TransformOutput objects,
    # and each output must be written explicitly.
    output_1.write_dataframe(input_1.dataframe())
    output_2.write_dataframe(input_1.dataframe())
Upvotes: 3
Reputation: 16856
There's a small difference between the @transform and @transform_df decorators in Code Repositories:

- @transform_df operates exclusively on DataFrame objects.
- @transform operates on transforms.api.TransformInput and transforms.api.TransformOutput objects rather than DataFrames.

If your data transformation depends exclusively on DataFrame objects, you can use the @transform_df() decorator. This decorator injects DataFrame objects into your compute function and expects it to return a DataFrame. Alternatively, you can use the more general @transform() decorator and explicitly call the dataframe() method to access a DataFrame containing your input dataset, as in the sketch below.
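A minimal sketch of that second approach, assuming hypothetical dataset paths (the decorator and the dataframe()/write_dataframe() calls are the real transforms.api API; the paths and parameter names are placeholders):

from transforms.api import transform, Input, Output

@transform(
    output_dataset=Output("/hypothetical/output/path"),  # placeholder path
    input_dataset=Input("/hypothetical/input/path"),     # placeholder path
)
def compute(input_dataset, output_dataset):
    # input_dataset is a TransformInput; dataframe() extracts a DataFrame
    df = input_dataset.dataframe()
    # output_dataset is a TransformOutput; the write must be explicit
    output_dataset.write_dataframe(df)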
Upvotes: 10