Reputation: 3242
I would like to test my AWS Glue PySpark job with a small subset of the data available. How can this be achieved?
My first try was to convert the Glue DynamicFrame to a Spark DataFrame and use the take(n) method to limit the number of rows to be processed, as follows:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    transformation_ctx = "ds0")

applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("foo", "string", "bar", "string")],
    transformation_ctx = "am1")

truncated_df = applymapping1.toDF().take(1000)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = DynamicFrame.fromDF(truncated_df, glueContext, "tdf"),
    connection_type = "s3",
    ... )

job.commit()
This fails with the following error message:
AttributeError: 'list' object has no attribute '_jdf'
Any ideas?
Upvotes: 2
Views: 4685
Reputation: 41
df.take(1000) returns a Python list of Row objects, not a DataFrame, so it cannot be converted back to a DynamicFrame. Try using applymapping1.toDF().limit(1000) instead, which returns a DataFrame.
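For example, here is a minimal sketch of the corrected write step, reusing the names from the question; the S3 path and output format are placeholders, not values from the original job:

from awsglue.dynamicframe import DynamicFrame

# limit(1000) returns a DataFrame, unlike take(1000) which returns a list
truncated_df = applymapping1.toDF().limit(1000)

# convert back to a DynamicFrame before handing it to the datasink
truncated_dyf = DynamicFrame.fromDF(truncated_df, glueContext, "tdf")

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = truncated_dyf,
    connection_type = "s3",
    connection_options = {"path": "s3://my-test-bucket/output/"},  # placeholder path
    format = "json")  # placeholder format

job.commit()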
Upvotes: 1
Reputation: 207
Try converting the data separately, then pass the resulting DynamicFrame by name to the datasink, as sketched below.
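Something like this, assuming the same applymapping1 frame as in the question (the remaining sink options are left elided, as in the original post):

from awsglue.dynamicframe import DynamicFrame

# do the truncation and DynamicFrame conversion as a separate, named step
truncated_dyf = DynamicFrame.fromDF(
    applymapping1.toDF().limit(1000), glueContext, "truncated_dyf")

# then reference that DynamicFrame in the datasink
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = truncated_dyf,
    connection_type = "s3",
    ... )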
Upvotes: 2