user152468

Reputation: 3242

AWS Glue Limit Input Size

I would like to test my AWS Glue PySpark job with a small subset of the data available. How can this be achieved?

My first attempt was to convert the Glue DynamicFrame to a Spark DataFrame and use the take(n) method to limit the number of rows to be processed, as follows:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    transformation_ctx = "ds0")

applymapping1 = ApplyMapping.apply(
    frame = datasource0, 
    mappings = [("foo", "string", "bar", "string")],
    transformation_ctx = "am1")

truncated_df = applymapping1.toDF().take(1000)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = DynamicFrame.fromDF(truncated_df, glueContext, "tdf"),
    connection_type = "s3", 
    ... )

job.commit()

This fails with the following error message:

AttributeError: 'list' object has no attribute '_jdf'

Any ideas?

Upvotes: 2

Views: 4685

Answers (2)

K_at_play

Reputation: 41

df.take(1000) returns a Python list rather than a DataFrame, which is why DynamicFrame.fromDF fails on it. Try applymapping1.toDF().limit(1000) instead.
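For context, a minimal sketch of how that could slot into the original job, assuming the usual DynamicFrame import and a placeholder S3 output path:

from awsglue.dynamicframe import DynamicFrame

# limit() returns a DataFrame, unlike take(), which returns a plain Python list
truncated_df = applymapping1.toDF().limit(1000)

# Convert back to a DynamicFrame so it can be passed to the datasink
truncated_dyf = DynamicFrame.fromDF(truncated_df, glueContext, "tdf")

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = truncated_dyf,
    connection_type = "s3",
    connection_options = {"path": "s3://my-bucket/output/"},  # placeholder path
    format = "json")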

Upvotes: 1

Vinay Agarwal

Reputation: 207

Try converting the data separately, then refer to the resulting dynamic frame by name in the datasink, as sketched below.
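A minimal sketch of that approach, assuming the same job setup as in the question (the 1000-row limit, frame name, and S3 path are placeholders):

from awsglue.dynamicframe import DynamicFrame

# Do the conversion and row limiting as separate, named steps
limited_df = applymapping1.toDF().limit(1000)
limited_dyf = DynamicFrame.fromDF(limited_df, glueContext, "limited_dyf")

# Pass the named dynamic frame to the datasink
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = limited_dyf,
    connection_type = "s3",
    connection_options = {"path": "s3://my-bucket/output/"},  # placeholder path
    format = "json")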

Upvotes: 2
