Narfanator

Reputation: 5813

What are the SparkSQL options for com.amazonaws.services.glue.writeDynamicFrame?

In this documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-parquet

it mentions: "any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter."

However, how can I find out what those options are? There's no clear mapping between the Glue code and the underlying SparkSQL code.

(Specifically, I want to figure out how to control the size of the resulting parquet files)

Upvotes: 0

Views: 544

Answers (1)

botchniaque

Reputation: 5124

SparkSQL options for the various data sources can be looked up in the DataFrameWriter documentation (in the Scala or pyspark docs). The parquet data source appears to accept only a compression parameter when writing. For SparkSQL options when reading data, have a look at the DataFrameReader class.
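As a sketch of how such an option is passed through (the S3 path and frame name are hypothetical, and this follows the format_options mechanism described in the linked Glue docs):

```python
# Hedged sketch of a Glue job fragment (not runnable outside a Glue job):
# pass the parquet writer's "compression" option through to the sink.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,                                       # hypothetical DynamicFrame
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/out/"},  # hypothetical path
    format="parquet",
    format_options={"compression": "snappy"},        # maps to the SparkSQL option
)
```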

To control the size of your output files, you should play with parallelism, as @Yuri Bondaruk commented, using for example the coalesce function.
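Since each partition is written as one file, a simple way to aim for a given file size is to estimate a partition count from the total data size and coalesce to it before writing. The helper below is a minimal sketch under that assumption (the 512 MiB default and the Glue calls in the comments are illustrative, not from the original post):

```python
from math import ceil

def target_partitions(total_bytes: int,
                      target_file_bytes: int = 512 * 1024 * 1024) -> int:
    """Estimate how many partitions (and thus output files) to coalesce to,
    given an estimated total dataset size and a desired per-file size."""
    return max(1, ceil(total_bytes / target_file_bytes))

# e.g. ~10 GiB of data aimed at ~512 MiB files -> 20 partitions
n = target_partitions(10 * 1024**3)

# In a Glue job you would then coalesce before writing (sketch only):
#   coalesced = DynamicFrame.fromDF(dyf.toDF().coalesce(n), glueContext, "coalesced")
#   glueContext.write_dynamic_frame.from_options(frame=coalesced, ...)
```

Note that coalesce only reduces the partition count; to increase parallelism you would use repartition instead, at the cost of a shuffle.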

Upvotes: 1
