Reputation: 5813
In this documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-parquet
it mentions: "any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter."
However, how can I find out what those options are? There isn't a clear mapping between the Glue code and the SparkSQL code.
(Specifically, I want to figure out how to control the size of the resulting parquet files)
Upvotes: 0
Views: 544
Reputation: 5124
SparkSQL options for the various data sources can be looked up in the DataFrameWriter
documentation (in the Scala or pyspark docs). The parquet data source
seems to accept only a compression
parameter when writing. For SparkSQL options used when reading data, have a look at the DataFrameReader
class.
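As an illustration, here is a minimal sketch of forwarding the compression option from a Glue job. The S3 path and frame variable are placeholders, and note that while the page you quoted mentions connection_options, Glue's parquet examples typically pass format-specific options via the format_options parameter:

    # Hedged sketch: forwarding the SparkSQL "compression" option when
    # writing parquet from a Glue job. The S3 path is a placeholder and
    # dynamic_frame / glueContext are assumed to exist in the job already.
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": "s3://your-bucket/output/"},
        format="parquet",
        format_options={"compression": "snappy"},  # e.g. snappy, gzip
    )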
To control the size of your output files you should play with the parallelism, as @Yuri Bondaruk commented, using for example the coalesce
function (see the sketch below).
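A hedged sketch of that approach, assuming an existing DynamicFrame named dyf and a glueContext; the partition count and S3 path are just placeholders:

    from awsglue.dynamicframe import DynamicFrame

    # Fewer partitions before the write means fewer, larger parquet files.
    # 4 is an arbitrary target; tune it to your data volume.
    df = dyf.toDF().coalesce(4)
    coalesced = DynamicFrame.fromDF(df, glueContext, "coalesced")

    glueContext.write_dynamic_frame.from_options(
        frame=coalesced,
        connection_type="s3",
        connection_options={"path": "s3://your-bucket/output/"},
        format="parquet",
    )

Note that coalesce only merges existing partitions without a shuffle; if you instead need to increase the number of output files, repartition(n) performs a full shuffle.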
Upvotes: 1