Reputation: 891
I have a very basic AWS Glue ETL job that I created to select some fields from a Data Catalog table built by a crawler pointing at an RDS database. Once the dataset is returned, I export the results in CSV format. This works; however, the output generates around 20 separate files. The dataset only has two rows in it right now, so only two of the files contain data and the rest just show the column headers with no second row.
My requirement is to have a single CSV file that contains all of the data selected from the dataset. I have tried both the repartition and coalesce functions unsuccessfully: I am able to generate a single file, but my data is missing. I am new to AWS Glue and have been unable to figure this out, so any suggestions would be much appreciated.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node PostgreSQL
PostgreSQL_node1644981751584 = glueContext.create_dynamic_frame.from_catalog(
    database="newApp",
    table_name="database_schema_staging_hdr",
    transformation_ctx="PostgreSQL_node1644981751584",
)
# Script generated for node SQL
SqlQuery0 = """
select * from myDataSource
"""
SQL_node1644981807578 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": PostgreSQL_node1644981751584},
    transformation_ctx="SQL_node1644981807578",
)
# Script generated for node Amazon S3
AmazonS3_node1644981816657 = glueContext.write_dynamic_frame.from_options(
    frame=SQL_node1644981807578,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://awsglueetloutput/", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1644981816657",
)
job.commit()
Upvotes: 1
Views: 3083
Reputation: 11
I had the same issue but I didn't want to lose the Visual ETL tool (once you edit the generated script directly, you can't go back to the visual editor).
You need to add two nodes between the SQL Query and S3 Output:
SQL Query > Custom Transform > Select from Collection > S3 Output.
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the only frame out of the incoming collection and collapse it to one partition
    new_dfc = dfc.select(list(dfc.keys())[0]).repartition(1)
    return DynamicFrameCollection({"CustomTransform0": new_dfc}, glueContext)
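Note: inside a Glue Studio Custom Transform node, glueContext and the incoming DynamicFrameCollection are supplied by the generated wrapper; if you lift this snippet into a standalone script you would also need something like from awsglue.dynamicframe import DynamicFrameCollection (import path assumed from the standard awsglue library layout).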
It generates a DynamicFrameCollection containing your SQL result repartitioned to a single partition via repartition(1). The Select from Collection node is needed because the S3 output expects a DynamicFrame rather than the DynamicFrameCollection produced by the Custom Transform.
With this workflow you will have a single file generated in S3.
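For completeness, the Select from Collection node that Glue Studio generates simply pulls the single frame back out of the collection by key; a minimal sketch of what it looks like (the node variable names and the transformation_ctx value are placeholders, not taken from the job above):
from awsglue.transforms import SelectFromCollection

# Pull the single DynamicFrame back out of the collection returned by MyTransform
SelectFromCollection_node = SelectFromCollection.apply(
    dfc=CustomTransform_node,
    key=list(CustomTransform_node.keys())[0],
    transformation_ctx="SelectFromCollection_node",
)
That DynamicFrame can then be passed as the frame argument to the S3 output node.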
Upvotes: 1
Reputation: 7028
You have to repartition the DynamicFrame to achieve that.
Example to end up with a single file:
SQL_node1644981807578 = SQL_node1644981807578.repartition(1)
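In the script from the question, that call would go right after the SQL node and before the S3 sink, so the write sees a single partition; a sketch reusing the names from the question:
# Collapse the SQL result into one partition before writing
SQL_node1644981807578 = SQL_node1644981807578.repartition(1)

AmazonS3_node1644981816657 = glueContext.write_dynamic_frame.from_options(
    frame=SQL_node1644981807578,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://awsglueetloutput/", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1644981816657",
)
This writes one part file to s3://awsglueetloutput/; Spark still chooses the object name, so an extra rename step would be needed if a specific file name is required.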
Upvotes: 1