
Reputation: 25

AWS Glue: How to partition an S3 bucket into multiple Redshift tables

I have a basic AWS Glue job set up that reads from an S3 bucket with multiple folders:

s3://mybucket/table1
s3://mybucket/table2
s3://mybucket/table3

and so on. All of the files in these folders have the exact same format, and I want them inserted into different Redshift tables in the same database (table1, table2, table3). There seems to be a way to do this automatically from one S3 bucket to another, but I can't find documentation on how to do it from S3 to Redshift. Is this at all possible?

The code I have at the moment is just the basic Glue template code generated for this job, and partition_0 contains the string representation of the folder name:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "all_data_bucket", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dataField1", "string", "dataField1", "string"), ("partition_0", "string", "partition_0", "string")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "REDSHIFT", connection_options = {"dbtable": "all_data_table", "database": "dev"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

Upvotes: 2

Views: 537

Answers (1)

Emerson

Reputation: 1166

1) Crawl the data as three separate tables.
2) Using boto3, list the tables in that database.
3) Loop through the list and apply your Glue code to load the data into Redshift.
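A minimal sketch of steps 2 and 3, assuming the crawler has registered each folder as its own table in a Glue catalog database (the database name, the `REDSHIFT` connection, and the `"dev"` target database are carried over from the question; everything else is illustrative). The listing helper is pure Python; the loading loop imports `boto3` and `awsglue` lazily because they are only available inside the Glue runtime.

```python
def extract_table_names(get_tables_response):
    """Pull the table names out of one page of a Glue get_tables response."""
    return [t["Name"] for t in get_tables_response["TableList"]]


def load_tables_to_redshift(glue_database, catalog_connection, temp_dir):
    """Copy every catalog table in glue_database into a Redshift table of the
    same name. Sketch only: this runs inside an AWS Glue job, not locally."""
    # boto3/awsglue are only importable in the Glue runtime, so import lazily.
    import boto3
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    glue_client = boto3.client("glue")

    # Step 2: list every table the crawler created in the database.
    table_names = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=glue_database):
        table_names.extend(extract_table_names(page))

    # Step 3: load each catalog table into a Redshift table of the same name.
    for name in table_names:
        frame = glue_context.create_dynamic_frame.from_catalog(
            database=glue_database, table_name=name
        )
        glue_context.write_dynamic_frame.from_jdbc_conf(
            frame=frame,
            catalog_connection=catalog_connection,
            connection_options={"dbtable": name, "database": "dev"},
            redshift_tmp_dir=temp_dir,
        )
```

With this layout you no longer need the `partition_0` column to route rows: each folder becomes its own catalog table, and the table name itself picks the Redshift target.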

Upvotes: 1
