Reputation: 25
I have a basic AWS Glue job set up that reads from an S3 bucket with multiple folders:
S3://mybucket/table1
S3://mybucket/table2
S3://mybucket/table3
and so on. All of the files in these folders have the exact same format, and I want them to be inserted into different redshift tables in the same database (table1, table2, table3). There seems to be a way to automatically do this from S3 bucket to S3 bucket, but I can't seem to find documentation on how to do it for S3 to Redshift, is this at all possible?
The code I have at the moment is just the basic Glue template code generated for this job, and partition_0 contains the string representation of the folder name:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "all_data_bucket", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dataField1", "string", "dataField1", "string"), ("partition_0", "string", "partition_0", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "REDSHIFT", connection_options = {"dbtable": "all_data_table", "database": "dev"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
Upvotes: 2
Views: 537
Reputation: 1166
1) crawl the data as three separate tables 2) using boto3 list the tables in that database 3) loop through the list and apply your glue code to load data to redshift
Upvotes: 1