Reputation: 391
We are designing a big data solution for one of our dashboard applications and are seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as targets, but our downstream services and components will work better with DynamoDB. We are wondering what the best approach is to eventually move the records from Glue to DynamoDB.
Should we write to S3 first and then run Lambdas to insert the data into DynamoDB? Is that the best practice? Or should we use a third-party JDBC wrapper for DynamoDB and have Glue write directly to DynamoDB (not sure if this is possible, and it sounds a bit scary)? Or should we do something else? For concreteness, the first option we have in mind would look roughly like the sketch below (bucket, table, and column names are just placeholders).
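    import csv
    import io
    import boto3

    # Rough sketch of option 1: a Lambda triggered by S3 "object created" events
    # loads each new CSV file into DynamoDB. Bucket, table, and column names are
    # placeholders.
    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("my-target-table")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            # batch_writer groups the puts into BatchWriteItem calls
            with table.batch_writer() as batch:
                for row in csv.DictReader(io.StringIO(body)):
                    batch.put_item(Item=row)  # assumes the CSV columns include the table's key attributes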
Any help is greatly appreciated. Thanks!
Upvotes: 12
Views: 20033
Reputation: 21
Consider that your data is now in tabular format (CSV/Excel) and the data source is S3. Then this is how you can move the data from Glue to DynamoDB.
The majority of the work is done in Glue itself.
Create a crawler in Glue, name the database while creating the crawler, and run the crawler once it is created. (This will create the schema for the data you are giving.) If you have any doubt about creating the crawler, go through this: https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html
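If you prefer to script this step rather than use the console, roughly the same thing can be done with boto3 (the crawler, role, database, and bucket names below are placeholders, not anything from the original setup):

    import boto3

    glue = boto3.client("glue")

    # Create a crawler pointing at the S3 data and run it once;
    # it populates the schema in the named catalog database.
    glue.create_crawler(
        Name="my-s3-crawler",
        Role="AWSGlueServiceRole-demo",  # IAM role with Glue and S3 permissions
        DatabaseName="my_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/input/"}]},
    )
    glue.start_crawler(Name="my-s3-crawler")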
Go to the left pane of AWS Glue and, under the ETL section, click on Jobs.
Click on Create job. Once done, remove the Data target - S3, because we want our data target to be DynamoDB.
Now click on the Data source - S3 bucket and make the changes: add the S3 file location and apply the transform settings based on your needs. Enter the data input and make sure there are no red indications.
Now the answer to your question comes here: go to the script, click on Edit script, and add this call to the existing code.
    glue_context.write_dynamic_frame_from_options(
        frame=<name_of_the_Dataframe>,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "<DynamoDB_Table_Name>",
            "dynamodb.throughput.write.percent": "1.0"
        }
    )
Make sure you have changed the following:
frame=<name_of_the_Dataframe> (name_of_the_Dataframe is generated automatically; check the variable name in the first function of the script)
"dynamodb.output.tableName": "<DynamoDB_Table_Name>" (DynamoDB_Table_Name is the table you created in DynamoDB)
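For context, once the above call is added, a minimal end-to-end script would look roughly like this (the catalog database and table names are placeholders for whatever your crawler created):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args['JOB_NAME'], args)

    # Read the table the crawler created (database and table names are placeholders)
    source_dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db",
        table_name="my_table"
    )

    # Write the DynamicFrame straight to DynamoDB
    glue_context.write_dynamic_frame_from_options(
        frame=source_dyf,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "<DynamoDB_Table_Name>",
            "dynamodb.throughput.write.percent": "1.0"
        }
    )

    job.commit()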
Once all the above steps are done, click on Save, run the script, and refresh the DynamoDB table. This is how you can load the data from Amazon S3 into DynamoDB.
Note: the column/feature names should not start with a capital letter.
Upvotes: 2
Reputation: 119
You can add the following lines to your Glue ETL script:
    from awsglue.dynamicframe import DynamicFrame

    glueContext.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(df, glueContext, "final_df"),
        connection_type="dynamodb",
        connection_options={"tableName": "pceg_ae_test"}
    )
Here df is a Spark DataFrame; DynamicFrame.fromDF converts it to a DynamicFrame before the write.
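If your data starts out as a DynamicFrame read from the Glue catalog and you only need Spark transformations in between, the conversion in both directions looks roughly like this (the catalog database/table and column names are placeholders):

    from awsglue.dynamicframe import DynamicFrame

    # DynamicFrame -> Spark DataFrame -> DynamicFrame
    dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
    df = dyf.toDF()                       # convert to a Spark DataFrame
    df = df.filter(df["amount"] > 0)      # any Spark transformation (column is a placeholder)
    final_dyf = DynamicFrame.fromDF(df, glueContext, "final_df")  # convert back for the DynamoDB sink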
Upvotes: 10
Reputation: 207
I am able to write using boto3... definitely it's not the best approach to load, but it's a working one. :)
    import boto3

    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.Table('BULK_DELIVERY')

    print("Start testing")
    for row in df1.rdd.collect():
        var1 = row.sourceCid
        print(var1)
        table.put_item(Item={'SOURCECID': "{}".format(var1)})
    print("End testing")
Upvotes: 1
Reputation: 191
For this kind of workload, Amazon actually recommends using AWS Data Pipeline.
It bypasses Glue, so it is mostly used to load S3 files into DynamoDB, but it may work for your case.
Upvotes: -1