Cameron

Reputation: 11

Importing JSON file into DynamoDB

I am new to AWS, DynamoDB, and Python, so I am struggling to accomplish this task. I am using Amazon Transcribe with video and getting the output as a JSON file, which I then want to store in DynamoDB.

Currently I am using a Lambda function to automate the process when the JSON file is dumped into an S3 bucket. Each time the function runs, I get this error in CloudWatch:

[ERROR] ClientError: An error occurred (ValidationException) when calling the PutItem operation: One or more parameter values were invalid: Missing the key type in the item
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 79, in lambda_handler
    table.put_item(Item=jsonDict) # Adds string of JSON file into the database
  File "/var/runtime/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/var/runtime/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/var/runtime/botocore/client.py", line 320, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 623, in _make_api_call
    raise error_class(parsed_response, operation_name)

Here is my Python code, which attempts to create a DynamoDB table and then parse the JSON file:

import boto3  # import to pull AWS SDK for Python
import json  # import API for Python to work with JSON files
import time  # import time functions
s3_client = boto3.client('s3')  # creates low-level service client to AWS S3
dynamodb = boto3.resource('dynamodb')  # creates resource client to AWS DynamoDB


# When a .json file is added to the linked S3 bucket, S3 generates an event record (JSON) containing
# the information about the S3 bucket and the name of the file that was added to the bucket

def lambda_handler(event, context):
    print(str(event))

    # Print the JSON file created by S3 into CloudWatch Logs when an item is added into the bucket

    bucket = event['Records'][0]['s3']['bucket']['name']

    # Here the name of the S3 bucket is assigned to the variable 'bucket'
    # by grabbing the name from the JSON file created

    json_file_name = event['Records'][0]['s3']['object']['key']

    # Here the name of the file itself is assigned to the variable 'json_file_name'
    # again by grabbing the name of the added file from the JSON file

    tname = json_file_name[:-5]

    # Defines the name of the table being added to DynamoDB using the name of the S3 JSON file
    # *Use of [:-5] strips the last five characters (".json") off the end of the file name

    print(tname)

    # Prints the name of the table into CloudWatch Logs

    json_object = s3_client.get_object(Bucket=bucket,Key=json_file_name)

    # The S3 object is retrieved with the boto3 client using 'bucket' and 'json_file_name'
    # The rest of the script will reference this specific S3 bucket and the JSON file
    # that was added to the bucket

    jsonFileReader = json_object['Body'].read()

    # The jsonFileReader variable reads the body of the JSON file from the S3 object

    jsonDict = json.loads(jsonFileReader)

    # Using the json.loads function, the JSON string is parsed into a Python dictionary

    table = dynamodb.create_table(
        TableName=tname,  # Define table name from the name of the JSON file in S3
        KeySchema=[
            {
                'AttributeName': 'type',  # Primary key
                'KeyType': 'HASH'  # Partition key
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'type',
                'AttributeType': 'S'  # AttributeType 'S' means String ('N' means Number)
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 10000,
            'WriteCapacityUnits': 10000
        }
    )

#    table.meta.client.get_waiter('table_exists').wait(TableName=tname)
    print(str(jsonDict))


    table.meta.client.get_waiter('table_exists').wait(TableName=tname)

    table = dynamodb.Table(tname)  # Specifies table to be used

    table.put_item(Item=jsonDict)  # Adds string of JSON file into the database

I'm not very familiar with parsing nested JSON files and have no experience with DynamoDB. Any assistance to get this functioning would be extremely helpful!

Here is the JSON file I am trying to parse:

{
    "results": {
        "items": [{
            "start_time": "15.6",
            "end_time": "15.95",
            "alternatives": [{
                "confidence": "0.6502",
                "content": "Please"
            }],
            "type": "pronunciation"
        }, {
            "alternatives": [{
                "confidence": null,
                "content": "."
            }],
            "type": "punctuation"
        }, {
            "start_time": "15.95",
            "end_time": "16.2",
            "alternatives": [{
                "confidence": "0.9987",
                "content": "And"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "16.21",
            "end_time": "16.81",
            "alternatives": [{
                "confidence": "0.9555",
                "content": "bottles"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "16.81",
            "end_time": "17.01",
            "alternatives": [{
                "confidence": "0.7179",
                "content": "of"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "17.27",
            "end_time": "17.36",
            "alternatives": [{
                "confidence": "0.6274",
                "content": "rum"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "18.12",
            "end_time": "18.5",
            "alternatives": [{
                "confidence": "0.9977",
                "content": "with"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "18.5",
            "end_time": "19.1",
            "alternatives": [{
                "confidence": "0.3689",
                "content": "tattoos"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "19.11",
            "end_time": "19.59",
            "alternatives": [{
                "confidence": "1.0000",
                "content": "like"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "19.59",
            "end_time": "20.22",
            "alternatives": [{
                "confidence": "0.9920",
                "content": "getting"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "20.22",
            "end_time": "20.42",
            "alternatives": [{
                "confidence": "0.5659",
                "content": "and"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "20.43",
            "end_time": "20.97",
            "alternatives": [{
                "confidence": "0.6694",
                "content": "juggle"
            }],
            "type": "pronunciation"
        }, {
            "start_time": "21.2",
            "end_time": "21.95",
            "alternatives": [{
                "confidence": "0.8893",
                "content": "lashes"
            }],
            "type": "pronunciation"
        }, {
            "alternatives": [{
                "confidence": null,
                "content": "."
            }],
            "type": "punctuation"
        }, {
            "start_time": "21.95",
            "end_time": "22.19",
            "alternatives": [{
                "confidence": "1.0000",
                "content": "And"
            }]
        }]
    }
}

The other issue I have is how to deal with punctuation, since Amazon Transcribe does not assign timestamps to these items.

Any assistance is appreciated, thank you!

Upvotes: 1

Views: 6095

Answers (1)

Binil Jacob

Reputation: 501

The important thing in a database is the key, which should be unique to each data row. In your case [Option 1], if you intend to put each JSON file into a separate table (tname), you will have to provide a set of unique values, which in this case seems to be start_time. Alternatively [Option 2], you can keep all current and future data in a single table and use tname as the key, assuming it is unique to the data.

Option 1

Replace 'AttributeName': 'type' in KeySchema and AttributeDefinitions with 'AttributeName': 'start_time'

##This is a way to batch write
with table.batch_writer() as batch:
    for item in jsonDict["results"]["items"]:   
        batch.put_item(Item=item)
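
Regarding the punctuation items: they have no start_time, which matters under Option 1 because start_time is the partition key, so they cannot be written as-is. A minimal sketch of one way around this, assuming it is acceptable to synthesize a key from the item's position in the list (you could also simply skip punctuation items):

# Sketch only: assumes 'start_time' is the partition key, as in Option 1
with table.batch_writer() as batch:
    for idx, item in enumerate(jsonDict["results"]["items"]):
        if "start_time" not in item:
            # Punctuation items carry no timestamps; synthesize a unique,
            # clearly labelled key from the item's position in the list
            item["start_time"] = "punctuation#{}".format(idx)
        batch.put_item(Item=item)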

Option 2

Here you should not be creating a table every time. Just create it once, and then each entry is added to that table. In the code below the table name is "commonTable".

Replace 'AttributeName': 'type' in KeySchema and AttributeDefinitions with 'AttributeName': 'tname'

table.meta.client.get_waiter('table_exists').wait(TableName="commonTable")
table = dynamodb.Table("commonTable")  # Specifies the shared table to be used
jsonDict['tname'] = tname              # Adds the key attribute 'tname' to the item
table.put_item(Item=jsonDict)
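
Since the table is created only once, the create_table call moves out of the Lambda handler. A minimal one-time setup sketch, assuming the table name "commonTable" from above and placeholder capacity values (adjust both to your needs):

# One-time setup script (run once, not inside the Lambda handler).
# Table name "commonTable" follows the example above; capacity values are placeholders.
import boto3

dynamodb = boto3.resource('dynamodb')

dynamodb.create_table(
    TableName='commonTable',
    KeySchema=[
        {'AttributeName': 'tname', 'KeyType': 'HASH'}  # Partition key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'tname', 'AttributeType': 'S'}  # 'S' means String
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)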

Upvotes: 1
