Selva

Reputation: 1153

How to load data into Redshift from a custom REST API

I am new to AWS, so please forgive me if this question has been asked previously.

I have a REST API which returns two fields (name, email), and I want to load this data into Redshift.

I thought of making a Lambda function that runs every 2 minutes and calls the REST API. The API might return at most 3-4 records within this 2-minute window.
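Here is a rough sketch of what I have in mind (the endpoint, connection details, and table name below are just placeholders):

    import requests   # HTTP client for the REST API
    import psycopg2   # standard PostgreSQL driver, also works with Redshift

    API_URL = "https://api.example.com/users"  # placeholder endpoint

    def handler(event, context):
        # Fetch the new records; expected shape: [{"name": ..., "email": ...}, ...]
        records = requests.get(API_URL, timeout=10).json()
        if not records:
            return

        # Placeholder connection details
        with psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                              port=5439, dbname="mydb",
                              user="myuser", password="mypassword") as conn:
            with conn.cursor() as cur:
                # Only a handful of rows per run, so one INSERT per record
                cur.executemany(
                    "INSERT INTO users (name, email) VALUES (%s, %s)",
                    [(r["name"], r["email"]) for r in records],
                )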

So, in this situation, is it okay to just do an INSERT operation, or do I still have to use COPY (via S3)? I am only worried about performance and robust, error-free data inserts.

Also, the Lambda function will start asynchronously every 2 minutes, so there might be an overlap of INSERT operations (but there won't be an overlap in the data).

If I go with the S3 option in this situation, I am worried that the S3 file generated by the previous Lambda invocation will be overwritten and a conflict will occur.

Long story short, what is the best practice for inserting a small number of records into Redshift?

PS: I am okay with using other AWS components as well. I even looked into Firehose, which would be perfect for me, but it can't load data into a Redshift cluster in a private subnet.

Thanks in advance.

Upvotes: 2

Views: 2575

Answers (3)

jwagun

Reputation: 1

FYI, AWS now supports the Data API feature.

As described in the official documentation, you can now access Redshift data over HTTP without maintaining a JDBC connection.

The Data API doesn't require a persistent connection to the cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections. Calls to the Data API are asynchronous.

https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html

Here are the steps you need to follow to use the Redshift Data API:

  1. Determine if you, as the caller of the Data API, are authorized. For more information about authorization, see Authorizing access to the Amazon Redshift Data API.

  2. Determine if you plan to call the Data API with authentication credentials from Secrets Manager or temporary credentials. For more information, see Choosing authentication credentials when calling the Amazon Redshift Data API.

  3. Set up a secret if you use Secrets Manager for authentication credentials. For more information, see Storing database credentials in AWS Secrets Manager.

  4. Review the considerations and limitations when calling the Data API. For more information, see Considerations when calling the Amazon Redshift Data API.

  5. Call the Data API from the AWS Command Line Interface (AWS CLI), from your own code, or using the query editor in the Amazon Redshift console. For examples of calling from the AWS CLI, see Calling the Data API with the AWS CLI.
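As an illustration, here is a minimal Python sketch using the boto3 redshift-data client (the cluster identifier, database, secret ARN, and table are placeholder values):

    import boto3

    client = boto3.client("redshift-data")

    # Submit the statement; the Data API is asynchronous, so this returns immediately
    resp = client.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="mydb",                 # placeholder
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",  # placeholder
        Sql="INSERT INTO users (name, email) VALUES (:name, :email)",
        Parameters=[
            {"name": "name", "value": "Alice"},
            {"name": "email", "value": "alice@example.com"},
        ],
    )

    # Because calls are asynchronous, check the statement's status separately
    status = client.describe_statement(Id=resp["Id"])["Status"]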

Upvotes: -1

John Rotenstein

Reputation: 269540

Yes, it would be fine to INSERT small amounts of data.

The recommendation to always load via a COPY command is for large amounts of data because COPY loads are parallelized across multiple nodes. However, for just a few lines, you can use INSERT without feeling guilty.

If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, it is good practice to still VACUUM the table regularly if rows are being deleted.
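For example, a single multi-row INSERT per batch keeps each Lambda run down to one statement. A rough sketch with psycopg2 (the connection details and table are placeholders):

    import psycopg2  # standard PostgreSQL driver, also works with Redshift

    rows = [("Alice", "alice@example.com"), ("Bob", "bob@example.com")]

    # Placeholder connection string
    dsn = ("host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
           "port=5439 dbname=mydb user=myuser password=mypassword")

    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Build one multi-row VALUES clause so the whole batch is a single statement
            values = ",".join(cur.mogrify("(%s,%s)", r).decode() for r in rows)
            cur.execute("INSERT INTO users (name, email) VALUES " + values)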

Upvotes: 3

Deepak Singhal

Reputation: 10874

As you don't have much data, you can use either COPY or INSERT. The COPY command is more optimized for bulk loads; it essentially gives you batch-insert capability.

Both will work equally well in your case.
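If you do go the COPY route, a rough sketch looks like this (bucket, IAM role, and connection string are placeholders). Writing each batch to a unique S3 key also avoids the overwrite concern from the question:

    import json
    import uuid

    import boto3
    import psycopg2

    S3_BUCKET = "my-bucket"  # placeholder bucket
    IAM_ROLE = "arn:aws:iam::123456789012:role/MyRedshiftRole"  # placeholder role
    DSN = ("host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
           "port=5439 dbname=mydb user=myuser password=mypassword")  # placeholder

    def copy_batch(records):
        # Write to a unique key so concurrent invocations never overwrite each other
        key = f"incoming/{uuid.uuid4()}.json"
        body = "\n".join(json.dumps(r) for r in records)
        boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=body)

        # COPY the newly written object into the table
        with psycopg2.connect(DSN) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    f"COPY users (name, email) "
                    f"FROM 's3://{S3_BUCKET}/{key}' "
                    f"IAM_ROLE '{IAM_ROLE}' "
                    f"FORMAT AS JSON 'auto';"
                )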

Upvotes: 0
