I am currently contemplating on how to expose data present in Redshift tables in a meaningful and consistent way through REST API. The way I want it to work is that caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about the latency as the queries could range from simple to very complicated ones. Since Redshift requires connecting to the database as a client, Some of the approaches we could have are: Create a lambda function connecting to Redshift, which is invoked through API gateway Using OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box. I am inclining towards OData since it has advance filtering options along with pagination. I am seeking advice, will OData be enough and if yes, how exactly one integrates OData with redshift. Any other advise/approaches are welcome too. Thanks!

Reputation: 1047

Expose Redshift tables through REST API

I am currently contemplating on how to expose data present in Redshift tables in a meaningful and consistent way through REST API.

The way I want it to work is that caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about the latency as the queries could range from simple to very complicated ones. Since Redshift requires connecting to the database as a client, Some of the approaches we could have are:

Create a lambda function connecting to Redshift, which is invoked through API gateway
Using OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box.

I am inclining towards OData since it has advance filtering options along with pagination.

I am seeking advice, will OData be enough and if yes, how exactly one integrates OData with redshift. Any other advise/approaches are welcome too.

Thanks!

Upvotes: 3

Answers (1)

Farid Nouri Neshat

Reputation: 30430

Let me go over the different options:

Redshift data api

Redshift data API let's you invoke queries and get their results in an asynchronous manner.

You can use the API directly from front-end, or you can put it behind API Gateway.

Lambda

If you trust your users and can get a proper authentication you can simply invoke the Lambda directly from front-end and pass it some SQL to run or generate SQL based on the parameters. You can potentially swap this with Athena using federated query. Optionally you can add in API Gateway for some additional features like rate-limiting and different forms of authentication. Keep in mind that both Lambda and API Gateway have limit in terms of data returned and execution time.

For long running queries I would suggest that the Lambda, API Gateway or even from the front-end itself invoke an AWS Glue Python Shell Job which will use an unload query to drop the results in S3. The front-end can pool for when the job is done.

If you have few types of queries then you can make proper rest API.

Instead of Lambda, you can also use Amazon Athena Federated Query, which you can actually query with directly from the front-end.

OData Implementation

There are third party OData implementations for Redshift. Just google it. With a front-end library that consumes OData(I used KendoUI in the past) you can potentially make a working feature rich front-end in days. The main concern with this option is the tools costs may over your budget. Of course the hours you spent making things is also a cost but it really depends on what are your actual requirements.

So how to choose?

Depending on your requirements I would suggest simply going through the option and selecting them based on costs, time to implementation, performance, reliability and security.

How about Redshift performance?

This is the most difficult part about Redshift and on-demand queries. On Redshift you don't have indexes, data can be compressed and the data is stored columnar fashion. All of these can make Redshift slower than your average relational database for a random query.

However you can make sure that your table is sorted with a distribution style that matches your queries and your queries use the columnar storage to their advantage(not all columns are requested), then it can be faster.

Another thing to keep in mind is that Redshift doesn't handle concurrency well, I believe by default there can only 8 concurrent queries, which you can increase it but you definitely wouldn't to go more than 20.

If your users can wait for their queries(I've seen bad queries go over 2 hours. I'm sure you can make them take longer, then Redshift is fine, if not then you could try putting Postgres in front of Redshift by using external tables and then use your average indexes in front of it to speed things up.

Upvotes: 1