Reputation: 21

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application. Query performance is fine on a 100M-row dataset.

However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.

I cannot cache all aggregated results, since we allow users to customize the filters on each query (ad-hoc querying).

The requirements for the application are:

queries must return results within 10 s
ad-hoc queries with filters on more than 100 columns
from 1 to 50 clients connected to the application at the same time
dataset growing at a rate of 10M rows/day
typical queries are SELECTs with aggregate functions (COUNT, AVG) and 1 or 2 joins

Is Redshift unsuitable for this use case? What other technologies would you consider for these requirements?

Upvotes: 1

Answers (2)

denismo

Reputation: 800

Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.

What you need to ensure is that your joins match the structure 100%. Look at the explain plans: you should not see any redistribution or broadcasting, and no leader-node activity (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to run.
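
As a rough way to automate that check, the sketch below scans EXPLAIN output for the redistribution labels Redshift prints on join steps. The helper name `find_redistribution` is hypothetical, and the sample plan text and its table names are made up for illustration, not captured from a real cluster.

```python
import re

# Redistribution labels that Redshift prints on join steps in EXPLAIN output.
# DS_DIST_NONE and DS_DIST_ALL_NONE are left out: they mean no data was moved.
COSTLY_LABELS = (
    "DS_BCAST_INNER",
    "DS_DIST_ALL_INNER",
    "DS_DIST_BOTH",
    "DS_DIST_INNER",
    "DS_DIST_OUTER",
)

def find_redistribution(plan_text):
    """Return (line_number, label) for each plan line that moves data between nodes."""
    hits = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        for label in COSTLY_LABELS:
            if re.search(r"\b" + label + r"\b", line):
                hits.append((number, label))
    return hits

# Illustrative plan text only (hypothetical tables "events" and "accounts").
sample_plan = """\
XN HashAggregate
  ->  XN Hash Join DS_BCAST_INNER
        Hash Cond: ("outer".account_id = "inner".account_id)
        ->  XN Seq Scan on events
        ->  XN Hash
              ->  XN Seq Scan on accounts
"""

print(find_redistribution(sample_plan))
```

An empty result means the plan text contains none of the listed labels; it does not by itself prove the query is fast.
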
The requirement to filter/aggregate on 100 arbitrary columns can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. However, these are scalability problems: you can increase the number of nodes to reach the performance you need, though you might be surprised by the cost of the optimal solution.

This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill to disk) while sorting or aggregating (even in a distributed manner), and that can again impact performance.

Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering. And it can store petabytes of data, so it is OK to store more data than strictly necessary.
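
To make the pre-aggregation idea concrete, here is a minimal in-memory sketch in Python (in Redshift this would be a summary table). It assumes hypothetical event rows with made-up column names, and stores a count and a sum per group so that both COUNT and AVG can still be computed after further filtering.

```python
from collections import defaultdict

def build_rollup(rows, dimensions, measure):
    """Group raw rows by the chosen dimensions, keeping a row count and a sum."""
    rollup = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        rollup[key]["count"] += 1
        rollup[key]["sum"] += row[measure]
    return dict(rollup)

def query_rollup(rollup, dimensions, filters):
    """Answer COUNT and AVG for equality filters on the rollup's dimensions."""
    count, total = 0, 0.0
    for key, cell in rollup.items():
        values = dict(zip(dimensions, key))
        if all(values[name] == wanted for name, wanted in filters.items()):
            count += cell["count"]
            total += cell["sum"]
    average = total / count if count else None
    return count, average

# Hypothetical event rows; the column names are invented for illustration.
events = [
    {"day": "2014-01-01", "country": "FR", "plan": "free", "duration": 10.0},
    {"day": "2014-01-01", "country": "FR", "plan": "paid", "duration": 30.0},
    {"day": "2014-01-01", "country": "US", "plan": "free", "duration": 20.0},
    {"day": "2014-01-02", "country": "FR", "plan": "free", "duration": 50.0},
]
dims = ("day", "country")
rollup = build_rollup(events, dims, "duration")
print(query_rollup(rollup, dims, {"country": "FR"}))  # (3, 30.0)
```

Averages are not stored directly because an average of averages is wrong when groups differ in size. A filter on a column outside the rollup's dimensions (here `plan`) cannot be answered from it and has to go to the raw table, which is the limit of this approach for fully ad-hoc filters.
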

So in summary, I don't think your use case is unsuitable based on just the definition you provided. It might require work, and the details depend on the exact usage patterns.

Upvotes: 0

Joe Harris

Reputation: 14045

This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430

I'm cross-posting my answer for others who find this question via Google. :)

In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.

I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be noticeable to users, as the queries will simply queue for a second or two.
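
Before running a real load test, a back-of-the-envelope estimate can show whether queueing is likely to matter. The sketch below assumes a deliberately pessimistic model: all queries arrive at the same instant, each takes the same time, and they run in waves of at most 15. The 2-second query time is an assumption for illustration; the 50 clients and 15 slots come from the question.

```python
import math

def worst_case_latency(concurrent_queries, slots, query_seconds):
    """Seconds until the last query finishes when all arrive at once
    and run in waves of at most `slots` queries."""
    if concurrent_queries <= 0:
        return 0.0
    waves = math.ceil(concurrent_queries / slots)
    return waves * query_seconds

# 50 clients and 15 slots are from the question; 2.0 s per query is assumed.
print(worst_case_latency(50, 15, 2.0))  # 4 waves of 2 s each -> 8.0
```

Under these assumptions the slowest query still finishes inside the 10-second requirement, but the margin disappears if individual queries take much longer than 2 seconds, so measuring on a real cluster remains necessary.
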

If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.

Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.

Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.

Good luck.

Upvotes: 1
