Brett

Reputation: 12007

Cloud Architecture Stack Opinions - EC2 versus Azure

I have read many blogs and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suit my particular case.

I have a data set - which can be thought of as a standard table of the format:

[id]  [name]  [d0]  [d1]  [d2] .. [d63]
---------------------------------------
0     Name1   0.43 -0.22  0.11   -0.81
1     Name2   0.23  0.65  0.62    0.41
2     Name3  -0.13 -0.23  0.17    0.00
...
N     NameN   0.43 -0.23  0.12    0.01

I ultimately want to do something that (despite my final chosen stack) would equate to an SQL SELECT statement similar to:

SELECT name FROM [table] WHERE (d0*QueryParameter0) + (d1*QueryParameter1) + (d2*QueryParameter2) + ... + (d63*QueryParameter63) < 0.5

where QueryParameter0 through QueryParameter63 are parameters supplied at runtime that change each time the query is run (so caching is out of the question).
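
To make the computation concrete, here is roughly what the query works out to in plain Python/NumPy, assuming the table were held in memory (the rows are the sample rows above, truncated to four of the 64 d-columns; the weight values are made up):

import numpy as np

# The sample rows above, truncated to four of the 64 d-columns for brevity.
names = np.array(["Name1", "Name2", "Name3"])
d = np.array([
    [ 0.43, -0.22, 0.11, -0.81],
    [ 0.23,  0.65, 0.62,  0.41],
    [-0.13, -0.23, 0.17,  0.00],
])

# The runtime parameters (QueryParameter0..QueryParameter63); values made up.
weights = np.array([0.1, -0.5, 0.3, 0.2])

# Equivalent of the WHERE clause: one dot product per row, then a threshold.
scores = d @ weights
print(names[scores < 0.5])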

My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.

I can do this a number of ways, and I'd like to find the best stack combination to optimize for my specific need (outlined by the pseudo-SQL query above).

Does anyone have any experience with this? Which stack option would result in the fastest query containing many math operations in the WHERE clause?

Cheers, Brett

Upvotes: 2

Views: 490

Answers (4)

Stephen Chung

Reputation: 14605

Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.

In other words, what you need is NOT a SQL database, but a "NoSQL" store that optimizes table/row access for the fastest possible scan speed. So you shouldn't have to try SQL Azure and MySQL to find out this part of the answer.

Also, each row in your type of query is completely independent of the others, so the problem lends itself to simple parallelism. Your choice of platform should be whichever gives you:

  1. Table/row scans at the fastest possible speed
  2. The ability to highly parallelize your operation

Each platform you mentioned gives you the ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process the data in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
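
To illustrate how naturally this parallelizes, here is a rough, platform-neutral sketch in Python (the chunking scheme and the (name, [d0..d63]) row layout are assumptions, not anything specific to Azure or EC2):

from multiprocessing import Pool

THRESHOLD = 0.5

def scan_chunk(args):
    # Each row is (name, [d0..d63]); weights are the runtime query parameters.
    rows, weights = args
    return [name for name, ds in rows
            if sum(d * w for d, w in zip(ds, weights)) < THRESHOLD]

def parallel_query(rows, weights, workers=4):
    # Split the table into one chunk per worker - the rows are independent,
    # so any split works.
    chunk_size = max(1, len(rows) // workers)
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with Pool(workers) as pool:
        results = pool.map(scan_chunk, [(chunk, weights) for chunk in chunks])
    # Collate the per-chunk results (this is where a queue comes in on a
    # real multi-instance deployment).
    return [name for chunk_result in results for name in chunk_result]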

My personal recommendation is Azure, since you can:

  1. Store massive amounts of data in "table storage", optimized for fast scan retrieval, and partitioned (e.g. over d0 ranges) for optimal parallelism
  2. Dynamically "spin up" as many compute instances as you like to process the data in parallel
  3. Use queueing mechanisms to synchronize the collation of results

Azure does what you require in a very "no-frills" way -- providing just enough infrastructure for you to do your job, and nothing more.

Upvotes: 2

Stuart

Reputation: 66882

Assuming that QueryParameter0, QueryParameter1, ..., QueryParameterN are all supplied at runtime and are different each time, I don't think any of the platforms will be able to provide a significant advantage over the others - since none of them will be able to take advantage of any pre-computed indices.

With indices out of the picture, the only other factor affecting speed is the processing power available - you already know about this for the SQL Azure option, and for the other options it pretty much comes down to what processing you decide to apply - it's up to you to fetch all the data and then process it.

One option you might consider is whether you could host this data yourself on an instance (e.g. using an Azure blob or cloud drive) and then process it in a custom-built worker role. This isn't something I'd consider for general data storage, but if it's just this one table and this one query, it would be pretty easy to hand-craft a quick solution.
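
As a very rough sketch of that hand-crafted approach, assuming the table were dumped to a CSV file that the worker instance downloads to local disk (the file name and column layout here are made up):

import csv

def query_csv(path, weights, threshold=0.5):
    # Stream the table (id, name, d0..d63) row by row so the whole file
    # never has to fit in memory.
    matches = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            name = row[1]
            ds = [float(x) for x in row[2:]]
            if sum(d * w for d, w in zip(ds, weights)) < threshold:
                matches.append(name)
    return matches

# e.g. after the worker instance has downloaded the blob to local disk:
# names = query_csv("table.csv", runtime_weights)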


Update - just seen the answer from @Cade too - +1 for his suggestion of parallelization.

Upvotes: 0

Cade Roux

Reputation: 89651

The problem is not the math operators or the number of them - the problem is that they are parameterized. You are effectively computing a weighted average across the columns with the weights defined at run-time, so the result must be computed for every row and cannot be inferred in advance.

Even in SQL Server, this operation can be parallelized (and this should show up in the execution plan), but it is not amenable to search optimization using indexes, which is where most relational databases really shine. With static weights, an indexed computed column would obviously perform very quickly.

Because this problem is easily parallelized, you might want to look at something based on a Map-Reduce principle.
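
As a toy illustration of the Map-Reduce decomposition (single-process Python; the partition layout is an assumption, and in practice the map step would run on separate nodes):

from functools import reduce

def map_partition(partition, weights, threshold=0.5):
    # Map step: each partition independently emits the names that pass the filter.
    return [name for name, ds in partition
            if sum(d * w for d, w in zip(ds, weights)) < threshold]

def reduce_results(left, right):
    # Reduce step: collating the results is just list concatenation.
    return left + right

def mapreduce_query(partitions, weights):
    mapped = [map_partition(p, weights) for p in partitions]  # would run on many nodes
    return reduce(reduce_results, mapped, [])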

Upvotes: 1

Oliver Weichhold

Reputation: 10296

Currently neither SQL Azure nor Amazon RDS can scale horizontally (EC2 can at least scale vertically), but if - and only if - your data can be partitioned in a way that still makes it possible to execute your query, the upcoming SQL Federations feature of SQL Azure might be worth looking at and could help you make an informed decision.

MongoDB (which I like a lot) is geared more toward document-oriented workloads and is possibly not the best solution for this type of job, although your mileage may vary (it's blazingly fast as long as most of your working set fits into memory).

Upvotes: 0
