Reputation: 12007
I have read many blogs and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suit my particular case.
I have a data set - which can be thought of as a standard table of the format:
[id] [name] [d0]  [d1]  [d2]  ..  [d63]
---------------------------------------
0    Name1   0.43 -0.22  0.11 ..  -0.81
1    Name2   0.23  0.65  0.62 ..   0.41
2    Name3  -0.13 -0.23  0.17 ..   0.00
...
N    NameN   0.43 -0.23  0.12 ..   0.01
I ultimately want to do something that (despite my final chosen stack) would equate to an SQL SELECT
statement similar to:
SELECT name FROM [table] WHERE (d0*QueryParameter0) + (d1*QueryParameter1) + (d2*QueryParameter2) + ... + (dN*QueryParameterN) < 0.5
where QueryParameter0, QueryParameter1, ..., QueryParameterN are parameters supplied at runtime, and change each time the query is run (so caching is out of the question).
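To be concrete, here is a rough sketch of how the query would be built and parameterized at runtime (Python with the built-in sqlite3 module, used only as a stand-in for whichever engine I end up choosing; the table name "samples" is made up):

# Rough sketch: build the weighted-sum predicate and bind the 64 runtime
# parameters. sqlite3 is only a placeholder for the eventual engine.
import sqlite3

weights = [0.1] * 64  # QueryParameter0 .. QueryParameter63, different on every run

conn = sqlite3.connect("data.db")
predicate = " + ".join(f"(d{i} * ?)" for i in range(64))
sql = f"SELECT name FROM samples WHERE {predicate} < 0.5"

names = [row[0] for row in conn.execute(sql, weights)]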
My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.
I can do this a number of ways, MySQL being one option. Again, this is probably dependent on the type of query. I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo-SQL query above).
Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE
clause?
Cheers, Brett
Upvotes: 2
Views: 490
Reputation: 14605
Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.
In other words, what you need is NOT a SQL database, but a "NoSQL" store that optimizes raw table/row access for the fastest possible scan speed. So you shouldn't have to try out both SQL Azure and MySQL to settle this part of the answer.
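To make that concrete, whatever store you pick is ultimately going to execute the equivalent of this linear scan (a Python sketch; the row layout is taken from your table above):

# What any engine has to do when the weights arrive at runtime: touch every
# row and compute the weighted sum -- no index can pre-answer this predicate.
def scan(rows, weights, threshold=0.5):
    # rows: iterable of (id, name, d0, ..., d63); weights: the 64 runtime parameters
    for row in rows:
        name, values = row[1], row[2:]
        if sum(v * w for v, w in zip(values, weights)) < threshold:
            yield name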
Also, each row in your type of query is completely independent from the others, so the problem lends itself to simple parallelism. Your choice of platform should be whichever gives you the fastest raw scan over the data and the easiest way to process it in parallel.
Each platform you mentioned gives you the ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process it in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
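As a rough local illustration of that parallelism (Python's multiprocessing standing in for the multiple instances; on any of the three clouds each "worker" would instead be a separate instance scanning its own partition of the table):

# Sketch: the same scan split across workers; each worker handles one
# partition of the table and the results are concatenated at the end.
from functools import partial
from multiprocessing import Pool

def scan_partition(rows, weights, threshold=0.5):
    return [r[1] for r in rows
            if sum(v * w for v, w in zip(r[2:], weights)) < threshold]

def parallel_scan(partitions, weights, processes=4):
    with Pool(processes) as pool:
        chunks = pool.map(partial(scan_partition, weights=weights), partitions)
    return [name for chunk in chunks for name in chunk]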
My personal recommendation is Azure, since it does what you require in a very "no-frills" way -- providing just enough infrastructure for you to do your job, and nothing more.
Upvotes: 2
Reputation: 66882
Assuming that the QueryParameter0, QueryParameter1, ..., QueryParameterN are all supplied at runtime and are different each time, then I don't think that any of the platforms will be able to provide significant advantages over any of the others - since none of them will be able to take advantage of any pre-computed indices.
With indices out of the picture, the only other factor for speed comes from the processing power available - you already know about this for the SQL Azure option, and for the other options it pretty much comes down to what processing you decide to apply - it's up to you to fetch all the data and then process it.
One option you might consider is whether you could host this data yourself on an instance (e.g. using an Azure blob or cloud drive) and then process it in a custom-built worker role. This isn't something I'd think about for general data storage, but if it's just this one table and this one query then it would be pretty easy to hand-craft a quick solution.
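A hand-crafted version of that idea really can be very small - something along these lines (a Python/NumPy sketch, assuming the whole table fits in memory and lives in a made-up table.csv; a worker role would do the same thing in whatever language you prefer):

# Sketch of a hand-rolled worker: load the table into memory once, then
# answer each query with a single matrix-vector product.
import numpy as np

names = np.loadtxt("table.csv", delimiter=",", usecols=(1,), dtype=str)
data = np.loadtxt("table.csv", delimiter=",", usecols=range(2, 66))  # d0 .. d63

def query(weights, threshold=0.5):
    # weights: the 64 runtime parameters
    return names[data @ np.asarray(weights) < threshold]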
Update - just seen the answer from @Cade too - +1 for his suggestion of parallelization.
Upvotes: 0
Reputation: 89651
The problem is not the math operators or the number thereof, the problem is that they are parameterized - you are effectively doing a weighted sum across the columns with the weights being defined at run-time, so the result must be computed at query time and cannot be derived from a pre-built index.
Even in SQL Server, this operation can be parallelized (and this should show up on the execution plan), but it is not amenable to search optimization using indexes, which is where most relational databases really shine. With static weights, an indexed computed column would obviously perform very quickly.
Because this problem is easily parallelized, you might want to look at something based on a Map-Reduce principle.
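Just to show the shape of that, a minimal map/reduce-style sketch in plain Python (the chunking of the table into pieces is assumed to happen upstream):

# Map: each chunk independently emits the names whose weighted sum passes
# the filter. Reduce: concatenate the per-chunk results.
from functools import reduce

def map_chunk(chunk, weights, threshold=0.5):
    return [name for (_id, name, *ds) in chunk
            if sum(d * w for d, w in zip(ds, weights)) < threshold]

def run(chunks, weights):
    return reduce(lambda a, b: a + b, (map_chunk(c, weights) for c in chunks), [])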
Upvotes: 1
Reputation: 10296
Currently neither SQL Azure nor Amazon RDS can scale horizontally (EC2 can at least scale vertically), but if, and only if, your data can be partitioned in a way that still makes it possible to execute your query, the upcoming SQL Federations feature of SQL Azure might be worth looking at and could help you make an informed decision.
MongoDB (which I like a lot) is geared more toward document-oriented workloads and is possibly not the best solution for this type of job, although your mileage may vary (it's blazingly fast as long as most of your working set fits into memory).
Upvotes: 0