Reputation:
I will be building a small Excel add-in that lets users perform ad-hoc reporting from Excel.
The obvious way would be to run SQL statements against an ODBC data source, but I'm afraid it would not scale well against a huge table (i.e. doing a SUM or a COUNT against a table containing 50 GB of data will lead to performance issues).
I've been thinking about using some sort of caching technique, basically querying data held in RAM: http://www.yellowfinbi.com/wp_inmemory.pdf This can be costly in RAM, and I'm not sure yet how difficult it is to implement.
My constraints are as follows:
I should be able to query any type of data source (SQL Server, MySQL) from Excel
The ad-hoc reporting will be performed from Excel (via my add-in)
The query results should be sent back to Excel cells as fast as possible
I don't want to build a data warehouse or OLAP cubes, but query the data source directly
Any suggestions on algorithms or implementation techniques (in-RAM reporting?) for the most optimized way to perform ad-hoc querying from Excel (besides optimized SQL code)?
Upvotes: 1
Views: 2558
Reputation: 38384
"(i.e. doing a SUM or a COUNT against a table containing 50 GB of data will lead to performance issues)"
This is exactly why you probably don't want to cache the data at the client. Unless each user has their own small supercomputer with 64 GB of RAM, you need to reduce the number of rows coming across the wire.
The fact is, most users find a large report pretty useless. Our brains can only keep about a dozen values in short-term memory, so a user isn't going to glean anything useful from a huge amount of data.
They need to perform analysis, and usually analysis involves aggregated or filtered data.
Two options you can use, either alone or in combination:
1) Implement views that pre-aggregate the data. Many DB engines have something like materialized views (or indexed views with the NOEXPAND hint in SQL Server) that basically let you pre-aggregate reports. If you are trying to avoid this and let your add-in handle the aggregation, then you are basically building an OLAP engine, and should look at algorithms used by OLAP systems (this is exactly what PowerPivot already is).
The idea is you let the DB engine do what it's good at: reduce the number of rows to an aggregated amount that's closer to what the client needs. This ensures that you are not sending a huge amount of data over the network and requiring the client to process that data and cache it in RAM.
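A minimal sketch of that idea, using Python's stdlib `sqlite3` as a stand-in for whatever ODBC source you target (the table and column names here are invented for illustration):

```python
import sqlite3

# Stand-in for a real data source; "sales", "region" and "amount" are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 20.0), ("US", 5.0)])

# What to avoid: pull every raw row to the client and aggregate there.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_total = sum(amount for region, amount in rows if region == "EU")

# What the answer recommends: push SUM/COUNT into the engine so a
# 50 GB table comes back as one tiny result set.
(db_total,) = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("EU",)
).fetchone()

assert client_total == db_total == 30.0
```

The results agree, but only the second form keeps the data transfer proportional to the report size rather than the table size.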
2) Pass filter criteria via the WHERE clause or stored-procedure parameters to reduce the rows returned to only those absolutely needed. There is room to get really creative here, more than most people realize. For example, maybe your table contains data the user shouldn't have access to, so it makes sense to join with the authorization table (or whatever mechanism you use for access control) and filter out any data they do not have access to.
Require the user to choose filters to filter the data down to a reasonable amount.
It is a common occurrence that a user will want more data than they can swallow.
Upvotes: 1
Reputation: 38384
PowerPivot is one option: a free add-in for Excel that is optimized for aggregating (summing/counting) data in exactly this sort of way.
It does require sufficient RAM to cache the data from the server.
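This is not PowerPivot's API, just a sketch of the caching pattern it embodies: copy a slice of server data into a local in-memory store once, then answer repeated ad-hoc aggregations from RAM (all names here are invented):

```python
import sqlite3

def load_cache(server_rows):
    """Copy rows fetched from the server into an in-memory store once."""
    cache = sqlite3.connect(":memory:")
    cache.execute("CREATE TABLE facts (product TEXT, qty INTEGER)")
    cache.executemany("INSERT INTO facts VALUES (?, ?)", server_rows)
    return cache

# One round trip to the server, then every subsequent ad-hoc
# SUM/COUNT is answered locally without touching the network.
cache = load_cache([("a", 3), ("a", 4), ("b", 1)])
(count,) = cache.execute(
    "SELECT COUNT(*) FROM facts WHERE product = 'a'").fetchone()
(total,) = cache.execute("SELECT SUM(qty) FROM facts").fetchone()

assert (count, total) == (2, 8)
```

The trade-off is the one noted above: the whole cached slice must fit in the client's RAM, which is why the other answer recommends aggregating server-side for very large tables.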
Upvotes: 2