Jakub Arnold

Reputation: 87210

What's a good way to structure a 100M record table for fast ad-hoc queries?

The scenario is quite simple: there are about 100M records in a table with 10 columns (a kind of analytics data), and I need to be able to run queries on any combination of those 10 columns. For example, something like this (table and column names made up for illustration):
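    -- Illustrative names only
    SELECT COUNT(*)
    FROM analytics_events
    WHERE country = 'US'
      AND browser = 'Chrome'
      AND created_at BETWEEN '2012-01-01' AND '2012-02-01';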

Basically all of the queries are going to be of the form "how many records with attributes X are there in time interval Y", where X can be any combination of those 10 columns.

The data will keep coming in; it is not just a fixed set of 100M records, but keeps growing over time.

Since the column selection can be completely random, creating indexes for popular combinations is most likely not possible.

The question has two parts:

Upvotes: 7

Views: 1348

Answers (6)

Stagg

Reputation: 2855

If you can't create an OLAP cube from the data, could you instead create a summary table based on the unique combinations of X and Y? If the time period Y is rolled up to a coarse enough granularity, your summary table could be reasonably small. Obviously it depends on the data.
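As a rough sketch (table and column names are made up here), the summary table could look something like this:

    -- Illustrative only: one row per day per combination of attribute columns
    CREATE TABLE analytics_daily_summary AS
    SELECT CAST(created_at AS DATE) AS day,
           country,
           browser,
           device_type,
           COUNT(*) AS record_count
    FROM analytics_events
    GROUP BY CAST(created_at AS DATE), country, browser, device_type;

A query on any subset of those columns then becomes a SUM over record_count:

    SELECT SUM(record_count)
    FROM analytics_daily_summary
    WHERE browser = 'Chrome'
      AND day BETWEEN DATE '2012-01-01' AND DATE '2012-01-31';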

Also, you should capture the queries that users actually run. It's generally the case that users say they want every possible combination when, in practice, that rarely happens and most users' queries can be satisfied from pre-calculated results. The summary table would again be an option here; you'll get some data latency with it, but it could work.

Other options, if possible, would be to look at hardware. I've had good results in the past using solid state storage such as Fusion-io cards, which can reduce query times massively. This is not a replacement for good design, but with good design and the right hardware it works well.

Upvotes: 0

tom_b

Reputation: 73

In addition to the above suggestions, consider just querying a periodically refreshed materialized view. I think I would just create a SELECT <columns>, COUNT(*) ... GROUP BY CUBE (<columns>) materialized view on the table.
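Something along these lines - an Oracle-flavoured sketch with made-up names; adjust to your specific RDBMS:

    -- Sketch only: pre-aggregated counts over every combination of three columns
    CREATE MATERIALIZED VIEW analytics_cube_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT TRUNC(created_at) AS day,
           country,
           browser,
           COUNT(*)          AS record_count
    FROM analytics_events
    GROUP BY CUBE (TRUNC(created_at), country, browser);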

This will give you a full cube to work with. Play around with this on a small test table to get a feel for how the cube rollups work. Check out Joe Celko's books for examples, or just hit your specific RDBMS documentation.

You are a little stuck if you have to always be able to query the most up-to-the-microsecond data in your table. But if you can relax that requirement, you'll find a materialized view cube a pretty decent choice.

Are you absolutely certain that your users will hit all 10 columns in a uniform way? I have dinged myself with premature optimization in the past for this type of situation, only to find that users really used one or two columns for most of their reports and that rolling up to those one or two columns was 'good enough'.

Upvotes: 0

Richard B

Reputation: 1

To get these queries to run fast using SQL solutions, use these rules of thumb. There are lots of caveats with this, though, and the actual SQL engine you are using will be very relevant to the solution.

I am assuming that your data is integers, dates or short scalars; long strings etc. change the game. I'm also assuming you are only using fixed comparisons (=, <, >, <>, etc.).

a) If time interval Y will be present in every query, make sure it is indexed (see the sketch after this list), unless the Y predicate selects a large percentage of rows. Ensure rows are stored in "Y" order so that they get packed on the disk next to each other; this will happen naturally over time for new data anyway. If the Y predicate is very tight (i.e. a few hundred rows), this might be all you need to do.

b) Are you doing a "select *" or a "select count(*)"? If not "select *", then vertical partitioning MAY help, depending on the engine and the other indexes present.

c) Create single-column indexes for each column where the values are widely distributed and don't have too many duplicates. Indexing YEAR_OF_BIRTH would generally be OK, but indexing FEMALE_OR_MALE is often not good - although this is highly database-engine specific.

d) If you have columns like FEMALE_OR_MALE and the "Y predicates" are wide, you have a different problem - selecting the count of females across most of the rows will be hard. You can try indexing, but it depends on the engine.

e) Try to make columns "NOT NULL" if possible - this typically saves 1 bit per row and can simplify internal optimiser operation.

f) Updates/inserts. Creating indexes often hurts insert performance, but if your rate is low enough it might not matter. With only 100M rows, I'll assume your insert rate is reasonably low.

g) Multi-segment keys would help, but you've already said they are a no-go.

h) Get high-speed disks (high RPM) - the problem for these types of queries is usually IO (TPC-H benchmarks are about IO, and yours sounds like an "H"-style problem).
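A rough sketch of (a) and (c), with made-up table and column names:

    -- Index the time column from (a), plus single-column indexes on
    -- well-distributed columns from (c).
    CREATE INDEX ix_events_created_at    ON analytics_events (created_at);
    CREATE INDEX ix_events_year_of_birth ON analytics_events (year_of_birth);
    -- Per (c)/(d): a low-cardinality column such as FEMALE_OR_MALE is
    -- usually a poor candidate for a plain single-column index.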

There are lots more options, but it depends how much effort you want to expend "to make the queries as fast as possible". There are lots of No-SQL and other options to solve this, but I'll leave that part of the question to others.

Upvotes: 0

David Aldridge

Reputation: 52346

As far as Oracle is concerned this would most likely be structured as an interval partitioned table with local bitmap indexes on each column that you might query, and new data being added either through a direct path insert or partition exchange.
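Roughly along these lines - a sketch only, with made-up names and monthly partitions as an example:

    -- Interval partitioning on the time column, with a local bitmap index
    -- on one of the queryable columns (repeat for the others).
    CREATE TABLE analytics_events (
      created_at DATE NOT NULL,
      country    VARCHAR2(2),
      browser    VARCHAR2(30)
      -- plus the remaining attribute columns
    )
    PARTITION BY RANGE (created_at)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    ( PARTITION p_start VALUES LESS THAN (DATE '2012-01-01') );

    CREATE BITMAP INDEX bix_events_country
      ON analytics_events (country) LOCAL;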

Queries for popular combinations of columns could be optimised with a set of materialised views, possibly using rollup or cube queries.

Upvotes: 0

APC

Reputation: 146219

Without indexes, your options for tuning an RDBMS to support this kind of processing are severely limited. Basically you need massive parallelism and super-fast kit. But clearly you're not storing relational data, so an RDBMS is the wrong fit.

Pursuing the parallel route, the industry standard is Hadoop. You can still use SQL-style queries through Hive.
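For example - a HiveQL sketch with hypothetical names; the table definition depends on how your files are laid out in HDFS:

    -- An external table over flat files in HDFS, queried with SQL-like
    -- syntax (executed as MapReduce jobs).
    CREATE EXTERNAL TABLE analytics_events (
      created_at STRING,
      country    STRING,
      browser    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/analytics_events';

    SELECT country, COUNT(*)
    FROM analytics_events
    WHERE created_at >= '2012-01-01' AND created_at < '2012-02-01'
    GROUP BY country;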

Another NoSQL option would be to consider a columnar database. These are an alternative way of organising data for analytics without using cubes, and they are good at loading data fast. Vectorwise is the latest player in the arena. I haven't used it personally, but somebody at last night's LondonData meetup was raving to me about it. Check it out.

Of course, moving away from SQL databases - in whatever direction you go - will incur a steep learning curve.

Upvotes: 1

Diego

Reputation: 36146

You should build an SSAS cube and use MDX to query it.

The cube has "aggregations", which means results calculated ahead of time. Depending on how you configure your cube (and your aggregations), you can have a SUM attribute (A, for example) on a measure group, and every time you ask the cube how many records A has, it will just read the aggregation instead of reading the whole table and calculating it.

Upvotes: 0
