Reputation: 3625
I am developing a small data warehouse system with a web interface where people can run filtered searches. There are currently about 50 columns that people may wish to filter on, and about 2.5 million rows. A table scan is painfully slow. The trouble is that the queries I'm getting have no common prefixes.
Right now I'm using sqlite3, which will only use an index if the columns required are the leftmost columns in that index. This seems to mean I'd need a lot of indexes. A quick glance at MySQL suggests it would also require many indexes for this kind of query.
My question is what indexing implementations are available for different database systems which can handle this kind of query on arbitrary combinations of columns?
I've prototyped my own indexing scheme: I store extra tables that list the integer primary keys in my big table where each value for each column occurs, and I keep enough statistics to be able to examine the values with the smallest number of matches first. It works okay; much better than a table scan, but still a bit on the slow side, which is unsurprising for a first version in Python doing many SQL queries.
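For reference, here is a minimal sketch of the kind of scheme I mean, using sqlite3. The names (`big_table`, its `id` primary key, the `idx_<column>` side tables) are illustrative, not my actual schema:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

def build_value_index(column):
    """Create a side table mapping each distinct value of `column` to the
    primary keys (assumed here to be big_table.id) of rows holding it."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS idx_{column} (value, pk INTEGER)")
    conn.execute(f"INSERT INTO idx_{column} SELECT {column}, id FROM big_table")
    conn.execute(f"CREATE INDEX IF NOT EXISTS ix_{column} ON idx_{column} (value)")
    conn.commit()

def filtered_search(criteria):
    """criteria: dict mapping column name -> value. Probe the most
    selective criterion first, then intersect the remaining key sets."""
    def match_count(item):
        # A real version would keep these counts as precomputed
        # statistics rather than querying for them on every search.
        col, val = item
        return conn.execute(
            f"SELECT COUNT(*) FROM idx_{col} WHERE value = ?", (val,)
        ).fetchone()[0]

    result = None
    for col, val in sorted(criteria.items(), key=match_count):
        pks = {pk for (pk,) in conn.execute(
            f"SELECT pk FROM idx_{col} WHERE value = ?", (val,))}
        result = pks if result is None else result & pks
        if not result:
            break  # intersection already empty; stop early
    return result or set()
```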
Upvotes: 1
Views: 412
Reputation: 294387
There are column-oriented databases that store data on a per-column basis, where every column effectively acts as its own index. They are a very good fit for data warehouses, as they are extremely fast to read but fairly slow to update.
Kickfire is one such example: a customized MySQL engine that held the TPC-H benchmark top crown for a number of weeks, at an impressive system cost. Note that Kickfire is an appliance, sold as a hardware box.
Infobright would be another similar example, and has a free community edition that runs on Windows and Linux.
Upvotes: 2
Reputation: 75235
One should only consider introducing "home grown" index structures, based on SQL tables, as a last resort, i.e. if there still exist [business-wise plausible] query cases not properly handled by a traditional index setting, for example if the list of such indexes were to become too big.
A few observations
You do not necessarily need indexes that include all of the columns that may be involved in one particular query; only the [collectively] selective ones may be required.
In other words, if a query uses, say, columns a, b, c and d, but an index on a and b already exists and statistically narrows the result to only a few thousand rows, it may be acceptable not to introduce indexes on a, b and c (or d, or both), if c and d are not very plausible search criteria (used infrequently), and if their width would unduly burden the a+b index (or if other columns were a better fit for being "tacked on" to the a+b index). A sketch of this residual-filtering idea follows this list.
Aside from the obvious additional demand they put on disk storage, additional indexes, while possibly helping with SELECT (read) operations, may also become an impediment with CUD (Create/Update/Delete) operations. The context here appears akin to a data warehouse, where few [unscheduled] CUD operations take place, but it is good to keep this in mind.
See SQLite Optimizer for valuable insight into how SQLite decides the way a particular query is executed.
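To illustrate the point about [collectively] selective indexes, here is a small sketch, assuming a hypothetical table `t` with columns a, b, c and d; `EXPLAIN QUERY PLAN` shows SQLite driving the query from the two-column index while c and d are applied as residual filters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a, b, c, d)")
conn.execute("CREATE INDEX ix_ab ON t (a, b)")

# SQLite narrows the candidate rows via ix_ab, then checks c and d
# against each surviving row; no four-column index is needed.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM t WHERE a = ? AND b = ? AND c = ? AND d = ?",
    (1, 2, 3, 4),
).fetchall()
for row in plan:
    print(row)  # e.g. ... SEARCH t USING INDEX ix_ab (a=? AND b=?)
```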
Making a list of indexes
A tentative basis for the index scheme for this application may look like this:
From this basis we can then define the actual list of indexes needed by:
In this situation, I find a hand-written tree structure a useful tool for managing the otherwise unmanageable list of possible combinations. Assuming a maximum of 4 search criteria selected from the 50 columns indicated in the question, there are in excess of 230,000 combinations to consider... The tree helps prune this rather quickly.
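As a quick sanity check on that count (Python 3.8+ for `math.comb`):

```python
from math import comb

# Unordered choices of exactly 4 columns out of 50:
print(comb(50, 4))                            # 230300
# Including 1-, 2- and 3-column combinations as well:
print(sum(comb(50, k) for k in range(1, 5)))  # 251175
```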
Upvotes: 0
Reputation: 3584
When there are too many indexes to create for a table, I usually fall back on Full Text Search. I can't say whether it will fit your scenario, though.
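For what it's worth, a minimal sketch of what that might look like in sqlite3, assuming your SQLite build includes the FTS5 extension (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table indexes every listed column at once, so any
# combination of columns can be queried without predefined composite
# indexes.
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(region, product, status)")
conn.execute(
    "INSERT INTO facts (region, product, status) VALUES (?, ?, ?)",
    ("emea", "widget", "shipped"),
)

# Column filters in the MATCH syntax restrict terms to specific columns.
rows = conn.execute(
    "SELECT * FROM facts WHERE facts MATCH ?",
    ("region:emea AND status:shipped",),
).fetchall()
print(rows)
```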
Upvotes: 1
Reputation: 96600
Since data warehouses are typically optimized for reading data, not writing it, I would consider simply indexing all the columns. Yes, this will slow down loading data into the warehouse, but typically that happens during non-peak hours and only once a day or less often.
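A sketch of how that could be scripted in sqlite3; the table name `big_table` is assumed from the question, not known:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Create one single-column index per column. PRAGMA table_info returns
# (cid, name, type, notnull, dflt_value, pk) per column; row[1] is the name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(big_table)")]
for col in columns:
    conn.execute(f'CREATE INDEX IF NOT EXISTS ix_{col} ON big_table ("{col}")')
conn.commit()
```

Note that SQLite generally picks a single index per table for a given query, so this approach works best when at least one of the filtered columns is reasonably selective on its own.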
Upvotes: 0