Adviov

Reputation: 480

Which database for dealing with very large result-sets?

I am currently working on a PHP application (pre-release).

Background

We have a table in our MySQL database which is expected to grow extremely large - it would not be unusual for a single user to own 250,000 rows in this table. Each row in the table is given an amount and a date, among other things.

Furthermore, this particular table is read from (and written to) very frequently - on the majority of pages. Given that each row has a date, I'm using GROUP BY (on the year of the date) to minimise the size of the result-set returned by MySQL - rows that fall in the same year can now be seen as just one total.
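
For concreteness, here is a minimal sketch of the kind of per-year rollup I mean, assuming a hypothetical schema (a transactions table with user_id, amount and date columns) since the real one isn't relevant:

    <?php
    // Hypothetical schema: transactions(user_id, amount, date).
    // Collapse one user's rows into a single total per year.
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare(
        'SELECT YEAR(`date`) AS yr, SUM(amount) AS total
           FROM transactions
          WHERE user_id = :uid
          GROUP BY YEAR(`date`)'
    );
    $userId = 123; // example user
    $stmt->execute(array(':uid' => $userId));
    $totalsByYear = $stmt->fetchAll(PDO::FETCH_ASSOC);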

However, a typical page will still have a result-set of between 1,000 and 3,000 results. There are also places where many SUM()s are performed, totalling many tens - if not hundreds - of thousands of rows.

Trying MySQL

On a typical page, MySQL usually took around 600-900ms. Using LIMIT and offsets didn't help performance, and the data has been heavily normalised, so it doesn't seem like further normalisation would help.

To make matters worse, there are parts of the application which require the retrieval of 10,000-15,000 rows from the database. The results are then used in a calculation by PHP and formatted accordingly. Given this, the performance of MySQL wasn't acceptable.

Trying MongoDB

I have converted the table to MongoDB, and it's noticeably faster - it usually takes around 250ms to retrieve 2,000 documents. However, the $group command in the aggregation pipeline - needed to aggregate fields by the year they fall in - slows things down. Unfortunately, keeping a running total and updating it whenever a document is removed/updated/inserted is also out of the question: although we can use a yearly total for some parts of the app, in other parts the calculations require that each amount falls on a specific date.
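
For reference, this is roughly what that per-year $group looks like with the legacy PHP Mongo driver - the collection and field names are assumptions, and the date field must be stored as a MongoDate for $year to work:

    <?php
    // Hypothetical collection: transactions {user_id, amount, date (MongoDate)}.
    $m   = new MongoClient();
    $col = $m->selectDB('app')->selectCollection('transactions');
    $userId = 123; // example user

    $out = $col->aggregate(array(
        array('$match' => array('user_id' => $userId)),
        array('$group' => array(
            '_id'   => array('$year' => '$date'),   // group key: calendar year
            'total' => array('$sum' => '$amount'),  // yearly total
        )),
    ));
    // On success, $out['result'] holds one document per year.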

I've also considered Redis, although I think the complexity of the data is beyond what Redis was designed for.

The Final Straw

On top of all of this, speed is important, so performance is right up there in terms of priorities.

Questions:

  1. What is the best way to store data which is frequently read/written and rapidly growing, with the knowledge that most queries will retrieve a very large result-set?
  2. Is there another solution to the problem? I'm totally open to suggestions.

I'm a little stuck at the moment; I haven't been able to retrieve such a large result-set in an acceptable amount of time. It seems most datastores are great for small retrievals - even from large amounts of data - but I haven't been able to find anything on retrieving large amounts of data from an even larger table/collection.

Upvotes: 4

Views: 239

Answers (1)

Sammaye

Reputation: 43884

I only read the first two lines, but you are using aggregation (GROUP BY) and then expecting it to run in realtime?

I'll say this not to undermine you, but to try and help: you seem new to the internals of databases.

The group operator in both MySQL and MongoDB works in-memory. In other words, it takes whatever data structure you provide - whether it be an index or a document (row) - and goes through each row/document, taking the field and grouping it up.

This means that you can speed it up in both MySQL and MongoDB by making sure you are using an index for the grouping, but this only goes so far - even with housing the index in your direct working set (memory) in MongoDB.
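
For example (the index names and fields are assumptions based on your question), the supporting indexes might look like this - note that in MongoDB only the $match stage at the front of the pipeline can use the index, while $group itself still walks the matched documents in memory:

    <?php
    // MySQL: composite index so the WHERE + GROUP BY + SUM can be
    // served from the index rather than from the full rows.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $pdo->exec('ALTER TABLE transactions
                ADD INDEX idx_user_date_amount (user_id, `date`, amount)');

    // MongoDB (legacy driver): an index the $match stage can use.
    $m = new MongoClient();
    $m->selectDB('app')->selectCollection('transactions')
      ->ensureIndex(array('user_id' => 1, 'date' => 1));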

In fact, using LIMIT with an OFFSET as well is probably just slowing things down even further, frankly, since after writing out the set MySQL then needs to query again to get your answer.

Once done, the database will write out the result: MySQL writes it to a result set (using memory and IO here), while MongoDB replies inline unless you have set $out, the maximum size of the inline output being 16MB (the maximum size of a document).

The final point to take away here is: aggregation is horrible

There is no silver bullet that will save you here. Some databases will attempt to boast about their speed etc., but the fact is most big aggregators use something called "pre-aggregated reports". You can find a quick introduction in the MongoDB documentation: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/

This means that you put the effort of aggregating and grouping onto some other process which can do it easily enough, allowing your reading thread - the one that needs to be realtime - to do its thing in realtime.
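
A minimal sketch of that idea, keeping a yearly counter up to date on every write so reads never aggregate raw rows (the collection and field names here are assumptions for illustration):

    <?php
    // Hypothetical pre-aggregated collection: yearly_totals.
    // Call this whenever a transaction is inserted; use -$amount on removal.
    $m      = new MongoClient();
    $totals = $m->selectDB('app')->selectCollection('yearly_totals');

    $userId = 123;                          // example values
    $amount = 9.99;
    $date   = new DateTime('2013-06-01');

    $totals->update(
        array('user_id' => $userId, 'year' => (int) $date->format('Y')),
        array('$inc' => array('total' => $amount)),
        array('upsert' => true) // create the year's document on first write
    );
    // Realtime pages now read a handful of tiny documents instead of
    // grouping hundreds of thousands of rows at request time.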

Upvotes: 2
