I am going to generate simple CSV file report in Java using Hibernate and MySQL . I am using Native SQL (because query is too complex which is not possible with HQL or Criteria query and also this doesn't matter here) part of Hibernate to fetch the data and simply writing it using any of CSVWriter api (this doesn't matter here.) As far all is well, but the problem starts now. Requirements: The report size can be with 5000K to 15000K records with 25 fields . It can be run on real time. There is one report column (let's say finalValue ) for which I want sorting and it can be extract like this, (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)) . Problem: MySQL Indexing can not be used for finalValue column (mentioned above) as it is complex combination of aggregate functions. So if execute the query (with or without limit) with sorting, it is taking 40sec , otherwise 0.075sec . The Solutions: These are the some solutions, that I can think but each have some limitations. Sorting using java.util.TreeSet : It will throw the OutOfMemoryError , which is obvious as heap space will be exceed if I will put 15000K heavy objects. Using limit in MySQL query and write file for each iteration : It will take much time as every query will take same time around 50sec as without sorting limit can't be use. So the main problem here is to overcome two parameters : Memory and Time . I need to balance both of them. Any ideas, suggestions? NOTE: I am not given here any snaps of code that doesn't mean question details is not enough. Code doe's not require here.

Sorting funcationality Optimization using MySQL and Java

Reputation: 142296

Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.

My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.

Upvotes: 0

acesargl

Reputation: 569

Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.

CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
  SELECT ... -- your select

Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).

Then select the temporary table ordered by the finalValue (the index will be used).

SELECT * FROM my_temp_table ORDER BY finalValue;

And finally drop the temporary table (or reuse it if you want, but remember that when client session terminates temporary data is automatically deleted).

Upvotes: 0

Bhashit Parikh

Reputation: 3131

I think you can use a streaming ResultSet here. As documeted on this page under the ResultSet section.

Here are the main points from the documentation.

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
          java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.

There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.

The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.

If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.

So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.

This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.

However, there might be some server config variables that you can use to optimize the sorting performance.

From the MySQL Order-By optimization page

I think you can set the read_rnd_buffer_size value, which, according to the docs, can:

Setting the variable to a large value can improve ORDER BY performance by a lot

Another one is sort_buffer_size, for which, the docs say the follwing:

If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.

Another variable that can probably help is the innodb_buffer_pool_size. Which allows innodb to keep as much table data in memory as possible and avoid some disk-seeks.

However, all of these variables require some tuning. Some trial-and-error and probably some kind of benchmarking to get right.

There are some other suggestions on that MySQL Order-By optimization page as well.

Upvotes: 1

Sorting funcationality Optimization using MySQL and Java

Answers (3)

Related Questions