Reputation: 19582
I have the following problem:
I have a web application that stores data in a database. I would like clients to be able to extract the data of, e.g., 2 tables to a file (local to the client).
The database could be arbitrarily big (meaning I have no idea how much data could potentially be in it; it could be huge).
What is the best approach for this?
Should all the data be SELECTed out of the tables and returned to the client as a single structure to be stored in a file?
Or should the data be retrieved in parts, e.g. the first 100 entries, then the next 100, and so on, with the single structure built up on the client?
Are there any pros and cons to consider here?
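To illustrate what I mean by the second option, here is a rough sketch of fetching in parts; the table and column names are just placeholders, not my actual schema, and the paging syntax depends on the database:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedExport {
    // Placeholder table/column names; "FETCH FIRST n ROWS ONLY" is standard SQL,
    // but some databases use LIMIT instead.
    private static final String PAGE_QUERY =
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id FETCH FIRST 100 ROWS ONLY";

    public static void main(String[] args) throws SQLException, IOException {
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"))) {
            long lastId = 0;
            while (true) {
                int rowsInPage = 0;
                try (PreparedStatement ps = con.prepareStatement(PAGE_QUERY)) {
                    ps.setLong(1, lastId);                 // keyset paging: continue after the last seen id
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            out.write(lastId + "," + rs.getString("payload"));
                            out.newLine();
                            rowsInPage++;
                        }
                    }
                }
                if (rowsInPage < 100) break;               // last (partial) page reached
            }
        }
    }
}
```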
Upvotes: 0
Views: 260
Reputation: 29639
I've built something similar - there are some really awkward problems here, especially as the file size can grow beyond what you can comfortably handle in a browser. As the amount of data grows, so does the time needed to generate the file; long-running requests are not what a web application is good at, so you run the risk of your web server becoming unhappy with even a smallish number of visitors all requesting a large file.
What we did was split the application into 3 parts.
The "file request" part was a simple web page in which authenticated users could request their file. This kicked off the second part outside the context of the web page request:
File generator: in our case this was a Windows service which looked at a database table of file requests, picked the latest one, ran the appropriate SQL query, wrote the output to a CSV file and zipped it, then moved it to the output directory and e-mailed the user a link. It set the state of the record in the database to make sure only one process ran at any one point in time (a rough sketch of this loop follows below).
FTP/WebDAV site: The ZIP files were written to a folder which was accessible via FTP and WebDAV - these protocols tend to do better with huge files than a standard HTTP download.
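For illustration, here is a rough Java sketch of that generator loop. The file_requests table (id, status), the hard-coded query and the output path are hypothetical stand-ins for our real setup; the guarded UPDATE is what makes sure only one worker picks up a given request:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class FileGeneratorWorker {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass")) {
            while (true) {
                Long requestId = claimNextRequest(con);    // marks the row RUNNING
                if (requestId == null) {
                    Thread.sleep(30_000);                  // nothing pending, poll again later
                    continue;
                }
                // Assumes the output/ directory exists and is served via FTP/WebDAV.
                Path zip = Paths.get("output", "export-" + requestId + ".zip");
                exportToZippedCsv(con, "SELECT * FROM some_table", zip);
                markDone(con, requestId);                  // e-mailing the link is omitted here
            }
        }
    }

    /** Pick the oldest PENDING request and flip it to RUNNING; the guarded
     *  UPDATE keeps two workers from claiming the same request. */
    private static Long claimNextRequest(Connection con) throws SQLException {
        Long id = null;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT MIN(id) AS id FROM file_requests WHERE status = 'PENDING'")) {
            if (rs.next()) {
                id = rs.getLong("id");
                if (rs.wasNull()) id = null;
            }
        }
        if (id == null) return null;
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE file_requests SET status = 'RUNNING' WHERE id = ? AND status = 'PENDING'")) {
            ps.setLong(1, id);
            return ps.executeUpdate() == 1 ? id : null;    // someone else got there first
        }
    }

    /** Stream the query result as a single CSV entry inside a ZIP file. */
    private static void exportToZippedCsv(Connection con, String query, Path zipFile)
            throws SQLException, IOException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(query);
             ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            zos.putNextEntry(new ZipEntry("export.csv"));
            BufferedWriter w = new BufferedWriter(
                    new OutputStreamWriter(zos, StandardCharsets.UTF_8));
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int c = 1; c <= cols; c++) {
                    if (c > 1) line.append(',');
                    line.append(rs.getString(c));          // naive CSV: no quoting/escaping
                }
                w.write(line.toString());
                w.newLine();
            }
            w.flush();                                     // push remaining bytes into the zip entry
            zos.closeEntry();
        }
    }

    private static void markDone(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE file_requests SET status = 'DONE' WHERE id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}
```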
This worked pretty well - users didn't like to wait for their files, but the delay was rarely more than a few minutes.
Upvotes: 1
Reputation: 2766
We have a similar use case with an Oracle cluster containing approx. 40 GB of data. The solution that works best for us is to fetch as much data as possible per SELECT statement, as this reduces DB overhead significantly.
That being said, there are three optimizations which worked very well for us (a combined sketch follows the list):
1.) We partition the data into 10 roughly equal-sized sets and select them from the database in parallel. For our cluster we found that 8 connections in parallel are approx. 8 times faster than a single connection. There is some additional speedup up to 12 connections, but that depends on your database and your DBA.
2.) Stay away from Hibernate and other ORMs and use hand-written JDBC once you are dealing with large amounts of data. Use every optimization you can get there (e.g. ResultSet.setFetchSize()).
3.) Our data compresses very well, and putting it through a gzip stream saves a lot of I/O time. In our case it removed I/O from the critical path. By the way, this is also true for writing the data to a file.
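For illustration, a rough sketch of how 1.) to 3.) can fit together with plain JDBC. The table name, ID range, partition count and fetch size below are assumptions, not our actual values:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

public class ParallelGzipExport {

    private static final int PARTITIONS = 10;     // 1.) split the data into roughly equal sets
    private static final int FETCH_SIZE = 5_000;  // 2.) plain JDBC with a large fetch size

    public static void main(String[] args) throws Exception {
        long maxId = 1_000_000L;                  // assumed upper bound of the ID range
        long step = maxId / PARTITIONS + 1;

        ExecutorService pool = Executors.newFixedThreadPool(8); // ~8 parallel connections
        List<Future<?>> work = new ArrayList<>();
        for (int p = 0; p < PARTITIONS; p++) {
            long lo = p * step;
            long hi = lo + step;
            int part = p;
            work.add(pool.submit(() -> exportPartition(lo, hi, part)));
        }
        for (Future<?> f : work) f.get();         // propagate any failure
        pool.shutdown();
    }

    private static Void exportPartition(long lo, long hi, int part) throws Exception {
        String sql = "SELECT id, payload FROM events WHERE id >= ? AND id < ?";
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@...", "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setFetchSize(FETCH_SIZE);          // 2.) fewer round trips per partition
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery();
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                         new GZIPOutputStream(                       // 3.) compress on the way out
                                 Files.newOutputStream(Paths.get("part-" + part + ".csv.gz"))),
                         StandardCharsets.UTF_8))) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + "," + rs.getString("payload"));
                    out.newLine();
                }
            }
        }
        return null;
    }
}
```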
Upvotes: 0