Reputation: 77
I'm working for well known company in a project that should bring integration with other system that are producing one csv per hour of 27Gb. The target is query these files without import em (the main problem is bureaucracy, nobody want resposibility if some data change).
Main filters on this files can be done by dates, the end-user must insert a range start-end dates. After that can be filter by few strings.
For performance improvement i'm indexing files by date writing on each file name the range that contains and making a folder structure like yyyy/mm/dd. For each of following test the first step was make a raw file paths list that will be read.
research will read all files
The dirtiest is also fastes. I want avoid it because security department warned me about all checks to make on input data to prevent shell injection.
Googling i found mariadb CONNECT
engine that can point also huge csvs, so now i'm going on this way creating temporary table with files that research have interest, the bad part is i have to do one table for each query since dates can be different.
For first year We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range. This queries will be done asyncronousely.
Do you know something that can help me on it? Not only for the speed but a good practice to apply. Thanks a lot folks.
Upvotes: 0
Views: 1253
Reputation: 718788
To answer your question:
No. There are no best practices. And, AFAIK there are no generally applicable "good" practices.
But I do have some general advice. If you allow considerations such as bureaucracy and (to a lesser extent) security edicts to dictate your technical solutions, then you are liable to end up with substandard solutions; i.e. solutions that are slow or costly to run and keep running. (If "they" want it to be fast, then "they" shouldn't put impediments in your way.)
I don't think we can give you an easy solution to your problem, but I can say some things about your analysis.
You said about the grep
solution.
"I want avoid it because security department warned me about all checks to make on input data to prevent shell injection."
The solution to that concern simple: don't use an intermediate shell. The dangerous injection attacks will be via shell trickery rather than grep
. Java's ProcessBuilder
doesn't use a shell unless you explicitly use one. The grep
program itself can only read the files that are specified in its arguments, and write to standard output and standard error.
You said about the general architecture:
"The target is query these files without import them (the main problem is bureaucracy, nobody want responsibility if some data change)."
I don't understand the objection here. We know that the CSV files are going to change. You are getting a new 27GB CSV file every hour!
If the objection is that the format of the CSV files is going to change, well that affects your ability to effectively query them. But with a little ingenuity, you could detect the the change in format and adjust the ingestion process on the fly.
"We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range."
If you haven't done this already, you need to do some analysis to see whether your proposed solution is going to be viable. Estimate how much CSV data needs to be scanned to satisfy a typical query. Multiply that by the number of queries to be performed in (say) 24 hours. Then compare that against your NFS server's ability to satisfy bulk reads. Then redo the calculation assuming a given number of queries running in parallel.
Consider what happens if your (above) expectations are wrong. You only need a couple of "idiot" users doing unreasonable things ...
Having a 24 core server for doing the queries is one thing, but the NFS server also needs to be able to supply the data fast enough. You can improve things with NFS tuning (e.g. by tuning block sizes, the number of NFS daemons, using FS-Cache) but the the ultimate bottlenecks will be getting the data off the NFS server's disks and across the network to your server. Bear in mind that there could be other servers "hammering" the NFS server while your application is doing its thing.
Upvotes: 2