Reputation: 1954
Currently I have a parser set up that parses through CSV files of ~2 million records. I then apply some filtering algorithms to weed out the records I want to include/exclude, and finally write everything back out to a new CSV file.
I have done some benchmarking, and it turns out that writing the data to the CSV is very expensive and causes massive slowdowns when filtering and appending to a file at the same time. I was wondering if I could perform all my filtering, place the lines to be written in a queue, and then have a second process perform all the writing on its own once that queue is full or all the filtering is complete.
So basically to summarize:
Read line
Decide whether to discard or keep
If I'm keeping the line, add it to the "Write Queue"
Check whether the write queue is full; if so, start the new process that will begin writing
Continue filtering until completed (a rough sketch of what I have in mind is below)
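Something like the following is roughly what I have in mind (just a sketch; the file names, queue size, and the keep() check are placeholders for my real code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FilterAndWrite {

    // Sentinel object telling the writer thread there is nothing left to write.
    private static final String POISON_PILL = new String("EOF");

    public static void main(String[] args) throws Exception {
        // Bounded queue: the filtering thread blocks if the writer falls too far behind.
        BlockingQueue<String> writeQueue = new ArrayBlockingQueue<>(10_000);

        // Writer thread: drains the queue and appends each line to the output file.
        Thread writer = new Thread(() -> {
            try (BufferedWriter bw = new BufferedWriter(new FileWriter("filtered.csv"))) {
                while (true) {
                    String line = writeQueue.take();
                    if (line == POISON_PILL) {   // reference comparison on the sentinel
                        break;
                    }
                    bw.write(line);
                    bw.newLine();
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Main thread: read, filter, and hand the surviving lines to the writer.
        try (BufferedReader br = new BufferedReader(new FileReader("myFile.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (keep(line)) {                // placeholder for the real filtering logic
                    writeQueue.put(line.trim());
                }
            }
        }

        writeQueue.put(POISON_PILL);             // signal the writer that filtering is done
        writer.join();
    }

    // Placeholder filter: keeps everything for the purposes of this sketch.
    private static boolean keep(String line) {
        return true;
    }
}

The idea is that the filtering loop only blocks when the writer falls too far behind, instead of paying the write cost on every single line.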
Thanks for all your help!
EDIT: The way I'm writing is the following:
FileWriter fw = new FileWriter("myFile.csv");
BufferedWriter bw = new BufferedWriter(fw);

while (read file...) {              // pseudocode: loop over the input lines
    // perform filters etc...
    try {
        bw.write(data.trim());
        bw.newLine();
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
bw.close();
Upvotes: 1
Views: 1113
Reputation: 1
You may want to consider using Spring Batch, unless you have constraints that prevent you from using Spring.
Upvotes: 0
Reputation: 36532
The read and write processes are both I/O bound (seeking to sectors on disk and performing disk I/O to/from memory) while the filtering process is entirely CPU bound. This is a good candidate for multithreading.
I would use three threads: reading, filtering, and writing. This calls for two queues, but there's no reason to wait for the queues to become full before processing.
Make sure to use buffered readers and writers to minimize contention between the reader and writer threads. You want to minimize disk seeking since that will be the bottleneck, assuming the filtering process is fairly simple.
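As a rough sketch of that three-thread layout (the file names, queue capacities, and the shouldKeep() check are placeholders, not anything specific to your code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {

    private static final String EOF = new String("EOF"); // sentinel marking end of stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> rawLines = new ArrayBlockingQueue<>(10_000);      // reader -> filter
        BlockingQueue<String> filteredLines = new ArrayBlockingQueue<>(10_000); // filter -> writer

        // Reader thread: pulls lines off disk sequentially and queues them.
        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader("input.csv"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    rawLines.put(line);
                }
                rawLines.put(EOF);
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });

        // Filter thread: CPU-bound work only, never touches the disk.
        Thread filter = new Thread(() -> {
            try {
                String line;
                while ((line = rawLines.take()) != EOF) {
                    if (shouldKeep(line)) {
                        filteredLines.put(line);
                    }
                }
                filteredLines.put(EOF);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });

        // Writer thread: the only thread that writes to the output file.
        Thread writer = new Thread(() -> {
            try (BufferedWriter bw = new BufferedWriter(new FileWriter("output.csv"))) {
                String line;
                while ((line = filteredLines.take()) != EOF) {
                    bw.write(line);
                    bw.newLine();
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });

        reader.start();
        filter.start();
        writer.start();
        reader.join();
        filter.join();
        writer.join();
    }

    // Stand-in for the real filtering logic.
    private static boolean shouldKeep(String line) {
        return !line.isEmpty();
    }
}

Bounded queues also give you back-pressure for free: if the writer falls behind, the filter thread simply blocks on put() instead of letting the queue grow without limit.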
Upvotes: 3