Reputation: 3168
I want to access a large file (file size may vary from 30 MB to 1 GB) through 10 threads and then process each line in the file and write them to another file through 10 threads. If I use only one thread to access the IO, the other threads are blocked. The processing takes some time almost equivalent to reading a line of code from file system. There is one more constraint, the data in the output file should be in the same order as that of the input file.
I want your thoughts on the design of this system. Is there any existing API to support concurrent access to files?
Also writing to same file may lead to deadlock.
Please suggest how to achieve this if I am concerned with time constraint.
Upvotes: 20
Views: 31419
Reputation: 5055
The class shouldn't dispatch strings, it should wrap them in a Line
class that contains meta information, e. g. The line number, since you want to keep the original sequence.
You need a processing class, that does the actual work on the collected data. In your case there is no work to do. The class just stores the information, you can extend it someday to do additional stuff (E.g. reverse the string. Append some other strings, ...)
Then you need a merger class, that does some kind of multiway merge sort on the processing threads and collects all the references to the Line
instances in sequence.
The merger class could also write the data back to a file, but to keep the code clean...
Of course you need much memory for this approach, if you are short on main memory. You'd need a stream based approach that kind of works inplace to keep the memory overhead small.
UPDATE Stream-based approach
Everthing stays the same except:
The Reader
thread pumps the read data into a Balloon
. This balloon has a certain number of Line
instances it can hold (The bigger the number, the more main memory you consume).
The processing threads take Line
s from the balloon, the reader pumps more lines into the balloon as it gets emptier.
The merger class takes the lines from the processing threads as above and the writer writes the data back to a file.
Maybe you should use FileChannel
in the I/O threads, since it's more suited for reading big files and probably consumes less memory while handling the file (but that's just an estimated guess).
Upvotes: 13
Reputation: 527
You can do this using FileChannel in java which allows multiple threads to access the same file. FileChannel allows you to read and write starting from a position. See sample code below:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
public class OpenFile implements Runnable
{
private FileChannel _channel;
private FileChannel _writeChannel;
private int _startLocation;
private int _size;
public OpenFile(int loc, int sz, FileChannel chnl, FileChannel write)
{
_startLocation = loc;
_size = sz;
_channel = chnl;
_writeChannel = write;
}
public void run()
{
try
{
System.out.println("Reading the channel: " + _startLocation + ":" + _size);
ByteBuffer buff = ByteBuffer.allocate(_size);
if (_startLocation == 0)
Thread.sleep(100);
_channel.read(buff, _startLocation);
ByteBuffer wbuff = ByteBuffer.wrap(buff.array());
int written = _writeChannel.write(wbuff, _startLocation);
System.out.println("Read the channel: " + buff + ":" + new String(buff.array()) + ":Written:" + written);
}
catch (Exception e)
{
e.printStackTrace();
}
}
public static void main(String[] args)
throws Exception
{
FileOutputStream ostr = new FileOutputStream("OutBigFile.dat");
FileInputStream str = new FileInputStream("BigFile.dat");
String b = "Is this written";
//ostr.write(b.getBytes());
FileChannel chnl = str.getChannel();
FileChannel write = ostr.getChannel();
ByteBuffer buff = ByteBuffer.wrap(b.getBytes());
write.write(buff);
Thread t1 = new Thread(new OpenFile(0, 10000, chnl, write));
Thread t2 = new Thread(new OpenFile(10000, 10000, chnl, write));
Thread t3 = new Thread(new OpenFile(20000, 10000, chnl, write));
t1.start();
t2.start();
t3.start();
t1.join();
t2.join();
t3.join();
write.force(false);
str.close();
ostr.close();
}
}
In this sample, there are three threads reading the same file and writing to the same file and do not conflict. This logic in this sample has not taken into consideration that the sizes assigned need not end at a line end etc. You will have find the right logic based on your data.
Upvotes: 3
Reputation: 1181
Be aware that the ideal number of threads is limited by the hardware architecture and other stuffs (you could think about consulting the thread pool to calculate the best number of threads). Assuming that "10" is a good number, we proceed. =)
If you are looking for performance, you could do the following:
Read the file using the threads you have and process each one according to your business rule. Keep one control variable that indicates the next expected line to be inserted on the output file.
If the next expected line is done processing, append it to a buffer (a Queue) (it would be ideal if you could find a way to insert direct in the output file, but you would have lock problems). Otherwise, store this "future" line inside a binary-search-tree, ordering the tree by line position. Binary-search-tree gives you a time complexity of "O(log n)" for searching and inserting, which is really fast for your context. Continue to fill the tree until the next "expected" line is done processing.
Activates the thread that will be responsible to open the output file, consume the buffer periodically and write the lines into the file.
Also, keep track of the "minor" expected node of the BST to be inserted in the file. You can use it to check if the future line is inside the BST before starting searching on it.
This approach uses - O(n) to read the file (but is parallelized) - O(1) to insert the ordered lines into a Queue - O(Logn)*2 to read and write the binary-search-tree - O(n) to write the new file
plus the costs of your business rule and I/O operations.
Hope it helps.
Upvotes: 2
Reputation: 190
Spring Batch comes to mind.
Maintaining the order would require a post process step i.e Store the read index/key ordered in the processing context.The processing logic should store the processed information in context as well.Once processing is done you can then post process the list and write to file.
Beware of OOM issues though.
Upvotes: 1
Reputation: 1154
I have faced similar problem in past. Where i have to read data from single file, process it and write result in other file. Since processing part was very heavy. So i tried to use multiple threads. Here is the design which i followed to solve my problem:
Upvotes: 0
Reputation: 17111
Any sort of IO whether it be disk, network, etc. is generally the bottleneck.
By using multiple threads you are exacerbating the problem as it is very likely only one thread can have access to the IO resource at one time.
It would be best to use one thread to read, pass off info to a worker pool of threads, and then writing directly from there. But again if the workers write to the same place there will be bottlenecks as only one can have the lock. Easily fixed by passing the data to a single writer thread.
In "short":
Single reader thread writes to BlockingQueue or the like, this gives it a natural ordered sequence.
Then worker pool threads wait on the queue for data, recording its sequence number.
Worker threads then write the processed data to another BlockingQueue this time attaching its original sequence number so that
The writer thread can take the data and write it in sequence.
This will likely yield the fastest implementation possible.
Upvotes: 8
Reputation: 3709
Since order need to be maintained, so problem in itself says that reading and writing cannot be done in parallel as it is sequential process, the only thing that you can do in parallel is processing of records but that also doesnt solve much with only one writer.
Here is a design proposal:
Good Luck, hoping you get the best design.
Cheers !!
Upvotes: 0
Reputation: 12817
I would start with three threads.
Of course, I would also ensure that the output file is on a physically different disk than the input file.
If processing tends to be faster slower than the I/O (monitor the queue sizes), you could then start experimenting with two or more parallell "processors" that are synchronized in how they read and write their data.
Upvotes: 15
Reputation: 1263
I have encountered a similar situation before and the way I've handled it is this:
Read the file in the main thread line by line and submit the processing of the line to an executor. A reasonable starting point on ExecutorService is here. If you are planning on using a fixed no of threads, you might be interested in Executors.newFixedThreadPool(10)
factory method in the Executors
class. The javadocs on this topic isn't bad either.
Basically, I'd submit all the jobs, call shutdown and then in the main thread continue to write to the output file in the order for all the Future
that are returned. You can leverage the Future
class' get()
method's blocking nature to ensure order but you really shouldn't use multithreading to write, just like you won't use it to read. Makes sense?
However, 1 GB
data files? If I were you, I'd be first interested in meaningfully breaking down those files.
PS: I've deliberately avoided code in the answer as I'd like the OP to try it himself. Enough pointers to the specific classes, API methods and an example have been provided.
Upvotes: 2
Reputation: 8285
One of the possible ways will be to create a single thread that will read input file and put read lines into a blocking queue. Several threads will wait for data from this queue, process the data.
Another possible solution may be to separate file into chunks and assign each chunk to a separate thread.
To avoid blocking you can use asynchronous IO. You may also take a look at Proactor pattern from Pattern-Oriented Software Architecture Volume 2
Upvotes: 3