
Reputation: 41

Splitting a text file into chunks in Java using multiple threads

I have split a text file (50 GB) into chunks based on the formula (total file size / split size). Currently the splitting is done sequentially in a single thread. How can I change this code to perform the splitting with multiple threads (i.e., the threads split the file in parallel and store the pieces in a folder)? I don't want to read the whole file into memory, as that would use more CPU. My main goal is to reduce CPU utilization and complete the splitting of the file in as little time as possible. I have 8 CPU cores.

Any suggestions? Thanks in advance.

import java.io.*;
import java.util.concurrent.*;

public class ExecMap {

    public static void main(String[] args) {
        String filePath = "/home/xm/Downloads/wikipedia_50GB/wikipedia_50GB/file21";
        File file = new File(filePath);
        long splitFileSize = 64 * 1024 * 1024;          // 64 MB per split
        long fileSize = file.length();
        System.out.println(fileSize);
        int mappers = (int) (fileSize / splitFileSize); // number of full-size splits
        System.out.println(mappers);
        mapSplit(filePath, splitFileSize, mappers, fileSize);
    }

    private static void mapSplit(String filePath, long splitLen, int mappers, long fileSize) {
        ExecutorService executor = Executors.newFixedThreadPool(1);
        executor.submit(() -> {
            long leng = 0;
            int count = 1, data;
            try {
                long startTime = System.currentTimeMillis(); // start time
                System.out.println(startTime);
                InputStream infile = new BufferedInputStream(new FileInputStream(filePath));
                data = infile.read();
                while (data != -1) {
                    String name = Thread.currentThread().getName();
                    System.out.println("task started: " + name + " ==== " + System.currentTimeMillis());
                    File outName = new File("/home/xm/Desktop/split/Mapper" + count + ".txt");
                    OutputStream outfile = new BufferedOutputStream(new FileOutputStream(outName));
                    // Copy one byte at a time until this split reaches splitLen bytes.
                    while (data != -1 && leng < splitLen) {
                        outfile.write(data);
                        leng++;
                        data = infile.read();
                    }
                    leng = 0;
                    outfile.close();
                    count++;
                    System.out.println("task finished: " + name);
                }
                infile.close(); // close the input stream when done
                long endTime = System.currentTimeMillis();
                System.out.println(endTime);
                long msec = endTime - startTime;
                System.out.println("Difference in milliseconds: " + msec);
                System.out.println("Difference in seconds: " + msec / 1000);
            } catch (IOException e) {
                e.printStackTrace();
            }
            executor.shutdownNow();
        });
    }
}

Upvotes: 4

Views: 3677

Answers (3)

Richard Chambers

Reputation: 17593

The basic multi-threaded approach is to take a task, divide it into sub-tasks that can each be done as an individual unit of work, and create a thread for each sub-task. This works best when the threads are independent of each other, do not require any kind of communication, and do not share any resources.

House building as an analogy

So if we are building a house, some sub-tasks need to be done in a particular order: a foundation must exist before the house can be built, and the walls need to be in place before the roof can be put on.

However, some sub-tasks can be done independently: the roof can be shingled while the plumbers are installing the plumbing and the bricklayers are bricking up the outside of the house.

Basic thoughts on the problem to be solved

In the case of your file splitting, the basic approach would be to take the task (splitting the file), divide it into several sub-tasks, and assign a portion of the file to each thread.

However, this particular task has a common piece of work that is going to create a bottleneck and may require some kind of synchronization between the threads: reading from the original file. Having several threads access the same file requires that each thread be able to read its assigned portion of the file independently.

The good news is that since the only thing being shared is the original file and it is only being read from, you do not need to worry about synchronizing the file reads at the Java level.

A first approach

The approach I would consider first is to divide the output files evenly among the threads. Each thread would open the original file with its own reader, so each thread's file I/O is independent of the others. Though the original file is shared, each thread keeps its own read position and therefore reads from the file independently.

Each thread would then create its own set of output files, beginning at its assigned offset within the original file, reading from the original file and writing to each output file in turn.

Doing it this way maintains the independence of each thread's work: each thread has its own file-access state, its own assigned region of the original file, and its own set of output files.
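This first approach could be sketched roughly as below. This is a minimal sketch, not the poster's code: the command-line paths, split size, output naming, and the way regions are divided among threads are all assumptions. Each worker opens the original file with its own RandomAccessFile and seeks to its assigned region, so no read state is shared.

```java
import java.io.*;
import java.util.concurrent.*;

public class ParallelSplit {
    // Each worker opens its own RandomAccessFile, seeks to its region,
    // and writes out its splits independently of the other workers.
    static void splitRegion(String src, String outDir, long start, long end,
                            long splitLen, int firstIndex) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(src, "r")) {
            in.seek(start);                        // jump to this thread's region
            byte[] buf = new byte[64 * 1024];
            long remaining = end - start;
            int index = firstIndex;
            while (remaining > 0) {
                long thisSplit = Math.min(splitLen, remaining);
                try (OutputStream out = new BufferedOutputStream(
                        new FileOutputStream(outDir + "/Mapper" + index + ".txt"))) {
                    long written = 0;
                    while (written < thisSplit) {
                        int n = in.read(buf, 0, (int) Math.min(buf.length, thisSplit - written));
                        if (n < 0) break;          // end of file
                        out.write(buf, 0, n);
                        written += n;
                    }
                }
                remaining -= thisSplit;
                index++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String src = args[0], outDir = args[1];    // source file, output folder
        long splitLen = 64L * 1024 * 1024;         // 64 MB per split (assumed)
        long fileSize = new File(src).length();
        int threads = Runtime.getRuntime().availableProcessors();
        int splits = (int) ((fileSize + splitLen - 1) / splitLen);
        int splitsPerThread = (splits + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            long start = (long) t * splitsPerThread * splitLen;
            long end = Math.min(start + (long) splitsPerThread * splitLen, fileSize);
            int firstIndex = t * splitsPerThread + 1;
            if (start < fileSize)
                pool.submit(() -> { splitRegion(src, outDir, start, end, splitLen, firstIndex); return null; });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Each thread's split indices are spaced apart so the output file names never collide.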

Other considerations

At the operating system level the file system is shared. So the operating system will need to interleave and multiplex the file system access. For an application such as this where data is being read from a disk file and then immediately written back to another disk file, most of the time the application is waiting for the operating system to perform the I/O operation requested.

For a disk file there are several lower level operations that need to be performed such as: (1) finding the location of the file on the disk, (2) seeking to that location, and (3) reading or writing the requested amount of data. The operating system does all of these things for the application and while the operating system is doing these actions the application waits.

So in your multi-threading application, each thread is asking the operating system for disk I/O so each thread will be spending most of its time waiting for the disk I/O request to be done by the operating system, whether that disk I/O is reading from the original file or writing to a new split file.

Since this disk I/O will probably be the bounding action that will require the most time, the question is whether the time taken can be reduced.

A second approach

So an alternative architecture would be to have a single thread which only reads from the original file and reads in large chunks which are several times the size of the split file size. Then one or more other threads are used to take each chunk and create the output split file.

The single thread reading from the original file reads a split file chunk and gives that chunk to another thread. The other thread then creates the split file and writes the chunk out. While the other thread is performing that sub-task, the single thread reads the next chunk from the original file and uses a second thread to write that chunk out to a split file.

This approach should allow the disk I/O to be more efficient in that large chunks of the original file are being read into memory. The disk I/O with the original large file is done sequentially allowing the operating system to do disk seeks and disk I/O more efficiently.
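A rough sketch of this second architecture, assuming a BlockingQueue to hand chunks from the single reader thread to a pool of writer threads. The chunk size, queue capacity, and file naming are placeholders, and the record syntax needs Java 16 or later (a plain class works on older JDKs):

```java
import java.io.*;
import java.util.concurrent.*;

public class ChunkedSplit {
    record Chunk(int index, byte[] data) {}  // one split's worth of bytes

    public static void split(String src, String outDir, int chunkSize, int writers)
            throws IOException, InterruptedException {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(writers * 2);
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        // Writer tasks: take a chunk off the queue and write it to its own split file.
        for (int i = 0; i < writers; i++) {
            pool.submit(() -> {
                while (true) {
                    Chunk c = queue.take();
                    if (c.data().length == 0) break; // empty chunk = poison pill, stop
                    try (OutputStream out = new BufferedOutputStream(
                            new FileOutputStream(outDir + "/Mapper" + c.index() + ".txt"))) {
                        out.write(c.data());
                    }
                }
                return null;
            });
        }
        // Single reader: sequential I/O over the large original file.
        try (InputStream in = new BufferedInputStream(new FileInputStream(src))) {
            int index = 1, n;
            byte[] buf = new byte[chunkSize];
            while ((n = in.readNBytes(buf, 0, chunkSize)) > 0) {
                queue.put(new Chunk(index++, java.util.Arrays.copyOf(buf, n)));
            }
        }
        // One poison pill per writer so every writer thread exits.
        for (int i = 0; i < writers; i++) queue.put(new Chunk(-1, new byte[0]));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The bounded queue also acts as back-pressure: the reader blocks rather than buffering the whole 50 GB file in memory.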

In the first approach, access to the original file is effectively random, which requires the disk heads that read the data to be repositioned more often as each thread makes its read requests.

Final thoughts: test predictions by measuring

In order to determine which of these approaches is actually more efficient, you would need to try both. While you can make a prediction based on a model of how the operating system and the disk hardware work, until you have actually tried and measured both approaches you will not know whether one is superior to the other.

And in the end, the most efficient method may be to just have a single thread which reads large chunks of the original file and then writes out the smaller split files.

Possible benefits from multiple threads

On the other hand, if you have several threads which are handed large chunks of the file being split, some of the operating system overhead involved in creating, opening, and closing of files may be scheduled more efficiently with the multiple threads. Using multiple threads may allow the operating system's file management subsystem and the disk I/O routines to schedule the disk I/O more efficiently by choosing among multiple pending disk I/O requests.

Due to the overhead of creating and destroying threads, you would probably want to create a set of worker threads at application startup and then assign each split-file I/O task to one of them. When a thread finishes its assignment, it waits for another.
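A minimal sketch of that reuse pattern with Executors.newFixedThreadPool: the pool's threads are created once and reused for every submitted task. The task bodies here are trivial stand-ins for the actual split-file I/O.

```java
import java.util.*;
import java.util.concurrent.*;

public class WorkerPoolDemo {
    // Submit n "split" tasks to a pool created once; return how many completed.
    static int runSplitTasks(int n, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads); // created once at startup
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            final int splitIndex = i;
            // Stand-in for "write split file number splitIndex".
            results.add(pool.submit(() -> splitIndex));
        }
        int done = 0;
        for (Future<Integer> f : results) { f.get(); done++; } // wait for all tasks
        pool.shutdown();
        return done;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runSplitTasks(8, 4) + " splits handled");
    }
}
```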

Upvotes: 2

tucuxi

Reputation: 17945

You will not see any advantage from launching multiple threads (as noted by many in comments to the original question) to "split a file in parallel".

Having multiple threads working on parts of a large task in parallel can only speed things up if they act independently of each other. Since, in this case, the time-consuming part is reading 50 GB of file and writing it out as smaller files, and this is done not by Java but by the OS (and ultimately by the disk driver, which has to read and later write all those bytes), having multiple threads will only add a small overhead (for thread creation and scheduling), making everything a bit slower.

Additionally, sequential reads and writes in rotating disks (SSDs are exempted from this rule) are much faster than random reads and writes - if many threads are reading and writing from different parts of a disk, throughput will be considerably worse than if a single thread does everything.

Think about it this way: you have a truck driver (the OS + disk) and have to split a big heap of bricks at place A into smaller heaps of bricks at places C, D, and E; and bricks can only travel by truck. There is only that truck driver and you, the supervisor giving the orders. Would hiring more supervisors (threads) to give orders in parallel speed things up? No: you would just get in each other's way, and the truck driver, trying to please all of you, would need many more journeys carrying smaller loads of bricks to do the same job.

Upvotes: 1

shachar

Reputation: 639

You can use RandomAccessFile and its seek method to skip to a given position.

This way you can give each of your executors a start position and an end position, so each executor works on a small chunk of the file.
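A small sketch of that idea (the path, position, and length here are hypothetical; readFully reads exactly the requested bytes starting at the seeked position):

```java
import java.io.*;

public class SeekDemo {
    // Read `len` bytes starting at `pos` — the region one executor would handle.
    static byte[] readRegion(String path, long pos, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(pos);          // jump straight to this executor's start position
            byte[] buf = new byte[len];
            raf.readFully(buf);     // read exactly this executor's chunk
            return buf;
        }
    }
}
```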

But, as mentioned, your bottleneck will be disk I/O.

Upvotes: 1
