user1142292

Reputation: 111

Reading and processing a big 25 GB text file

I have to read a big text file of roughly 25 GB and process it within 15-20 minutes. The file contains multiple header and footer sections.

I tried csplit to split the file into separate files based on the headers, but that takes around 24 to 25 minutes, which is not acceptable at all.

I tried sequential reading and writing using BufferedReader and BufferedWriter along with FileReader and FileWriter. It takes more than 27 minutes. Again, not acceptable.

I tried another approach: find the start offset of each header and then run multiple threads, each reading the file from its own offset using RandomAccessFile. But no luck with this (a simplified sketch of the pattern is shown below).
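For reference, a simplified sketch of that multi-threaded approach; the SectionWorker name, the 1 MB chunk size, and the process hook are illustrative placeholders, not the actual code:

import java.io.IOException;
import java.io.RandomAccessFile;

// One worker per header section, each reading its own slice of the file.
class SectionWorker implements Runnable {
    private final String fileName;  // path to the big file
    private final long start, end;  // byte offsets of one section

    SectionWorker(String fileName, long start, long end) {
        this.fileName = fileName;
        this.start = start;
        this.end = end;
    }

    @Override
    public void run() {
        // Each thread needs its own RandomAccessFile; sharing one instance
        // would make the seeks race with each other.
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(start);
            byte[] buffer = new byte[1 << 20]; // 1 MB chunks
            long remaining = end - start;
            while (remaining > 0) {
                int len = raf.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (len < 0) break;
                remaining -= len;
                // process(buffer, len); // section-specific processing would go here
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Note that on a spinning disk, several threads seeking concurrently can easily be slower than a single sequential reader, which may explain the lack of improvement.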

How can I meet this requirement?

Possible duplicate of:

Read large files in Java

Upvotes: 11

Views: 42656

Answers (4)

xikkub

Reputation: 1660

Try using a larger buffer read size (for example, 20 MB instead of 2 MB) to process your data more quickly. Also, don't use a BufferedReader, because of its slow speeds and character conversions.
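A minimal sketch of that advice, assuming you scan the raw bytes for your header/footer markers yourself; the class name and the 20 MB figure are illustrative:

import java.io.FileInputStream;
import java.io.IOException;

public class BigBufferRead {
    public static void main(String[] args) throws IOException {
        // One large reusable buffer; no character decoding happens anywhere.
        byte[] buffer = new byte[20 * 1024 * 1024]; // 20 MB, as suggested above
        try (FileInputStream in = new FileInputStream(args[0])) {
            int len;
            while ((len = in.read(buffer)) > 0) {
                // Scan buffer[0..len) for markers as raw bytes, e.g. compare
                // against "HEADER".getBytes() rather than decoding to String.
            }
        }
    }
}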

This question has been asked before: Read large files in Java

Upvotes: 10

0xCAFEBABE

Reputation: 5666

If the platform is right, you might want to shell out and call a combination of cat and sed. If it is not, you might still want to shell out and use perl via the command line. For the case where it absolutely has to be Java doing the actual processing, the others have provided sufficient answers.

Be on your guard though, shelling out is not without problems. But perl or sed might be the only widely available tools to crawl through and alter 25GB of text in your timeframe.
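If you do shell out from Java, ProcessBuilder is the standard route. A sketch; the sed invocation and the "^FOOTER" pattern are placeholders for whatever transformation you actually need (and note that -i is a GNU sed extension):

import java.io.IOException;

public class ShellOut {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder example: strip footer lines in place with sed.
        ProcessBuilder pb = new ProcessBuilder("sed", "-i", "/^FOOTER/d", "bigfile.txt");
        pb.inheritIO(); // surface sed's output and errors on our console
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("sed exited with status " + exit);
        }
    }
}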

Upvotes: 1

Peter Lawrey

Reputation: 533820

You need to ensure that the IO is fast enough without your processing, because I suspect the processing, not the IO, is slowing you down. You should be able to get 80 MB/s from a hard drive and up to 400 MB/s from an SSD. At 400 MB/s you could read the entire 25 GB file in about a minute (and in around five minutes at 80 MB/s).

Try the following, which is not the fastest, but the simplest.

import java.io.FileInputStream;

// Measure raw read throughput: read the whole file and discard the data.
long start = System.nanoTime();
byte[] bytes = new byte[32 * 1024];
try (FileInputStream fis = new FileInputStream(fileName)) {
    int len;
    while ((len = fis.read(bytes)) > 0); // empty body on purpose: IO cost only
}
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds%n", time / 1e9);

Unless you find you are getting at least 50 MB/s you have a hardware problem.

Upvotes: 7

Has QUIT--Anony-Mousse

Reputation: 77495

Try using java.nio to make better use of the operating system's functionality. Avoid copying the data (e.g. into a String); work with offsets instead. I believe the java.nio classes even have methods to transfer data from one channel to another without pulling the data into the Java layer at all (FileChannel.transferTo, at least on Linux); that essentially translates into operating system calls.

For many modern web servers this technique has been key to the performance with which they can serve static data: they essentially delegate as much as possible to the operating system to avoid duplicating the data in main memory.

Let me emphasize this: just seeking through a 25 GB byte buffer is a lot faster than converting it into Java Strings (which may require charset encoding/decoding and copying). Anything that saves you copies and memory management will help.
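A sketch of what scanning a memory-mapped file for header markers could look like; the "HEADER" marker and class name are assumptions, and note that a single MappedByteBuffer covers at most about 2 GB, so a 25 GB file has to be mapped in windows:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        byte[] marker = "HEADER".getBytes(); // placeholder section marker
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long window = Integer.MAX_VALUE; // map() is limited to ~2 GB per call
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                // The OS pages the data in; nothing is copied into Java objects.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                for (int i = 0; i + marker.length <= buf.limit(); i++) {
                    boolean hit = true;
                    for (int j = 0; j < marker.length; j++) {
                        if (buf.get(i + j) != marker[j]) { hit = false; break; }
                    }
                    if (hit) System.out.println("header at offset " + (pos + i));
                }
            }
            // Simplification: a marker straddling a window boundary is missed;
            // real code would overlap consecutive windows by marker.length - 1.
        }
    }
}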

Upvotes: 1
