guy_sensei

Reputation: 523

Reading a large text file faster

I'm trying to read a large text file as fast as possible.

Here is my current code:

BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\Documents\\ais_messages1.3.txt")); 
String line, aisLines="", cvsSplitBy = ",";
try {
   while ((line = br.readLine()) != null) {
      if(line.charAt(0) == '!') {
         String[] cols = line.split(cvsSplitBy);
         if(cols.length>=8) {
            line = ""; 
            for(int i=0; i<cols.length-1; i++) {
               if(i == cols.length-2) {
                  line = line + cols[i]; 
               } else {
                  line = line + cols[i] + ","; 
               } 
            }
            aisLines += line + "\n";
         } else {
            aisLines += line + "\n"; 
         }
      }
   }
} catch (IOException e) {
   e.printStackTrace();
}

Right now it reads 36,890 rows in 14 seconds. I also tried an InputStreamReader:

InputStreamReader isr = new InputStreamReader(new FileInputStream("C:\\Users\\Documents\\ais_messages1.3.txt"));
BufferedReader br = new BufferedReader(isr);

and it took the same amount of time. Is there a faster way to read a large text file (100,000 or 1,000,000 rows)?

Upvotes: 1

Views: 2672

Answers (3)

nesteant

Reputation: 1068

Since the most expensive operation is I/O, the most efficient approach is to split reading and parsing across threads:

private static void readFast(String filePath) throws IOException, InterruptedException {
    ExecutorService executor = Executors.newWorkStealingPool();
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    List<String> parsed = Collections.synchronizedList(new ArrayList<>());
    try {
        String line;
        while ((line = br.readLine()) != null) {
            final String l = line;
            executor.submit(() -> {
                // guard against empty lines before calling charAt(0)
                if (!l.isEmpty() && l.charAt(0) == '!') {
                    parsed.add(parse(l));
                }
            });
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    executor.shutdown();
    executor.awaitTermination(1000, TimeUnit.MINUTES);

    String result = parsed.stream().collect(Collectors.joining("\n"));
}

On my PC this took 386 ms, versus 10787 ms for the original single-threaded version.

Upvotes: 1

Haifeng Zhang

Reputation: 31885

You can use a single thread to read your large CSV file and multiple threads to parse the lines. The way I do this is with the Producer-Consumer pattern and a BlockingQueue.

Producer

Make one producer thread that is only responsible for reading lines from your CSV file and putting them into the BlockingQueue. The producer side does nothing else.

Consumers

Make multiple consumer threads and pass the same BlockingQueue object into each of them. Implement the time-consuming work in your consumer thread class.
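Since the question is about Java, here is a minimal sketch of the pattern described above using a LinkedBlockingQueue. Note the class name, the sample input, and the poison-pill sentinel are my own placeholders, not code from the asker's project; a real consumer would do the CSV splitting where the comment indicates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ParallelLineParser {
    // Sentinel value that tells a consumer thread to stop.
    private static final String POISON_PILL = "\u0000EOF";

    // One producer fills the queue from the reader; nConsumers threads drain it.
    static List<String> readParallel(BufferedReader br, int nConsumers) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(2000);
        List<String> parsed = Collections.synchronizedList(new ArrayList<>());

        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);            // blocks while the queue is full
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            } finally {
                // One pill per consumer so every consumer shuts down.
                for (int i = 0; i < nConsumers; i++) {
                    try { queue.put(POISON_PILL); } catch (InterruptedException ignored) {}
                }
            }
        });

        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < nConsumers; i++) {
            Thread c = new Thread(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON_PILL)) {
                        if (!line.isEmpty() && line.charAt(0) == '!') {
                            parsed.add(line);   // real code would parse the CSV here
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            c.start();
            consumers.add(c);
        }

        producer.start();
        producer.join();
        for (Thread c : consumers) c.join();
        return parsed;
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedReader br = new BufferedReader(new StringReader("!a,b\nskip\n!c,d\n"));
        System.out.println(readParallel(br, 4).size()); // prints 2
    }
}
```

The bounded queue (capacity 2000) is what provides back-pressure: if the consumers fall behind, the producer blocks instead of filling up the heap.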

The following code illustrates the idea rather than being a drop-in solution. I implemented this in Python, and it ran much faster than having a single thread do everything. The language is not Java, but the theory behind it is the same.

import csv
import gzip
import multiprocessing
import sys
import Queue

QUEUE_SIZE = 2000


def produce(file_queue, row_queue):

    while not file_queue.empty():
        src_file = file_queue.get()
        zip_reader = gzip.open(src_file, 'rb')

        try:
            csv_reader = csv.reader(zip_reader, delimiter=SDP_DELIMITER)

            for row in csv_reader:
                new_row = process_sdp_row(row)
                if new_row:
                    row_queue.put(new_row)
        finally:
            zip_reader.close()


def consume(row_queue):
    '''processes all rows; once the queue is empty, break the infinite loop'''
    while True:
        try:
            # takes a row from queue and process it
            pass
        except multiprocessing.TimeoutError as toe:
            print "timeout, all rows have been processed, quit."
            break
        except Queue.Empty:
            print "all rows have been processed, quit."
            break
        except Exception as e:
            print "critical error"
            print e
            break


def main(args):

    file_queue = multiprocessing.Queue()
    row_queue = multiprocessing.Queue(QUEUE_SIZE)

    file_queue.put(file1)
    file_queue.put(file2)
    file_queue.put(file3)

    # starts 4 producers, which share the file queue
    for i in xrange(4):
        producer = multiprocessing.Process(target=produce,args=(file_queue,row_queue))
        producer.start()

    # starts 1 consumer
    consumer = multiprocessing.Process(target=consume,args=(row_queue,))
    consumer.start()

    # blocks main thread until consumer process finished
    consumer.join()

    # prints statistics results after consumer is done

    sys.exit(0)


if __name__ == "__main__":
    main(sys.argv[1:])

Upvotes: 0

Jay Kominek

Reputation: 8783

Stop trying to build up aisLines as one big String. Use an ArrayList<String> that you append the lines to. That takes 0.6% of the time your method does on my machine. (This code processes 1,000,000 simple lines in 0.75 seconds.) It will also reduce the effort needed to process the data later, since it will already be split up by lines.

BufferedReader br = new BufferedReader(new FileReader("data.txt"));
List<String> aisLines = new ArrayList<String>();
String line, cvsSplitBy = ",";
try {
    while ((line = br.readLine()) != null) {
        if(line.charAt(0) == '!') {
            String[] cols = line.split(cvsSplitBy);
            if(cols.length>=8) {
                line = "";
                for(int i=0; i<cols.length-1; i++) {
                    if(i == cols.length-2) {
                        line = line + cols[i];
                    } else {
                        line = line + cols[i] + ",";
                    }
                }
                aisLines.add(line);
            } else {
                aisLines.add(line);
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

If you really want a big String at the end (because you're interfacing with someone else's code, or whatever), it'll still be faster to convert the ArrayList back into a single string, than to do what you were doing.
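For instance, that final conversion can be a single `String.join` call, which builds the result in one pass instead of re-copying the string on every `+=` (the sample lines below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        // Stand-in for the aisLines list collected while reading the file.
        List<String> aisLines = Arrays.asList("!AIVDM,1,1,,A", "!AIVDM,1,1,,B");

        // One allocation-efficient pass, unlike += concatenation in a loop.
        String result = String.join("\n", aisLines);
        System.out.println(result);
    }
}
```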

Upvotes: 3
