Reputation: 523
I'm trying to read a large text file as fast as possible.
So this is my code:
BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\Documents\\ais_messages1.3.txt"));
String line, aisLines = "", cvsSplitBy = ",";
try {
    while ((line = br.readLine()) != null) {
        if (line.charAt(0) == '!') {
            String[] cols = line.split(cvsSplitBy);
            if (cols.length >= 8) {
                line = "";
                for (int i = 0; i < cols.length - 1; i++) {
                    if (i == cols.length - 2) {
                        line = line + cols[i];
                    } else {
                        line = line + cols[i] + ",";
                    }
                }
                aisLines += line + "\n";
            } else {
                aisLines += line + "\n";
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
So right now it reads 36890 rows in 14 seconds. I also tried an InputStreamReader:
InputStreamReader isr = new InputStreamReader(new FileInputStream("C:\\Users\\Documents\\ais_messages1.3.txt"));
BufferedReader br = new BufferedReader(isr);
and it took the same amount of time. Is there a faster way to read a large text file (100,000 or 1,000,000 rows)?
Upvotes: 1
Views: 2672
Reputation: 1068
Since the most time-consuming operation is I/O, the most efficient approach is to split the reading and the parsing between threads:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

private static void readFast(String filePath) throws IOException, InterruptedException {
    ExecutorService executor = Executors.newWorkStealingPool();
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    // Synchronized list so the worker threads can add results concurrently.
    // Note: with a work-stealing pool, results are not guaranteed to be in file order.
    List<String> parsed = Collections.synchronizedList(new ArrayList<>());
    try {
        String line;
        while ((line = br.readLine()) != null) {
            final String l = line;
            // The reader thread only reads; parsing is handed off to the pool.
            executor.submit(() -> {
                if (l.charAt(0) == '!') {
                    parsed.add(parse(l)); // parse(...) is your line-parsing logic
                }
            });
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    executor.shutdown();
    executor.awaitTermination(1000, TimeUnit.MINUTES);
    String result = parsed.stream().collect(Collectors.joining("\n"));
}
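For example, you would call it with the path from the question:

readFast("C:\\Users\\Documents\\ais_messages1.3.txt");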
On my PC this took 386 ms, versus 10787 ms for the single-threaded version.
Upvotes: 1
Reputation: 31885
You can use a single thread to read your large CSV file and multiple threads to parse the lines. The way I do this is with the producer-consumer pattern and a BlockingQueue.
Producer
Make one producer thread that is only responsible for reading the lines of your CSV file and storing them in the BlockingQueue. The producer side does nothing else.
Consumers
Make multiple consumer threads and pass the same BlockingQueue object into all of them. Then implement the time-consuming work in your consumer thread class.
The following code gives you an idea of how to solve the problem; it is not the solution. I implemented this in Python and it works much faster than having a single thread do everything. The language is not Java, but the theory behind it is the same. (A Java sketch of the same pattern follows the Python code.)
import csv
import gzip
import sys
import multiprocessing
import Queue

QUEUE_SIZE = 2000

def produce(file_queue, row_queue):
    while not file_queue.empty():
        src_file = file_queue.get()
        zip_reader = gzip.open(src_file, 'rb')
        try:
            # SDP_DELIMITER and process_sdp_row are defined elsewhere
            csv_reader = csv.reader(zip_reader, delimiter=SDP_DELIMITER)
            for row in csv_reader:
                new_row = process_sdp_row(row)
                if new_row:
                    row_queue.put(new_row)
        finally:
            zip_reader.close()

def consume(row_queue):
    '''processes all rows; once the queue is empty, break the infinite loop'''
    while True:
        try:
            # take a row from the queue and process it
            row = row_queue.get(timeout=10)
            # ... do the time-consuming work on `row` here ...
        except multiprocessing.TimeoutError as toe:
            print "timeout, all rows have been processed, quit."
            break
        except Queue.Empty:
            print "all rows have been processed, quit."
            break
        except Exception as e:
            print "critical error"
            print e
            break

def main(args):
    file_queue = multiprocessing.Queue()
    row_queue = multiprocessing.Queue(QUEUE_SIZE)

    # file1, file2, file3: paths to your gzipped CSV files (defined elsewhere)
    file_queue.put(file1)
    file_queue.put(file2)
    file_queue.put(file3)

    # starts 3 producers
    for i in xrange(3):
        producer = multiprocessing.Process(target=produce,
                                           args=(file_queue, row_queue))
        producer.start()

    # starts 1 consumer
    consumer = multiprocessing.Process(target=consume,
                                       args=(row_queue,))
    consumer.start()

    # blocks main thread until consumer process finished
    consumer.join()

    # prints statistics results after consumer is done
    sys.exit(0)

if __name__ == "__main__":
    main(sys.argv[1:])
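Since the question is in Java, here is a minimal Java sketch of the same producer-consumer idea. The queue size, NUM_CONSUMERS, the poison-pill sentinel, and taking the file path from args[0] are illustrative assumptions, not part of the answer above; the "parsing" step just keeps the '!' lines, as in the question.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerRead {
    // Sentinel ("poison pill") that tells a consumer to stop.
    private static final String POISON_PILL = "\u0000EOF\u0000";
    private static final int NUM_CONSUMERS = 4;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2000);
        List<String> parsed = Collections.synchronizedList(new ArrayList<>());

        // Producer: only reads lines and puts them on the queue.
        Thread producer = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);
                }
                // One pill per consumer so every consumer terminates.
                for (int i = 0; i < NUM_CONSUMERS; i++) {
                    queue.put(POISON_PILL);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumers: take lines off the queue and do the parsing work.
        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < NUM_CONSUMERS; i++) {
            Thread consumer = new Thread(() -> {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON_PILL)) {
                        if (!line.isEmpty() && line.charAt(0) == '!') {
                            parsed.add(line); // replace with real parsing
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumer.start();
            consumers.add(consumer);
        }

        producer.join();
        for (Thread consumer : consumers) {
            consumer.join();
        }
        System.out.println(parsed.size() + " lines parsed");
    }
}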
Upvotes: 0
Reputation: 8783
Stop trying to build up aisLines as a big String. Use an ArrayList<String> that you append the lines onto. That takes 0.6% of the time of your method on my machine. (This code processes 1,000,000 simple lines in 0.75 seconds.) It will also reduce the effort needed to process the data later, as it'll already be split up by lines.
BufferedReader br = new BufferedReader(new FileReader("data.txt"));
List<String> aisLines = new ArrayList<String>();
String line, cvsSplitBy = ",";
try {
    while ((line = br.readLine()) != null) {
        if (line.charAt(0) == '!') {
            String[] cols = line.split(cvsSplitBy);
            if (cols.length >= 8) {
                // rebuild the line, keeping every column except the last one
                line = "";
                for (int i = 0; i < cols.length - 1; i++) {
                    if (i == cols.length - 2) {
                        line = line + cols[i];
                    } else {
                        line = line + cols[i] + ",";
                    }
                }
                aisLines.add(line);
            } else {
                aisLines.add(line);
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
If you really want one big String at the end (because you're interfacing with someone else's code, or whatever), it'll still be faster to convert the ArrayList back into a single String than to do what you were doing.
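For example, that conversion is a one-liner (String.join has been available since Java 8):

String aisString = String.join("\n", aisLines);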
Upvotes: 3