BeyondProgrammer

Reputation: 923

Java: fastest way to read through a text file with 2 million lines

Currently I am using Scanner/FileReader with a while (hasNextLine()) loop. I don't think this method is very efficient. Is there another way to read the file with similar functionality?

public void read(String file) {
    Scanner sc = null;

    try {
        sc = new Scanner(new FileReader(file));

        while (sc.hasNextLine()) {
            String text = sc.nextLine();
            String[] file_Array = text.split(" ", 3);

            if (file_Array[0].equalsIgnoreCase("case")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("object")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("classes")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("function")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("ignore")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("display")) {
                //do something
            }
        }

    } catch (FileNotFoundException e) {
        System.out.println("Input file " + file + " not found");
        System.exit(1);
    } finally {
        if (sc != null) {
            sc.close();
        }
    }
}

Upvotes: 42

Views: 129562

Answers (10)

Panagiotis Drakatos

Reputation: 3204

I am wondering why no one has mentioned MappedByteBuffer. I believe it is the most efficient way to read large files, up to 2 GB (the largest region a single mapping can cover).

Almost all projects require us to work with files. But what if the file is excessively large? If the heap fills up, the JVM throws an OutOfMemoryError. Java offers the MappedByteBuffer class (Java NIO), which makes it practical to work with sizable files.

MappedByteBuffer establishes a virtual-memory mapping of the file: its contents are mapped into virtual memory rather than copied onto the heap, and the JVM can read and write the mapped region without issuing a read/write system call for each access. We can also map a subset of a file rather than the entire file.

To map a file we obtain a FileChannel, which supports file manipulation, writing, and reading. A FileChannel is accessible via FileOutputStream (for writing), RandomAccessFile, and FileInputStream (for reading only).

To map a file, FileChannel provides the map() method, which takes three arguments:

  1. The map mode (PRIVATE, READ_ONLY, or READ_WRITE)

  2. The position (the offset in the file at which the mapping starts)

  3. The size (the number of bytes to map)

Once the MappedByteBuffer is obtained, its get() and put() methods can be used to read and write data, respectively.

The file is located in the /resources directory, so we can resolve its path with the following function:

Path getFileURIFromResources(String fileName) throws Exception {
    ClassLoader classLoader = getClass().getClassLoader();
    return Paths.get(classLoader.getResource(fileName).toURI());
}

This is how we read from MappedBuffer:

CharBuffer charBuffer = null;
Path pathToRead = getFileURIFromResources("fileToRead.txt");

try (FileChannel fileChannel = (FileChannel) Files.newByteChannel(
  pathToRead, EnumSet.of(StandardOpenOption.READ))) {
 
    MappedByteBuffer mappedByteBuffer = fileChannel
      .map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());

    if (mappedByteBuffer != null) {
        charBuffer = Charset.forName("UTF-8").decode(mappedByteBuffer);
    }
}

This is how we write:

CharBuffer charBuffer = CharBuffer
  .wrap("This will be written to the file");
Path pathToWrite = getFileURIFromResources("fileToWriteTo.txt");

try (FileChannel fileChannel = (FileChannel) Files
  .newByteChannel(pathToWrite, EnumSet.of(
    StandardOpenOption.READ, 
    StandardOpenOption.WRITE, 
    StandardOpenOption.TRUNCATE_EXISTING))) {
    
    MappedByteBuffer mappedByteBuffer = fileChannel
      .map(FileChannel.MapMode.READ_WRITE, 0, charBuffer.length());
    
    if (mappedByteBuffer != null) {
        mappedByteBuffer.put(
          Charset.forName("utf-8").encode(charBuffer));
    }
} 
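
To connect this to the question's line-oriented file: a minimal sketch (assuming the file fits in a single mapping and the decoded text fits on the heap) that decodes the mapped region and consumes it line by line:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLines {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("fileToRead.txt"), StandardOpenOption.READ)) {
            // map() is limited to roughly 2 GB per region, as noted above
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            CharBuffer chars = StandardCharsets.UTF_8.decode(mapped);
            // wrap the decoded text so it can be consumed line by line
            try (BufferedReader br = new BufferedReader(new StringReader(chars.toString()))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // process the line
                }
            }
        }
    }
}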

Upvotes: 1

arviarya

Reputation: 660

You can read the file in chunks if there are millions of records; that avoids potential memory issues. You need to keep track of the last position read so you can calculate the offset into the file:

try (FileReader reader = new FileReader(filePath);
     BufferedReader bufferedReader = new BufferedReader(reader)) {

    int pageOffset = lastOffset + counter;          // lastOffset and counter are tracked by the caller
    int skipRecords = (pageOffset - 1) * batchSize; // lines already consumed in earlier chunks

    bufferedReader.lines()
            .skip(skipRecords)
            .limit(batchSize)                       // read one chunk of batchSize lines
            .forEach(cline -> {
                // process (e.g. print) each line of the chunk
            });
}

Upvotes: 0

YAMM

Reputation: 602

I made a gist comparing different methods:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Function;

public class Main {

    public static void main(String[] args) {

        String path = "resources/testfile.txt";
        measureTime("BufferedReader.readLine() into LinkedList", Main::bufferReaderToLinkedList, path);
        measureTime("BufferedReader.readLine() into ArrayList", Main::bufferReaderToArrayList, path);
        measureTime("Files.readAllLines()", Main::readAllLines, path);
        measureTime("Scanner.nextLine() into ArrayList", Main::scannerArrayList, path);
        measureTime("Scanner.nextLine() into LinkedList", Main::scannerLinkedList, path);
        measureTime("RandomAccessFile.readLine() into ArrayList", Main::randomAccessFileArrayList, path);
        measureTime("RandomAccessFile.readLine() into LinkedList", Main::randomAccessFileLinkedList, path);
        System.out.println("-----------------------------------------------------------");
    }

    private static void measureTime(String name, Function<String, List<String>> fn, String path) {
        System.out.println("-----------------------------------------------------------");
        System.out.println("run: " + name);
        long startTime = System.nanoTime();
        List<String> l = fn.apply(path);
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("lines: " + l.size());
        System.out.println("estimatedTime: " + estimatedTime / 1_000_000_000.);
    }

    private static List<String> bufferReaderToLinkedList(String path) {
        return bufferReaderToList(path, new LinkedList<>());
    }

    private static List<String> bufferReaderToArrayList(String path) {
        return bufferReaderToList(path, new ArrayList<>());
    }

    private static List<String> bufferReaderToList(String path, List<String> list) {
        try {
            final BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                list.add(line);
            }
            in.close();
        } catch (final IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> readAllLines(String path) {
        try {
            return Files.readAllLines(Paths.get(path));
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    private static List<String> randomAccessFileLinkedList(String path) {
        return randomAccessFile(path, new LinkedList<>());
    }

    private static List<String> randomAccessFileArrayList(String path) {
        return randomAccessFile(path, new ArrayList<>());
    }

    private static List<String> randomAccessFile(String path, List<String> list) {
        try {
            RandomAccessFile file = new RandomAccessFile(path, "r");
            String str;
            while ((str = file.readLine()) != null) {
                list.add(str);
            }
            file.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> scannerLinkedList(String path) {
        return scanner(path, new LinkedList<>());
    }

    private static List<String> scannerArrayList(String path) {
        return scanner(path, new ArrayList<>());
    }

    private static List<String> scanner(String path, List<String> list) {
        try {
            Scanner scanner = new Scanner(new File(path));
            while (scanner.hasNextLine()) {
                list.add(scanner.nextLine());
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return list;
    }


}

run: BufferedReader.readLine() into LinkedList, lines: 1000000, estimatedTime: 0.105118655

run: BufferedReader.readLine() into ArrayList, lines: 1000000, estimatedTime: 0.072696934

run: Files.readAllLines(), lines: 1000000, estimatedTime: 0.087753316

run: Scanner.nextLine() into ArrayList, lines: 1000000, estimatedTime: 0.743121734

run: Scanner.nextLine() into LinkedList, lines: 1000000, estimatedTime: 0.867049885

run: RandomAccessFile.readLine() into ArrayList, lines: 1000000, estimatedTime: 11.413323046

run: RandomAccessFile.readLine() into LinkedList, lines: 1000000, estimatedTime: 11.423862897

BufferedReader is the fastest; Files.readAllLines() is also acceptable; Scanner is slow because of its regex parsing; RandomAccessFile is unacceptably slow.

Upvotes: 28

Digao

Reputation: 560

Just updating this thread: now we have Java 8 to do this job:

List<String> lines = Files.readAllLines(Paths.get(file_path);

Upvotes: 3

shamsAAzad

Reputation: 143

Scanner can't be as fast as BufferedReader because it parses the input with regular expressions. BufferedReader simply reads the file block by block:

BufferedReader bf = new BufferedReader(new FileReader("FileName"));

You can then call readLine() to read lines from bf.

Hope it serves your purpose.
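
For reference, a minimal sketch of the question's loop rewritten with BufferedReader (the branch bodies are placeholders; the switch over the lower-cased keyword replaces the equalsIgnoreCase chain, which is equivalent for these ASCII keywords):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Locale;

public class KeywordReader {
    public void read(String file) throws IOException {
        // try-with-resources closes the reader even if processing throws
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(" ", 3);
                switch (parts[0].toLowerCase(Locale.ROOT)) {
                    case "case":     /* do something */ break;
                    case "object":   /* do something */ break;
                    case "classes":  /* do something */ break;
                    case "function": /* do something */ break;
                    case "ignore":   /* do something */ break;
                    case "display":  /* do something */ break;
                }
            }
        }
    }
}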

Upvotes: 9

mac7

Reputation: 166

Use BufferedReader for high-performance file access. But the default buffer size of 8192 bytes is often too small; for huge files you can increase the buffer size by orders of magnitude to boost reading performance. For example:

String thisLine;
BufferedReader br = new BufferedReader(new FileReader("file.dat"), 1000 * 8192);
while ((thisLine = br.readLine()) != null) {
    System.out.println(thisLine);
}

Upvotes: 3

user207421

Reputation: 310840

You will find that BufferedReader.readLine() is as fast as you need: you can read millions of lines a second with it. It is more probable that your string splitting and handling is causing whatever performance problems you are encountering.
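
If profiling does point at the splitting, one cheap change is to extract only the leading keyword instead of building the full array; a minimal sketch (firstToken is a hypothetical helper):

// Hypothetical helper: returns the text before the first space without
// allocating a String[] for parts that are not needed for dispatch.
static String firstToken(String line) {
    int space = line.indexOf(' ');
    return space < 0 ? line : line.substring(0, space);
}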

Upvotes: 46

Pratik Shelar

Reputation: 3214

If you wish to read all lines at once, have a look at the Files API introduced in Java 7. It's really simple to use.

But a better approach would be to process this file in batches: have a reader that reads chunks of lines from the file and a writer that does the required processing or persists the data. Batching ensures the code will still work even if the line count grows to a billion in the future. You can also use multithreading within the batch to increase its overall performance. I would recommend having a look at Spring Batch.
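
A minimal sketch of that batching idea in plain Java (processBatch and the batch size are assumptions; Spring Batch layers restartability and transactions on top of the same pattern):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchReader {
    static final int BATCH_SIZE = 10_000; // tune to your memory budget

    public static void main(String[] args) throws IOException {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch); // hypothetical handler
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch); // flush the final partial batch
            }
        }
    }

    static void processBatch(List<String> lines) {
        // process or persist the chunk here
    }
}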

Upvotes: -2

Trying

Reputation: 14278

You can use FileChannel and ByteBuffer from Java NIO. In my experience, the ByteBuffer size is the most critical factor in reading data quickly. The code below reads the content of the file:

public static void main(String[] args) throws Exception {
    FileInputStream fileInputStream = new FileInputStream(new File("sample4.txt"));
    FileChannel fileChannel = fileInputStream.getChannel();
    ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

    // keep reading buffer-sized chunks until end-of-file
    while (fileChannel.read(byteBuffer) > 0) {
        byteBuffer.flip();
        while (byteBuffer.hasRemaining()) {
            System.out.print((char) byteBuffer.get());
        }
        byteBuffer.clear(); // make the buffer ready for the next read
    }

    fileChannel.close();
}

You can check for '\n' here to detect line breaks, as sketched below. Thanks.
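
A sketch of that check, keeping the loop's single-byte-charset assumption in the char cast:

StringBuilder current = new StringBuilder();
while (byteBuffer.hasRemaining()) {
    char c = (char) byteBuffer.get();
    if (c == '\n') {
        // a complete line is now in current
        current.setLength(0);
    } else {
        current.append(c);
    }
}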


You can even use scatter/gather to read into several buffers at once, i.e.

fileChannel.read(buffers);

where

      ByteBuffer b1 = ByteBuffer.allocate(B1);
      ByteBuffer b2 = ByteBuffer.allocate(B2);
      ByteBuffer b3 = ByteBuffer.allocate(B3);

      ByteBuffer[] buffers = {b1, b2, b3};

This saves the user process from making several system calls (which can be expensive) and allows the kernel to optimize handling of the data, because it has information about the total transfer. If multiple CPUs are available, it may even be possible to fill and drain several buffers simultaneously.

From this book.

Upvotes: 5

lol

Reputation: 3390

You must investigate which part of the program is taking the time.

As per EJP's answer, you should use BufferedReader.

If the string processing really is what takes the time, consider using threads: one thread reads lines from the file and queues them, and several worker threads dequeue and process them. You will need to experiment with the thread count; matching it to the number of CPU cores is a good starting point for using the full CPU.
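
A minimal sketch of that design, using a BlockingQueue and a poison-pill sentinel to stop the workers (queue capacity, thread count, and file name are assumptions to tune):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReader {
    // unique sentinel object; compared by identity, so it cannot collide with file content
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // worker threads dequeue lines and process them
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        // process the line here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }

        // this thread reads the file and enqueues lines
        try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one pill per worker so each one shuts down
        }
        for (Thread t : pool) {
            t.join();
        }
    }
}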

Upvotes: 0
