JaskeyLam

Reputation: 15785

Java: why is reading from a MappedByteBuffer slower than reading from a BufferedReader?

I am trying to read lines from a file which may be large.

To get better performance, I tried to use a memory-mapped file. But when I compare the two, I find that the mapped-file approach is even a little slower than reading with a BufferedReader:

public long chunkMappedFile(String filePath, int trunkSize) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, mapped file [{}], trunk size = {} ", filePath, trunkSize);

    //Create file object
    File file = new File(filePath);

    //Get file channel in readonly mode
    FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();

    long positionStart = 0;
    StringBuilder line = new StringBuilder();
    long lineCnt = 0;
    while(positionStart < fileChannel.size()) {
        long mapSize = positionStart + trunkSize < fileChannel.size() ? trunkSize : fileChannel.size() - positionStart;
        MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, positionStart, mapSize);//mapped read
        for (int i = 0; i < buffer.limit(); i++) {
            char c = (char) buffer.get();
            //System.out.print(c); //Print the content of file
            if ('\n' != c) {
                line.append(c);
            } else {// line ends
                processor.processLine(line.toString());
                if (++lineCnt % 100000 ==0) {
                    try {
                        logger.info("mappedfile processed {} lines already, sleep 1ms", lineCnt);
                        Thread.sleep(1);
                    } catch (InterruptedException e) {}
                }
                line = new StringBuilder();
            }
        }
        closeDirectBuffer(buffer);
        positionStart = positionStart + buffer.limit();
    }
    fileChannel.close();

    long end = System.currentTimeMillis();
    logger.info("chunkMappedFile {} , trunkSize: {},  cost : {}  " ,filePath, trunkSize, end - begin);

    return lineCnt;
}

public long normalFileRead(String filePath) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, Normal read file [{}] ", filePath);
    long lineCnt = 0;
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;

        while ((line = br.readLine()) != null) {
            processor.processLine(line);
            if (++lineCnt % 100000 ==0) {
                try {
                    logger.info("file processed {} lines already, sleep 1ms", lineCnt);
                    Thread.sleep(1);
                } catch (InterruptedException e) {}
            }
        }
    }
    long end = System.currentTimeMillis();
    logger.info("normalFileRead {} ,   cost : {}  " ,filePath, end - begin);

    return lineCnt;
}

Test results on Linux, reading a file whose size is 537 MB:

MappedByteBuffer way:

2017-09-28 14:33:19.277 [main] INFO  com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :14804 , lines per seconds: 861852.0670089165

BufferedReader way:

2017-09-28 14:27:03.374 [main] INFO  com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :13001 , lines per seconds: 981375.1249903854

Upvotes: 1

Views: 2023

Answers (3)

GhostCat

Reputation: 140613

That is the thing: file I/O isn't straightforward and easy.

You have to keep in mind that your operating system has a huge impact on what exactly is going to happen. In that sense: there are no solid rules that would work for all JVM implementations on all platforms.

When you really have to worry about the last bit of performance, doing in-depth profiling on your target platform is the primary solution.

Beyond that, you are getting the "performance" aspect wrong. Meaning: memory-mapped I/O doesn't magically increase the performance of reading a single file once within an application. Its major advantage lies elsewhere:

mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.

(quoted from this answer on using the C mmap() system call)

In other words: your example is about reading a file's contents. In the end, the OS still has to turn to the drive to read all the bytes from there. Meaning: it reads disk content and puts it in memory. When you do that the first time ... it really doesn't matter that you do some "special" things on top of that. On the contrary: because of the overhead compared to an "ordinary" read, doing "special" things might even make the memory-mapped approach slower.

And coming back to my first point: even if you had 5 processes reading the same file, the memory-mapped approach isn't necessarily faster. Linux might figure: I already read that file into memory, and it didn't change - so even without explicit "memory mapping" the Linux kernel may cache that data.

Upvotes: 5

Stephen C

Reputation: 719446

GhostCat is correct. And in addition to your OS choice, other things can affect performance:

  • Mapping a file will place greater demand on physical memory. If physical memory is "tight" that could cause paging activity, and a performance hit.

  • The OS could use a different read-ahead strategy if you read a file using read syscalls versus mapping it into memory. Read-ahead (into the buffer cache) can make file reading a lot faster.

  • The default buffer size for BufferedReader and the OS memory page size are likely to be different. This may result in the size of disk read requests being different. (Larger reads often result in greater I/O throughput, at least up to a certain point; a short sketch of setting the buffer size explicitly follows this list.)
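
As an illustration of that last point, BufferedReader accepts an explicit buffer size. This is a hypothetical sketch, not the questioner's code; the class name and the 1 MiB figure are arbitrary assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class BufferedReadWithLargerBuffer {
        // Sketch only: a BufferedReader with an explicit 1 MiB buffer (an arbitrary
        // illustrative value) instead of the default 8 KiB.
        public static long countLines(String filePath) throws IOException {
            long lineCnt = 0;
            try (BufferedReader br = new BufferedReader(new FileReader(filePath), 1 << 20)) {
                while (br.readLine() != null) {
                    lineCnt++;
                }
            }
            return lineCnt;
        }
    }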

There could also be "artefacts" caused by the way that you benchmark. For example:

  • The first time you read a file, a copy of some or all of the file will land in the buffer cache (in memory)
  • The second time you read the same file, parts of it may still be in memory, and the apparent read time will be shorter (a small timing sketch follows this list).
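
To make the caching artefact visible, here is a minimal sketch (class and method names are made up for this example) that reads the same file twice and compares the timings; the second, cache-warm pass will typically be noticeably faster:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class CacheWarmingDemo {
        public static void main(String[] args) throws IOException {
            String path = args[0];
            long cold = timeRead(path); // first pass: may have to go to the disk
            long warm = timeRead(path); // second pass: likely served from the OS buffer cache
            System.out.printf("cold read: %d ms, warm read: %d ms%n", cold, warm);
        }

        private static long timeRead(String path) throws IOException {
            long begin = System.currentTimeMillis();
            try (BufferedReader br = new BufferedReader(new FileReader(path))) {
                while (br.readLine() != null) { /* discard, we only care about timing */ }
            }
            return System.currentTimeMillis() - begin;
        }
    }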

Upvotes: 2

Kayaman

Reputation: 73568

The memory mapping doesn't really give any advantage, since even though you're bulk loading a file into memory, you're still processing it one byte at a time. You might see a performance increase if you processed the buffer in suitably sized byte[] chunks. Even then the BufferedReader version may perform better or at least almost the same.
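
As a rough sketch of what that could look like (the 64 KiB chunk size, the class name and the processChunk helper are assumptions for illustration, not part of the question's code):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class BulkMappedRead {
        // Sketch only: drain the mapped buffer with bulk get() calls into a byte[]
        // instead of one byte at a time. Mapping the whole file at once assumes it
        // fits into a single mapping (< 2 GB), which holds for the 537 MB test file.
        public static void readInChunks(String filePath) throws IOException {
            try (FileChannel channel = new RandomAccessFile(filePath, "r").getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] chunk = new byte[64 * 1024];
                while (buffer.hasRemaining()) {
                    int n = Math.min(chunk.length, buffer.remaining());
                    buffer.get(chunk, 0, n);  // bulk copy out of the mapping
                    processChunk(chunk, n);   // hypothetical: split into lines and hand them on
                }
            }
        }

        private static void processChunk(byte[] data, int length) {
            // placeholder: scan data[0..length) for '\n' and pass complete lines to a processor,
            // carrying any trailing partial line over to the next chunk
        }
    }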

The nature of your task is to process a file sequentially. BufferedReader already does this very well and the code is simple, so if I had to choose I'd go with the simplest option.

Also note that your buffer code doesn't work except for single-byte encodings. As soon as you get multiple bytes per character, it will fail magnificently.
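
If you do stay with the mapped buffer, one way to handle multi-byte encodings is to decode the bytes through a real Charset rather than casting each byte to char. A minimal sketch, assuming UTF-8 content and a file small enough to decode in one go (for streaming you would loop with a CharsetDecoder instead):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.CharBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class DecodedMappedRead {
        // Sketch only: decode the mapped bytes with a charset so that multi-byte
        // characters (e.g. UTF-8) come out intact, unlike a per-byte (char) cast.
        public static CharBuffer readDecoded(String filePath) throws IOException {
            try (FileChannel channel = new RandomAccessFile(filePath, "r").getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // Decoding the whole mapping at once allocates a CharBuffer of comparable size.
                return StandardCharsets.UTF_8.decode(buffer);
            }
        }
    }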

Upvotes: 3
