Reputation: 15785
I am trying to read lines from a file that may be large.
To get better performance, I tried using a memory-mapped file. But when I compare the two, the mapped-file approach is even a little slower than reading with BufferedReader.
    public long chunkMappedFile(String filePath, int trunkSize) throws IOException {
        long begin = System.currentTimeMillis();
        logger.info("Processing imei file, mapped file [{}], trunk size = {} ", filePath, trunkSize);
        //Create file object
        File file = new File(filePath);
        //Get file channel in readonly mode
        FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();
        long positionStart = 0;
        StringBuilder line = new StringBuilder();
        long lineCnt = 0;
        while (positionStart < fileChannel.size()) {
            long mapSize = positionStart + trunkSize < fileChannel.size() ? trunkSize : fileChannel.size() - positionStart;
            MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, positionStart, mapSize); //mapped read
            for (int i = 0; i < buffer.limit(); i++) {
                char c = (char) buffer.get();
                //System.out.print(c); //Print the content of file
                if ('\n' != c) {
                    line.append(c);
                } else { // line ends
                    processor.processLine(line.toString());
                    if (++lineCnt % 100000 == 0) {
                        try {
                            logger.info("mappedfile processed {} lines already, sleep 1ms", lineCnt);
                            Thread.sleep(1);
                        } catch (InterruptedException e) {}
                    }
                    line = new StringBuilder();
                }
            }
            closeDirectBuffer(buffer);
            positionStart = positionStart + buffer.limit();
        }
        long end = System.currentTimeMillis();
        logger.info("chunkMappedFile {} , trunkSize: {}, cost : {} ", filePath, trunkSize, end - begin);
        return lineCnt;
    }

    public long normalFileRead(String filePath) throws IOException {
        long begin = System.currentTimeMillis();
        logger.info("Processing imei file, Normal read file [{}] ", filePath);
        long lineCnt = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                processor.processLine(line.toString());
                if (++lineCnt % 100000 == 0) {
                    try {
                        logger.info("file processed {} lines already, sleep 1ms", lineCnt);
                        Thread.sleep(1);
                    } catch (InterruptedException e) {}
                }
            }
        }
        long end = System.currentTimeMillis();
        logger.info("normalFileRead {} , cost : {} ", filePath, end - begin);
        return lineCnt;
    }
Test results on Linux, reading a 537 MB file:
MappedBuffer way:
2017-09-28 14:33:19.277 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :14804 , lines per seconds: 861852.0670089165
BufferedReader way:
2017-09-28 14:27:03.374 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :13001 , lines per seconds: 981375.1249903854
Upvotes: 1
Views: 2023
Reputation: 140613
That is the thing: file I/O isn't straightforward or easy.
You have to keep in mind that your operating system has a huge impact on what exactly is going to happen. In that sense: there are no solid rules that would work for all JVM implementations on all platforms.
When you really have to worry about the last bit of performance, doing in-depth profiling on your target platform is the primary solution.
Beyond that, you are getting the "performance" aspect wrong. Meaning: memory-mapped I/O doesn't magically speed up an application that reads a single file once. Its major advantages lie elsewhere:
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
(quoted from this answer on using the C mmap() system call)
In other words: your example is about reading a file's contents. In the end, the OS still has to turn to the drive to read all the bytes from there. Meaning: it reads the disk content and puts it in memory. When you do that the first time, it really doesn't matter that you do some "special" things on top of that. On the contrary, because you do "special" things, the memory-mapped approach might even be slower, due to the overhead compared to an "ordinary" read.
And coming back to my first point: even if you had 5 processes reading the same file, the memory-mapped approach isn't necessarily faster. Linux might figure: I already read that file into memory, and it didn't change. So even without explicit "memory mapping", the Linux kernel might cache the data.
Upvotes: 5
Reputation: 719446
GhostCat is correct. In addition to your OS choice, other things can affect performance:
Mapping a file will place greater demand on physical memory. If physical memory is "tight" that could cause paging activity, and a performance hit.
The OS could use a different read-ahead strategy if you read a file using read syscalls versus mapping it into memory. Read-ahead (into the buffer cache) can make file reading a lot faster.
The default buffer size for BufferedReader and the OS memory page size are likely to be different. This may result in the size of disk read requests being different. (Larger reads often result in greater I/O throughput, at least up to a certain point.)
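To experiment with the read-size point, here is a minimal sketch (not the asker's code). It puts a BufferedInputStream with a configurable size underneath the reader, so the size of the reads issued against the file can be varied between runs. The class name SizedBufferRead, the method countLines and the parameter readBytes are hypothetical, and UTF-8 is an assumed encoding. Whether a larger size helps at all has to be measured on the target system.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for trying out different read sizes.
final class SizedBufferRead {

    // Counts lines while reading the file through a byte buffer of readBytes bytes.
    static long countLines(Path file, int readBytes) throws IOException {
        long count = 0;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                // The BufferedInputStream fills its buffer from the file in
                // requests of up to readBytes bytes.
                new BufferedInputStream(Files.newInputStream(file), readBytes),
                StandardCharsets.UTF_8))) { // assumed encoding
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```

Calling countLines(path, 8 * 1024) and countLines(path, 1024 * 1024) on the same file, several times each, gives a rough feel for whether the read size matters on a given machine.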
There could also be "artefacts" caused by the way that you benchmark. For example:
read time will be shorter.
Upvotes: 2
Reputation: 73568
The memory mapping doesn't really give any advantage, since even though you're bulk loading a file into memory, you're still processing it one byte at a time. You might see a performance increase if you processed the buffer in suitably sized byte[] chunks. Even then the BufferedReader version may perform better or at least almost the same.
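A rough sketch of what processing the mapped buffer in byte[] chunks could look like. This is an illustration, not a drop-in replacement for the question's method: the class name ChunkedMappedReader, the method readLines and its parameters are hypothetical, UTF-8 is an assumed encoding, and a Consumer stands in for the question's processor. Lines are split on the '\n' byte, which is safe for UTF-8 because that byte value never occurs inside a multi-byte sequence. Unlike the question's loop, a final line without a trailing newline is also reported.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

// Hypothetical helper: reads lines from a mapped file using bulk gets.
final class ChunkedMappedReader {

    static long readLines(Path file, int mapSize, int chunkSize,
                          Consumer<String> onLine) throws IOException {
        long lineCount = 0;
        byte[] chunk = new byte[chunkSize];
        // Bytes of a line that is not finished yet (it may span chunks or mappings).
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                long len = Math.min(mapSize, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    int n = Math.min(chunk.length, buf.remaining());
                    buf.get(chunk, 0, n); // bulk copy instead of one get() per byte
                    int start = 0;
                    for (int i = 0; i < n; i++) {
                        if (chunk[i] == '\n') {
                            pending.write(chunk, start, i - start);
                            // Decode the whole line at once (assumed UTF-8).
                            onLine.accept(new String(pending.toByteArray(), StandardCharsets.UTF_8));
                            pending.reset();
                            start = i + 1;
                            lineCount++;
                        }
                    }
                    // Keep the unfinished tail of this chunk for the next round.
                    pending.write(chunk, start, n - start);
                }
                pos += len;
            }
        }
        if (pending.size() > 0) { // last line without a trailing '\n'
            onLine.accept(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            lineCount++;
        }
        return lineCount;
    }
}
```

A call such as readLines(path, 64 * 1024 * 1024, 64 * 1024, line -> { /* process */ }) would map 64 MB at a time and scan it in 64 KB chunks. Both sizes are arbitrary starting points to tune by measurement. Carriage returns are not stripped, which matches the question's loop.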
The nature of your task is to process a file sequentially. BufferedReader already does this very well and the code is simple, so if I had to choose I'd go with the simplest option.
Also note that your buffer code doesn't work except for single-byte encodings. As soon as you get multiple bytes per character, it will fail magnificently.
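A small sketch of the encoding problem. The class ByteCastDemo and its methods are hypothetical names, and UTF-8 is an assumed encoding. The first method mimics the (char) cast from the question's loop; the second decodes the same bytes with a charset. Because byte is signed in Java, the cast also sign-extends any value above 0x7F, so in practice only ASCII bytes come through unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class contrasting a per-byte cast with real decoding.
final class ByteCastDemo {

    // Mimics the question's loop: every byte becomes one char via a cast.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // negative bytes sign-extend to 0xFFxx
        }
        return sb.toString();
    }

    // Decodes the bytes as UTF-8 (assumed encoding), combining multi-byte sequences.
    static String decodeUtf8(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

For the UTF-8 bytes of "é" (0xC3 0xA9), castEachByte produces two unrelated chars while decodeUtf8 returns the single character. For pure ASCII input both methods agree.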
Upvotes: 3