Reputation: 17648
In the memory-based computing model, running time can be worked out abstractly, just by considering the data structure.
However, there aren't a lot of docs on high-performance disk I/O algorithms. Thus I ask the following set of questions:
1) How can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
2) And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
3) Finally... how does the JVM optimize access of indexed portions of a file?
And... as far as resources -- in general... Are there any good idioms or libraries for on disk data structure implementations?
Upvotes: 2
Views: 1146
Reputation: 6330
Also note that Linux systems, at least, allow different file systems. Depending on the application, one might be a better fit than the others. http://en.wikipedia.org/wiki/File_system#Linux
Upvotes: 1
Reputation: 31813
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
In chapter 6 of Computer Systems: A Programmer's Perspective they give a pretty practical mathematical model for how long it takes to read some data from a typical magnetic disk.
To quote the last page in the linked pdf:
Putting it all together, the total estimated access time is
T_access = T_avg seek + T_avg rotation + T_avg transfer
         = 9 ms + 4 ms + 0.02 ms
         = 13.02 ms
This example illustrates some important points:
• The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational
latency. Accessing the first byte in the sector takes a long time, but the remaining bytes are essentially
free.
• Since the seek time and rotational latency are roughly the same, twice the seek time is a simple and
reasonable rule for estimating disk access time.
*Note: the linked PDF is from the authors' website, so no piracy.
Of course, if the data was accessed recently, there's a decent chance it's cached somewhere in the memory hierarchy, in which case the access time is extremely small (practically "near instant" compared to disk access time).
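To make the arithmetic in that quote concrete, here is a minimal sketch of the same model in Java. The drive parameters (7200 RPM, 400 sectors per track) are my assumptions; the quote only gives the resulting 9 ms, 4 ms and 0.02 ms terms. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope model of one random sector read on a rotating disk. */
final class DiskAccessModel {

    /** Average rotational latency: half of one full revolution, in ms. */
    static double avgRotationMs(double rpm) {
        return 0.5 * (60.0 / rpm) * 1000.0;
    }

    /** Time for a single sector to pass under the head, in ms. */
    static double avgTransferMs(double rpm, double sectorsPerTrack) {
        return (60.0 / rpm) * (1.0 / sectorsPerTrack) * 1000.0;
    }

    /** Seek + rotation + transfer, in ms. */
    static double accessMs(double avgSeekMs, double rpm, double sectorsPerTrack) {
        return avgSeekMs + avgRotationMs(rpm) + avgTransferMs(rpm, sectorsPerTrack);
    }

    public static void main(String[] args) {
        // Assumed drive: 9 ms average seek, 7200 RPM, 400 sectors per track.
        double t = accessMs(9.0, 7200.0, 400.0);
        System.out.printf("estimated access time: %.2f ms%n", t);
    }
}
```

With those assumed parameters the unrounded total comes out a little above the quoted 13.02 ms, because the quote rounds the rotational latency down to 4 ms. Either way the seek and rotation terms swamp the transfer term.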
2)And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
Another seek plus rotation delay may be incurred if the location you seek to isn't stored sequentially nearby. It depends on where in the file you're seeking, and where that data is physically stored on the disk. For example, reading an entire fragmented file is guaranteed to cause disk seeks.
Something to keep in mind is that even though you may only request a few bytes, physical reads tend to occur in multiples of a fixed-size chunk (the sector size), and the whole chunk ends up in cache. So you may later seek to some nearby location in the file and get lucky that it's already in cache for you.
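For reference, jumping to a specific offset in Java looks roughly like the sketch below, using `RandomAccessFile.seek`. The class name, `readAt` and `demo` are hypothetical helpers. Note that the API call is the same whatever the offset; any difference in cost comes from the disk and the OS cache underneath, as described above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OffsetRead {

    /** Reads exactly len bytes starting at byte offset pos. */
    static byte[] readAt(Path file, long pos, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(pos);        // only moves the file pointer
            byte[] buf = new byte[len];
            raf.readFully(buf);   // throws EOFException if fewer than len bytes remain
            return buf;
        }
    }

    /** Writes a 1 MiB temp file, then reads near the start and near the middle. */
    static int[] demo() {
        try {
            Path tmp = Files.createTempFile("offset-read", ".bin");
            try {
                byte[] data = new byte[1 << 20];
                for (int i = 0; i < data.length; i++) {
                    data[i] = (byte) i;   // byte value = offset mod 256
                }
                Files.write(tmp, data);
                byte[] head = readAt(tmp, 3, 4);
                byte[] mid = readAt(tmp, 512 * 1024 + 7, 4);
                return new int[] { head[0], mid[0] };
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1]);
    }
}
```

A freshly written temp file like this one will almost certainly be served from the OS cache, so this shows the API rather than real disk latency.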
Btw- The full chapter in that book on the memory hierarchy is pure gold, if you're interested in the subject.
Upvotes: 2
Reputation: 718768
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
There are no such universal constants. In fact, performance models of physical disk I/O, file systems and operating systems are too complicated to be able to make accurate predictions for specific operations.
2)And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It is too complicated to predict. For instance, it depends on how much file buffering the OS does, physical disk parameters (e.g. seek times) and how effectively the OS can schedule disk activity ... across all applications.
3)Finally... how does the JVM optimize access of indexed portions of a file?
It doesn't. It is an operating system level thing.
4) are there any good idioms or libraries for on disk data structure implementations?
That is difficult to answer without more details of your actual requirements. But the best idea is not to try and implement this kind of thing yourself. Find an existing library that is a good fit to your requirements.
Upvotes: 1
Reputation: 533492
high performance disk I/o algorithms.
The performance of your hardware is usually so important that what you do in software doesn't matter so much. You should first consider buying the right hardware for the job.
how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
It's simple to time them, as they always take many microseconds each. For example, an HDD can perform 80-120 IOPS and an SSD can perform 80K to 230K IOPS. You can usually get within half of what the manufacturer specifies easily; getting 100% is where you might do tricks in software. Nevertheless, you will never get an HDD to perform like an SSD unless you have lots of memory and only ever read the data, in which case the OS will do all the work for you.
You can buy hybrid drives, which give you the capacity of an HDD but performance close to that of an SSD. For commercial production use you may be willing to spend the money on a disk subsystem with multiple drives. This can increase performance to, say, 500 IOPS, but the cost increases significantly. You usually buy a disk subsystem because you need the capacity and redundancy it provides, but you usually get a performance boost as well from having more spindles working together. Although this link on disk subsystem performance is old (2004), they haven't changed that much since then.
And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It depends on whether it is in memory or not. If it is very close to data you recently read, it quite likely is; if it is far away, it depends on what accesses you have done in the past and how much memory you have free to cache disk accesses.
The typical latency for an HDD is ~8 ms per access (i.e. if you have 10 random reads queued, it can be 80 ms). The typical latency of an SSD is 25 to 100 µs. It is far less likely that reads will already be queued on an SSD, as it is much faster to start with.
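The queueing arithmetic above can be written down as a tiny sketch. It assumes requests are served strictly one at a time with no reordering by the drive or OS, which is a simplification; the names are made up for illustration.

```java
/** Rough latency estimates from IOPS and queue depth, assuming serial service. */
final class QueueLatency {

    /** Mean service time per request implied by an IOPS figure, in ms. */
    static double serviceTimeMs(double iops) {
        return 1000.0 / iops;
    }

    /** Time until the last of `queued` requests completes, in ms. */
    static double queuedLatencyMs(int queued, double perRequestMs) {
        return queued * perRequestMs;
    }

    public static void main(String[] args) {
        // 125 IOPS is an assumed figure, just above the HDD range quoted earlier.
        System.out.println(serviceTimeMs(125.0) + " ms per request");
        System.out.println(queuedLatencyMs(10, 8.0) + " ms for 10 queued reads");
    }
}
```

So 10 queued random reads at 8 ms each give the 80 ms mentioned above, and an IOPS figure is just the reciprocal of the per-request service time.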
how does the JVM optimize access of indexed portions of a file?
Assuming you are using sensible buffer sizes, there is little you can do about it generically in software. What can be done is done by the OS.
are there any good idioms or libraries for on disk data structure implementations?
Use a sensible buffer size like 512 bytes to 64 KB.
Much more importantly, buy the right hardware for your requirements.
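On the buffer-size point, a minimal sketch of setting the buffer explicitly in Java is below. The 64 KB value is just the upper end of the range suggested above, and `sumBytes` and `demo` are hypothetical helpers, not part of any library.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class BufferedRead {

    static final int BUFFER_SIZE = 64 * 1024;

    /** Sums every byte in the file, reading through a buffer of the given size. */
    static long sumBytes(Path file, int bufferSize) {
        long sum = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
            int b;
            // Each read() is served from the in-memory buffer; the underlying
            // stream is only hit when the buffer runs dry.
            while ((b = in.read()) != -1) {
                sum += b;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    /** Writes 200,000 bytes of value 1 to a temp file and sums them back. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("buffered-read", ".bin");
            try {
                byte[] data = new byte[200_000];
                Arrays.fill(data, (byte) 1);
                Files.write(tmp, data);
                return sumBytes(tmp, BUFFER_SIZE);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the buffer, every single-byte `read()` would go to the underlying stream, which is where most of the avoidable software overhead tends to come from.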
Upvotes: 1
Reputation: 29
1) If you need to compare the speed of various IO functions, you have to just run it a thousand times and record how long it takes.
2) That depends on how you plan to get to this index. An index to the beginning of a file is exactly the same as an index to the middle of a file. It just points to a location on the disk. If you get to this index by starting at the beginning and reading your way there, then yes, it will take longer.
3/4) No, these are managed by the operating system itself. Java isn't low-level enough to handle these kinds of operations.
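For point 1, the "run it a thousand times" approach could be sketched like this. `IoTimer` and `meanNanos` are made-up names, and this is deliberately naive: it ignores JIT warm-up, and repeated reads of the same data will mostly measure the OS cache rather than the disk.

```java
/** Naive timing helper: runs a task repeatedly and reports the mean duration. */
final class IoTimer {

    /** Runs the task `runs` times and returns the mean wall-clock time in nanoseconds. */
    static double meanNanos(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) runs;
    }

    public static void main(String[] args) {
        // Placeholder task; swap in the I/O call you actually want to compare.
        double mean = meanNanos(() -> System.getProperty("user.dir"), 1000);
        System.out.println("mean ns per call: " + mean);
    }
}
```

To compare two I/O functions, pass each one in as the task with the same number of runs and compare the means.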
Upvotes: 2