Reputation: 2476
Write-ahead logging (WAL) has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:

1. Append the mutation to the WAL file on disk.
2. Apply the mutation to the data files (possibly through an in-memory cache).

There are two benefits:

1. After a crash, the system can recover any acknowledged writes by replaying the log.
2. The client is told "success" only after the write has been made durable.
Why not just write the data to disk directly? Make every write go straight to disk; on success you tell the client "success", and if the write fails you return a failure response or a timeout.
In this way, you still have those two benefits.
So what is the advantage of using a WAL?
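To make the two-step write path concrete, here is a toy sketch (my own illustration, not any real system's code): append the mutation to the log, fsync it, and only then update the in-memory state and acknowledge the client. Recovery just replays the log.

```python
import os

class TinyWAL:
    """Toy write-ahead log: append to the log first, then update state."""

    def __init__(self, path):
        self.path = path
        self.state = {}          # in-memory table (stands in for the data files)
        self.log = open(path, "a+", encoding="utf-8")

    def put(self, key, value):
        # Step 1: append the mutation to the log and force it to disk.
        self.log.write(f"{key}={value}\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # Step 2: apply the mutation to the (cached) data structure.
        self.state[key] = value
        return "success"         # safe to acknowledge: the log entry is durable

    def recover(self):
        # After a crash, rebuild the state by replaying the log.
        self.state = {}
        self.log.seek(0)
        for line in self.log:
            key, _, value = line.rstrip("\n").partition("=")
            self.state[key] = value

if os.path.exists("toy.wal"):
    os.remove("toy.wal")
wal = TinyWAL("toy.wal")
wal.put("a", "1")
wal.put("a", "2")
wal.recover()
print(wal.state)   # {'a': '2'}
```

The question is essentially: step 1 already involves a disk write, so why not make that write go to the data files themselves?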
Upvotes: 18
Views: 4948
Reputation: 1631
The actual answer is to make sure you don't corrupt existing data with an incomplete write, i.e. to make your writes atomic. Append-only logs can be rolled back easily, in-place random access writes cannot.
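The "easy rollback" point can be illustrated with a toy log format (my own framing, not any real database's): each record carries a length and a checksum, so a torn write at the tail is detected on recovery and simply truncated away, leaving all earlier records intact.

```python
import io
import struct
import zlib

def append_record(f, payload: bytes):
    # Frame: 4-byte length + 4-byte CRC32, followed by the payload.
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)
    f.flush()

def recover(f):
    """Return all complete records; truncate the log at the first torn write."""
    f.seek(0)
    records, good_end = [], 0
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        length, crc = struct.unpack("<II", header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                      # incomplete or corrupt tail: roll it back
        records.append(payload)
        good_end = f.tell()
    f.seek(good_end)
    f.truncate()                       # discard the partial record
    return records

log = io.BytesIO()                     # stands in for the on-disk log file
append_record(log, b"put a=1")
append_record(log, b"put a=2")
log.write(b"\x09\x00\x00\x00ga")       # simulate a crash mid-write (torn record)
print(recover(log))                    # [b'put a=1', b'put a=2']
```

An in-place random-access write has no such escape hatch: if the crash happens halfway through overwriting a record, the old value is already partially destroyed.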
Upvotes: 0
Reputation: 39
I am making an educated guess here (applicable in the case of spinning-disk HDDs; given that DB hardware is meant to be inexpensive, it would still make practical sense):
Suppose the client commits several transactions to the DB server. If using a WAL:
Append all deltas in the transactions to the write-ahead log (WAL) file. After some time (or at some scheduled frequency), the server takes the deltas from the log and consults the inode table to see which sectors/tracks need to be written with those deltas. Now it can make all the changes to a particular track at once. After it finishes the first sector/track, it can consolidate all changes for the next sector and persist them onto that track at once. In this approach, the disk head need not move back and forth between sectors for each small delta; the consolidated set of changes is pushed all at once, saving the disk-head movement overhead. Also, since writes to the WAL are sequential, disk seek time is avoided: the DB server knows which track the WAL file is to be appended on, so the sequential write to the WAL file is optimal.
Alternatively, if we wrote every incremental change straight to the data files, disk seek time would increase a lot (depending on the location of the sectors/tracks), leading to poor performance.
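The consolidation idea above can be sketched in a few lines (a toy model; the block size and the offset-to-block mapping are my own assumptions, not a real filesystem's): buffered deltas are grouped by the disk block they touch, so each block is written once rather than once per delta.

```python
from collections import defaultdict

BLOCK_SIZE = 4096   # hypothetical sector/track granularity

def consolidate(deltas):
    """Group buffered WAL deltas (offset, data) by the disk block they touch,
    so each block is written once instead of once per delta."""
    by_block = defaultdict(list)
    for offset, data in deltas:
        by_block[offset // BLOCK_SIZE].append((offset, data))
    # One consolidated write per block, in ascending block order,
    # minimising head movement on a spinning disk.
    return sorted(by_block.items())

deltas = [(10, b"a"), (8200, b"b"), (20, b"c"), (4100, b"d")]
for block, writes in consolidate(deltas):
    print(block, writes)
# 0 [(10, b'a'), (20, b'c')]
# 1 [(4100, b'd')]
# 2 [(8200, b'b')]
```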
Upvotes: 1
Reputation: 5557
As you note, a key contribution of a WAL is durability. After a mutation has been committed to the WAL, you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:

1. Append each mutation to an unsorted on-disk log and reconstruct the current value at read time.
2. Keep the data on disk sorted/indexed and update it in place.

If you go with 1), it is needless to say that the cost of a read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM tree, which uses files that are internally sorted by key. For that reason, "directly writing to disk" means that you would possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you append the mutation to the WAL and apply it to a sorted in-memory memtable.
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
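A minimal sketch of the memtable-plus-sorted-flush idea (my own toy code, not RocksDB's actual implementation; a plain dict stands in for the balanced tree):

```python
class MemTable:
    """Toy memtable: absorb random-order writes in memory, flush them
    to disk as one sorted, sequential file (an SSTable, in LSM terms)."""

    def __init__(self):
        self.table = {}            # stands in for a balanced tree

    def put(self, key, value):
        self.table[key] = value    # O(log n) in a real tree; O(1) here

    def flush(self, path):
        # Many buffered writes become a single sorted sequential pass,
        # which is what makes each individual write so cheap.
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.table):
                f.write(f"{key}\t{self.table[key]}\n")
        self.table.clear()

mt = MemTable()
mt.put("cherry", "3"); mt.put("apple", "1"); mt.put("banana", "2")
mt.flush("sstable_0.txt")
print(open("sstable_0.txt").read().splitlines())
# ['apple\t1', 'banana\t2', 'cherry\t3']
```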
Upvotes: 1
Reputation: 18339
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. Those data writes never need to be performed at all; only the log writes are performed, for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue: the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes, and issuing large sequential writes, still helps.
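The batching point can be sketched as a simple group commit (a toy model, not any particular engine's implementation): concurrent writers queue their log records, and a single sequential write plus one fsync makes the whole batch durable.

```python
import os
import threading

class GroupCommitLog:
    """Toy group commit: writers queue records; one commit appends the
    whole batch and issues a single fsync covering all of them."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.pending = []

    def submit(self, record: bytes):
        with self.lock:
            self.pending.append(record)

    def commit(self):
        # One sequential write + one fsync persists every queued record,
        # instead of one write + one fsync per record.
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            self.f.write(b"".join(batch))
            self.f.flush()
            os.fsync(self.f.fileno())
        return len(batch)

log = GroupCommitLog("group.log")
for i in range(5):
    log.submit(f"txn-{i}\n".encode())
print(log.commit())   # 5 records persisted with a single fsync
```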
Update:
SSDs also have better performance with large sequential writes but for different reasons. It is not as simple as saying "no seek time or rotational latency therefore just randomly write". For example, writing large blocks into space the SSD knows is "free" (eg. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially mapping updates into different internal block sizes.
Upvotes: 9
Reputation: 2476
I have a guess.
Making every write to disk directly does not need recovery on power-off, but the performance has to be discussed in two situations.
Situation 1: all your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write while writing the data to disk is a random write, and random writes perform far worse than sequential writes on a spinning disk.
Situation 2: all your devices are SSDs. Then the performance may not differ much, because sequential writes and random writes have almost the same performance on an SSD.
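The two access patterns can be compared with a crude microbenchmark (POSIX-only, since it uses `os.pwrite`; absolute numbers vary wildly by device and say nothing definitive, this just demonstrates the patterns: ascending offsets versus shuffled ones):

```python
import os
import random
import time

def write_pattern(path, offsets, block=b"x" * 4096):
    """Write one 4 KiB block at each offset, then fsync; return elapsed time."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    start = time.perf_counter()
    for off in offsets:
        os.pwrite(fd, block, off)
    os.fsync(fd)
    os.close(fd)
    return time.perf_counter() - start

n, size = 256, 4096
seq = [i * size for i in range(n)]     # WAL-style sequential appends
rnd = random.sample(seq, n)            # in-place random-access writes
t_seq = write_pattern("seq.dat", seq)
t_rnd = write_pattern("rnd.dat", rnd)
print(f"sequential: {t_seq:.4f}s, random: {t_rnd:.4f}s")
```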
Upvotes: 0