Kramer Li

Reputation: 2476

How could a WAL (write-ahead log) have better performance than writing directly to disk?

WAL (write-ahead log) technology is used in many systems.

The mechanism of a WAL is that when a client writes data, the system does two things:

  1. Write a log to disk and return to the client
  2. Write the data to disk, cache or memory asynchronously
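A minimal sketch of this two-step path (the class and record framing here are made up for illustration, not any particular system's format):

    import os

    class SimpleWAL:
        """Sketch of a WAL write path: append + fsync, then apply in memory."""

        def __init__(self, path="wal.log"):
            # O_APPEND makes every write a sequential append to the log.
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            self.memtable = {}  # in-memory state, flushed to data files later

        def put(self, key: bytes, value: bytes) -> None:
            # Step 1: append a length-prefixed record and force it to disk.
            record = (len(key).to_bytes(4, "big") + key +
                      len(value).to_bytes(4, "big") + value)
            os.write(self.fd, record)
            os.fsync(self.fd)  # durable: safe to acknowledge the client now
            # Step 2: apply the change in memory; the data files are updated
            # asynchronously (e.g. by a background flush).
            self.memtable[key] = value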

There are two benefits:

  1. If an exception occurs (e.g. power loss), we can recover the data from the log.
  2. Performance is good, because the data is written asynchronously and operations can be batched.

Why not just write the data to disk directly? Make every write go straight to disk: on success, tell the client it succeeded; if the write failed, return a failure response or time out.

In this way, you still have those two benefits.

  1. You do not need to recover anything after a power failure, because every success response returned to the client means the data is really on disk.
  2. Performance should be the same. Although we touch the disk frequently, so does the WAL (every successful WAL write means the record is on disk).
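For comparison, the direct-write path proposed here might look like this sketch (how the offset for a record is found is assumed to be handled elsewhere; names are hypothetical):

    import os

    def put_direct(fd: int, offset: int, value: bytes) -> None:
        # Write the record in place at its final location (a random write),
        # then fsync before telling the client it succeeded.
        os.pwrite(fd, value, offset)
        os.fsync(fd)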

So what is the advantage of using a WAL?

Upvotes: 18

Views: 4948

Answers (5)

saolof

Reputation: 1631

The actual answer is to make sure you don't corrupt existing data with an incomplete write, i.e. to make your writes atomic. Append-only logs can be rolled back easily; in-place random-access writes cannot.
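A sketch of why rollback is easy for an append-only log, assuming a hypothetical length-plus-checksum record format: a torn write at the tail fails its checksum and can simply be truncated away, leaving everything before it intact.

    import os
    import struct
    import zlib

    def append_record(fd: int, payload: bytes) -> None:
        # Each record is framed as: 4-byte length, 4-byte CRC32, payload.
        header = struct.pack(">II", len(payload), zlib.crc32(payload))
        os.write(fd, header + payload)
        os.fsync(fd)

    def recover(path: str) -> list[bytes]:
        data = open(path, "rb").read()
        records, offset = [], 0
        while offset + 8 <= len(data):
            length, crc = struct.unpack_from(">II", data, offset)
            payload = data[offset + 8 : offset + 8 + length]
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn tail from an incomplete write
            records.append(payload)
            offset += 8 + length
        with open(path, "r+b") as f:
            f.truncate(offset)  # roll back: drop the corrupt tail
        return records

An in-place random write offers no such clean boundary: a crash mid-write can leave the old record half overwritten.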

Upvotes: 0

Sanskar Agarwal

Reputation: 39

I am making an educated guess here (it applies to spinning-disk HDDs; given that DB hardware is often meant to be inexpensive, it is still practically relevant):

Suppose a client commits several transactions to the DB server. When using a WAL:

Append all the deltas in the transaction to the write-ahead log (WAL) file. After some time (or at some scheduled frequency), the server takes the deltas from the log and consults the inode table to see which sectors/tracks need to be written with those deltas. It can then make all the changes to a particular track at once; after it finishes the first sector/track, it can consolidate all the changes for the next sector and persist them onto that track at once. With this approach, the disk head does not need to move back and forth between sectors for each small delta: the consolidated set of changes is pushed all at once, saving the disk-head movement overhead. Also, since writes to the WAL are sequential, disk seek time is avoided; the DB server knows which track the WAL file is appended on, so this sequential write to the WAL file is optimal.

Alternatively, if we wrote every incremental change straight to the data files, disk seek time would increase a lot (depending on the location of the sectors/tracks involved), and this would lead to poor performance.
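A sketch of the consolidation idea, with file offsets standing in for sector/track locations (all names hypothetical):

    import os

    def flush_deltas(data_fd: int, deltas: dict[int, bytes]) -> None:
        # deltas maps a file offset to the bytes to write there. Applying
        # them in offset order lets the disk head sweep in one direction
        # instead of seeking back and forth for each small delta.
        for offset in sorted(deltas):
            os.pwrite(data_fd, deltas[offset], offset)
        os.fsync(data_fd)  # persist the whole consolidated batch at once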

Upvotes: 1

midor

Reputation: 5557

As you note, a key contribution of a WAL is durability. Once a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.

If you write the update directly to disk, there are two options:

  1. write all records to the end of some file
  2. the files are somehow structured

If you go with 1), it is needless to say that the cost of a read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM tree, which is built from files that are internally sorted by key. For that reason, "directly writing to disk" would mean possibly rewriting every record that comes after the current key. That is too expensive, so instead you

  1. write to the WAL for persistence
  2. update the memtables (in RAM)

Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because it's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you rewrite your file structures to the updated state resulting from many writes, which makes each individual write substantially cheaper.
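A toy version of that write path (illustrative only; this is not RocksDB's API, and a plain dict stands in for the balanced tree / skip list):

    import os

    class LSMSketch:
        def __init__(self, wal_path="wal.log", sst_path="sst.txt"):
            self.wal = open(wal_path, "ab")
            self.sst_path = sst_path
            self.memtable = {}  # stand-in for a sorted in-memory structure

        def put(self, key: str, value: str) -> None:
            # 1. write to the WAL for persistence
            self.wal.write(f"{key}\t{value}\n".encode())
            self.wal.flush()
            os.fsync(self.wal.fileno())  # durable before acknowledging
            # 2. update the memtable (in RAM)
            self.memtable[key] = value

        def flush_memtable(self) -> None:
            # One big sequential rewrite, amortized over many puts.
            with open(self.sst_path, "w") as sst:
                for key in sorted(self.memtable):
                    sst.write(f"{key}\t{self.memtable[key]}\n")
            self.memtable.clear()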

Upvotes: 1

janm

Reputation: 18339

Performance.

  • Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again, so those data writes never need to be performed at all; only the log writes are needed, for possible recovery.

  • Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
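A sketch of that difference, with a hypothetical log file descriptor: fsyncing per record costs one disk round trip each, while batching pays that cost once for the whole group.

    import os

    def commit_each(fd: int, records: list[bytes]) -> None:
        for r in records:
            os.write(fd, r)
            os.fsync(fd)  # one disk round trip per record

    def commit_batched(fd: int, records: list[bytes]) -> None:
        os.write(fd, b"".join(records))  # one larger sequential write
        os.fsync(fd)                     # one disk round trip for the batch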

This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. That is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes and using large sequential writes still help.

Update:

SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is free (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates into different internal block sizes.

Upvotes: 9

Kramer Li

Reputation: 2476

I have a guess.

Making every write go to disk directly does avoid recovery after a power failure, but the performance question needs to be discussed in two situations.

Situation 1:

All your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write while writing the data in place is a random write, and random writes perform far worse than sequential writes on a spinning disk.

Situation 2:

All your devices are SSDs. Then the performance may not differ much, because sequential and random writes have almost the same performance on an SSD.
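One way to check this guess is a rough micro-benchmark; the sketch below times small fsynced writes at sequential versus random offsets (file name and parameters are arbitrary, and the results depend entirely on the device and filesystem):

    import os
    import random
    import time

    N, SIZE, SPAN = 1000, 4096, 1 << 30  # writes, write size, file span
    buf = os.urandom(SIZE)

    fd = os.open("bench.dat", os.O_WRONLY | os.O_CREAT, 0o644)
    os.ftruncate(fd, SPAN)

    start = time.perf_counter()
    for i in range(N):
        os.pwrite(fd, buf, i * SIZE)  # sequential offsets
        os.fsync(fd)
    seq = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(N):
        # aligned random offsets across the file
        os.pwrite(fd, buf, random.randrange(0, SPAN // SIZE) * SIZE)
        os.fsync(fd)
    rnd = time.perf_counter() - start

    print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s")
    os.close(fd)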

Upvotes: 0
