⚠
The web version only has simple instructions since chapter 04, while the full book has detailed explanations and background info.
0105: Checksum
Incomplete Writes
When appending a record to the log, we want it to be either completely written or not written at all; this property is called atomicity. An incomplete write is known as a torn write. However, file writes do not guarantee atomicity in the event of a power loss: not only is the appended data not atomic, even the file size can be wrong. For example, appending a 1000-byte record may result in any of the following situations:
- File size increases by 1000, but the data is only partially written.
- File size increases by 1000, but no data is written; the new space is filled with 0s or garbage.
- File size increases by only 500.
In any of these cases, only the last record is affected; previously fsynced records remain intact. This is another reason to use a log in databases.
Disk Write Atomicity
The CPU provides atomic memory read/write operations that handle data the size of an integer. While CPU atomic operations focus on concurrency, we’re considering atomicity for persistence in the case of power loss. At the hardware level, writing a single sector is likely atomic. A sector is the smallest unit of disk read/write, usually 512B or 4K.
Many software systems rely on atomicity at the sector level to achieve larger-scale atomicity. For example, when writing a log, a sector at the beginning of the log can be reserved to store the position of the last log entry. After the log is successfully fsynced, the sector is updated (including fsyncing this sector). This ensures atomicity for the entire log.
Even without atomicity at the hardware level, software can still achieve atomicity on its own, and without needing the two fsyncs.
Checksum
Suppose log writes are not atomic. If we can detect an incomplete write, we can simply ignore it, so the log effectively ends at the last successfully fsynced record. A checksum makes this detection possible. A checksum is a hash of the data, and different data will almost certainly have different checksums. By storing a checksum with each record, we can identify incomplete writes.
Use the standard library’s crc32.ChecksumIEEE() to compute the checksum for log records and prepend it to the record:
| crc32   | key size | val size | deleted | key data | val data |
|---------|----------|----------|---------|----------|----------|
| 4 bytes | 4 bytes  | 4 bytes  | 1 byte  | ...      | ...      |
Modify these functions:
```go
func (ent *Entry) Encode() []byte
func (ent *Entry) Decode(r io.Reader) error
```
```go
var ErrBadSum = errors.New("bad checksum")
```

Entry.Decode() should return the appropriate errors:

- ErrBadSum: checksum mismatch (data not fully written).
- io.EOF: reached the end of the file (all records are valid).
- io.ErrUnexpectedEOF: file ended prematurely (file size is incorrect, data not fully written).
io.ReadFull()
The io.Reader interface allows returning fewer bytes than the requested buf length. For example, requesting 4 bytes might return only 3, but this doesn’t mean the end of the data (eof). While this typically won’t happen when reading from a file, it’s common with sources like TCP sockets. So when using io.Reader, you should loop until the returned byte count satisfies the buffer size. The standard library’s io.ReadFull() can replace this loop:
```go
func ReadFull(r Reader, buf []byte) (n int, err error)
```

If the buffer is only partially filled when EOF is reached, io.ReadFull() returns io.ErrUnexpectedEOF. This perfectly matches the requirements of the Decode() function.
Handling Incomplete Log Records
Check the error returned by Entry.Decode() and ignore the last incomplete log record, i.e., return eof=true.

```go
func (log *Log) Read(ent *Entry) (eof bool, err error)
```

The test case for this step simulates recovering from the last incomplete log write.
You may wonder: when a checksum fails, how do we know the failure is in the last log entry rather than an earlier one? There is no way to know, because the record's size header itself might be incorrect, so the records after it cannot be reliably located.
Checksum Use
In addition to incomplete writes, checksums can also detect data corruption caused by hardware failures, such as flipped bits due to disk or memory faults. This is a rare occurrence with normal hardware. However, this is not the purpose of checksums in databases, as databases cannot recover from such silent data losses. If a record in the middle of the log is corrupted, it’s impossible to know if subsequent records exist.
Cryptographic hash functions (like SHA-256 or MD5) could also be used as checksums. However, there is no reason to use them, because specialized checksum functions are smaller and faster. Even simple checksum methods, like the 16-bit integer sum used in TCP/IP, are sufficient. However, care must be taken for the case where all bytes are 0: the checksum of all-zero data should not itself be 0, otherwise zero-filled garbage would pass verification. crc32 does not have this issue.
Summary
With log + checksum + fsync, the database can recover after a power loss and ensure previously successful writes are not lost. This is the core functionality of a database. This is why SQLite is commonly used for mobile data storage rather than simply writing JSON to a file. It’s not just for easier querying but because basic file operations cannot ensure durability and atomicity.
For now, we have only a log. If data structures are added in the future, atomicity and durability (A and D of ACID) must be reconsidered.
ⓘ
CodeCrafters.io has similar courses in many programming languages, including build your own Redis, SQLite, Docker, etc. It’s worth checking out.