
0102: Binary Serialization

Serialization

To store in-memory data structures on disk or send them over a network, a program must first convert them into a byte sequence. This conversion is called serialization.

For example, a KV pair:

type Entry struct {
    key []byte
    val []byte
}

func (ent *Entry) Encode() []byte

Implement Encode() using the following format:

| key size | val size | key data | val data |
| 4 bytes  | 4 bytes  |   ...    |   ...    |

For example, key=a and val=bb returns []byte{1, 0, 0, 0, 2, 0, 0, 0, 'a', 'b', 'b'}.

For slices, strings, and other variable-length types, the size must come first so that the decoder knows how many bytes to read. Here the size is stored as a little-endian uint32. Use binary.LittleEndian.PutUint32() from the encoding/binary package to convert an integer into 4 bytes.

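// requires: import "encoding/binary"
func (ent *Entry) Encode() []byte {
    // 4-byte key size + 4-byte val size + key data + val data
    data := make([]byte, 4+4+len(ent.key)+len(ent.val))
    binary.LittleEndian.PutUint32(data[0:4], uint32(len(ent.key)))
    binary.LittleEndian.PutUint32(data[4:8], uint32(len(ent.val)))
    copy(data[8:], ent.key)              // key data follows the 8-byte header
    copy(data[8+len(ent.key):], ent.val) // val data follows the key data
    return data
}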
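To see what "little-endian" means for the size fields, here is a small sketch. The helper encodeUint32LE is a hypothetical name used only for illustration; it just wraps binary.LittleEndian.PutUint32().

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeUint32LE is a hypothetical helper: it returns the 4-byte
// little-endian encoding of v (least significant byte first).
func encodeUint32LE(v uint32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, v)
	return buf
}

func main() {
	fmt.Println(encodeUint32LE(1))          // [1 0 0 0]
	fmt.Println(encodeUint32LE(0x01020304)) // [4 3 2 1]
	// Uint32() is the inverse operation, used when decoding.
	fmt.Println(binary.LittleEndian.Uint32([]byte{2, 0, 0, 0})) // 2
}
```

The least significant byte comes first, which is why a key size of 1 is stored as 1, 0, 0, 0.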

Deserialization

Deserialization parses a byte sequence back into data. Next, implement:

func (ent *Entry) Decode(r io.Reader) error

When calling Decode(), the caller does not know how many bytes the entry occupies, so it cannot pass a slice of the right length. Instead, use the standard library io.Reader interface:

type Reader interface {
    Read(p []byte) (n int, err error)
}

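Inside Decode(), call r.Read() to read data, like reading from a file. But since it is an interface, the underlying implementation is not necessarily a file. Note that a single Read() call may return fewer bytes than requested, so io.ReadFull() is the safer way to fill a fixed-size buffer.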

Enter the db_project/0102 directory. Implement Entry.Decode(). Run tests:

go test .
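If you get stuck, here is one possible sketch of Decode(), not necessarily the reference solution. It assumes the format above and uses io.ReadFull(), because a single Read() may return fewer bytes than requested. The helper decodeBytes is a hypothetical name added only for the demonstration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type Entry struct {
	key []byte
	val []byte
}

// Decode reads one entry laid out as
// | key size | val size | key data | val data |.
func (ent *Entry) Decode(r io.Reader) error {
	var header [8]byte
	// io.ReadFull keeps calling r.Read() until the buffer is full
	// or an error occurs.
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return err
	}
	klen := binary.LittleEndian.Uint32(header[0:4])
	vlen := binary.LittleEndian.Uint32(header[4:8])
	// Assumption: sizes are trusted. Real code would likely reject
	// unreasonably large sizes before allocating.
	data := make([]byte, int(klen)+int(vlen))
	if _, err := io.ReadFull(r, data); err != nil {
		return err
	}
	ent.key = data[:klen]
	ent.val = data[klen:]
	return nil
}

// decodeBytes is a hypothetical helper that decodes from memory.
func decodeBytes(b []byte) (Entry, error) {
	var ent Entry
	err := ent.Decode(bytes.NewReader(b))
	return ent, err
}

func main() {
	ent, err := decodeBytes([]byte{1, 0, 0, 0, 2, 0, 0, 0, 'a', 'b', 'b'})
	fmt.Println(string(ent.key), string(ent.val), err) // a bb <nil>
}
```

Truncated input makes io.ReadFull() return an error instead of a partially filled entry.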

io.Reader and io.Writer

io.Reader is used for input, and the corresponding output interface is io.Writer.

type Writer interface {
    Write(p []byte) (n int, err error)
}

Any type that implements Read() can be used as an io.Reader, and any type that implements Write() can be used as an io.Writer. The benefit of these interfaces is flexibility. For example, the parameter of Decode() is not a concrete type. In the next step, it will read from a log file, while in this step, the test cases read from memory. You can check how bytes.Buffer is used in the test cases to learn how it works.
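As an illustration of this flexibility, the sketch below reads a 4-byte size from two different readers: a bytes.Buffer and a custom type. Both readSize and oneByteReader are hypothetical names invented for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// oneByteReader is a hypothetical io.Reader that returns at most
// one byte per Read() call, which the interface allows.
type oneByteReader struct{ data []byte }

func (r *oneByteReader) Read(p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	if len(r.data) == 0 {
		return 0, io.EOF
	}
	p[0] = r.data[0]
	r.data = r.data[1:]
	return 1, nil
}

// readSize reads a little-endian uint32 from any io.Reader.
func readSize(r io.Reader) (uint32, error) {
	var buf [4]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return 0, err
	}
	return binary.LittleEndian.Uint32(buf[:]), nil
}

func main() {
	raw := []byte{2, 0, 0, 0}
	fromBuffer, _ := readSize(bytes.NewBuffer(raw))
	fromCustom, _ := readSize(&oneByteReader{data: raw})
	fmt.Println(fromBuffer, fromCustom) // 2 2
}
```

readSize() does not care where the bytes come from; it only depends on the Read() method.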

In fact, Encode() could also use io.Writer instead of returning a slice, though it is not necessary.
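A minimal sketch of that alternative follows. EncodeTo and encodeToBytes are hypothetical names, not part of the project's API.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type Entry struct {
	key []byte
	val []byte
}

// EncodeTo is a hypothetical variant of Encode() that writes the
// entry to an io.Writer instead of returning a slice.
func (ent *Entry) EncodeTo(w io.Writer) error {
	var header [8]byte
	binary.LittleEndian.PutUint32(header[0:4], uint32(len(ent.key)))
	binary.LittleEndian.PutUint32(header[4:8], uint32(len(ent.val)))
	// Write the header, then the key data, then the val data.
	for _, part := range [][]byte{header[:], ent.key, ent.val} {
		if _, err := w.Write(part); err != nil {
			return err
		}
	}
	return nil
}

// encodeToBytes is a hypothetical helper that collects the output
// in memory using bytes.Buffer, which implements io.Writer.
func encodeToBytes(ent *Entry) ([]byte, error) {
	var buf bytes.Buffer
	err := ent.EncodeTo(&buf)
	return buf.Bytes(), err
}

func main() {
	out, err := encodeToBytes(&Entry{key: []byte("a"), val: []byte("bb")})
	fmt.Println(out, err) // [1 0 0 0 2 0 0 0 97 98 98] <nil>
}
```

One trade-off of this sketch: it issues several small writes per entry, whereas returning a slice lets the caller write the whole entry in a single call.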

The Unix syscalls read() and write() follow a similar design. They operate on very different resources, such as files, network sockets, and pipes. What these resources have in common is that they are all inputs and outputs of bytes.

Serialization methods

All serialization methods are similar, differing only in details. To serialize variable-length data like strings, the simplest way is to put the length first, then the data. The length is an integer, and there are many ways to encode it. Some formats use 2 bytes, some use 4 bytes, some use a variable-length integer (varint), and some, like the Redis protocol, use decimal digits.
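The following sketch encodes the same length, 300, in each of these styles. All four function names are hypothetical. The varint variant uses Go's binary.PutUvarint(), and the decimal variant terminates the digits with \r\n in the spirit of the Redis protocol; other formats may differ in details.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strconv"
)

// lenUint16 encodes n as a fixed 2-byte little-endian integer.
func lenUint16(n int) []byte {
	b := make([]byte, 2)
	binary.LittleEndian.PutUint16(b, uint16(n))
	return b
}

// lenUint32 encodes n as a fixed 4-byte little-endian integer.
func lenUint32(n int) []byte {
	b := make([]byte, 4)
	binary.LittleEndian.PutUint32(b, uint32(n))
	return b
}

// lenVarint encodes n as a varint: 7 bits per byte, with the high
// bit set on every byte except the last.
func lenVarint(n int) []byte {
	b := make([]byte, binary.MaxVarintLen64)
	return b[:binary.PutUvarint(b, uint64(n))]
}

// lenDecimal encodes n as ASCII decimal digits followed by \r\n.
func lenDecimal(n int) []byte {
	return []byte(strconv.Itoa(n) + "\r\n")
}

func main() {
	fmt.Println(lenUint16(300))           // [44 1]
	fmt.Println(lenUint32(300))           // [44 1 0 0]
	fmt.Println(lenVarint(300))           // [172 2]
	fmt.Printf("%q\n", lenDecimal(300))   // "300\r\n"
}
```

Fixed-size integers are the simplest to decode, varints save space for small values, and decimal digits keep the format readable as text.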

Besides this binary format, there are text formats such as JSON and XML. “Binary” has nothing to do with number bases; it is just the opposite of “text”. Most text formats do not encode string length, but use delimiters to mark the end of data. JSON uses quotes, XML uses tags.

Text formats look intuitive, but are hard to implement. Because encoded data cannot contain unescaped delimiters, text formats require complex escaping rules. Even a format as simple as JSON has many bugs and incompatibilities across implementations. Compared to simple binary serialization, text formats also cost more CPU time to encode and parse.
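A quick way to see escaping in action is to encode a string that contains the delimiter itself. The sketch below uses Go's encoding/json; encodeJSONString is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeJSONString returns the JSON encoding of s, including the
// surrounding quotes that act as delimiters.
func encodeJSONString(s string) string {
	out, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// The quote characters inside the data must be escaped,
	// otherwise they would end the string early.
	fmt.Println(encodeJSONString(`say "hi"`)) // "say \"hi\""
	// The escape character itself must be escaped too.
	fmt.Println(encodeJSONString(`C:\temp`)) // "C:\\temp"
}
```

With a length-prefixed binary format, the same data is copied as-is, with no scanning or escaping.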

Beyond complexity, text formats often have some arbitrary limits. For example, JSON strings cannot hold arbitrary binary data, so binary data is usually base64-encoded, which adds about 33% to its size. Many JSON libraries do not support the full range of 64-bit integers because they parse numbers as double-precision floats. So unless necessary, do not use text formats.
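Both limits can be observed with Go's encoding/json, shown here as a sketch with hypothetical helper names. Go encodes a []byte as a base64 string, and it decodes numbers into float64 when the target is an untyped interface{}, which is how double-based JSON libraries behave in general.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonBytes returns the JSON encoding of binary data.
// encoding/json represents []byte as a base64 string.
func jsonBytes(data []byte) string {
	out, err := json.Marshal(data)
	if err != nil {
		return ""
	}
	return string(out)
}

// decodeNumber parses a JSON number the way a double-based
// library would: into a float64.
func decodeNumber(text string) (float64, error) {
	var v interface{}
	if err := json.Unmarshal([]byte(text), &v); err != nil {
		return 0, err
	}
	f, ok := v.(float64)
	if !ok {
		return 0, fmt.Errorf("not a number: %s", text)
	}
	return f, nil
}

func main() {
	// 6 bytes of data become 8 base64 characters plus 2 quotes.
	fmt.Println(jsonBytes([]byte{0, 1, 2, 3, 4, 5})) // "AAECAwQF"

	// 2^53+1 cannot be represented exactly as a float64.
	f, _ := decodeNumber("9007199254740993")
	fmt.Printf("%.0f\n", f) // 9007199254740992
}
```

Go can avoid the precision loss by decoding into an int64 field, but a value that passes through a double-based library elsewhere is still at risk.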

For binary serialization, there are implementations like Protobuf and MsgPack, but they are not as widely used as JSON. Many low-level projects invent their own formats. This is because binary serialization is simple and not worth adding a library dependency. Text formats, due to their complexity, are best handled by a library.
