◄ prev contents next ►

⚠

The web version only has simple instructions since chapter 04, while the full book has detailed explanations and background info.

0201: Data Types

Union

This step starts using KV to build a relational DB. Like Excel, one DB can hold many tables. A table has rows and columns. Each column can choose a data type, unlike KV which only has string/bytes. We will implement 2 data types: int64 and []byte.

type CellType uint8

const (
    TypeI64 CellType = 1
    TypeStr CellType = 2
)

type Cell struct {
    Type CellType
    I64  int64
    Str  []byte
}

Cell represents different data types. Differnciated by Type, the data is either I64 or Str. Go has no union type, so some space is wasted.

Serialization

This step implements serialization and deserialization of Cell:

func (cell *Cell) Encode(toAppend []byte) []byte {
    switch cell.Type {
    case TypeI64:
        // TODO
    case TypeStr:
        // TODO
    }
}

func (cell *Cell) Decode(data []byte) (rest []byte, err error)

Requirements:

Cell.Encode() appends the serialized result to the input slice. When serializing a row next, multiple Cell values will be joined. The same slice can be reused to avoid repeated memory allocation.
Cell.Decode() deserializes from the input slice and returns the remaining data.
int64 uses binary.LittleEndian. []byte uses this format:

| length  | data |
| 4 bytes | ...  |

Endianness

A CPU uses fixed-size binary numbers, usually 8, 16, 32, or 64 bits. Bit 0 is the lowest bit. For example, 0b0101 = 4 + 1 = 5. From bit 0 to bit 3, the values are 1, 0, 1, 0. When writing down a number on paper, the low digits are on the right and the high digits on the left. When writing down an array, element 0 is on the left and higher memory addresses are on the right. This difference leads to big endian and little endian.

A 32-bit integer in a CPU register is just 32 bits. When stored in memory, it is grouped into 4 bytes. If the bytes follow natural order (low bits in byte 0, high bits in byte 3), it is little endian. If they follow written order (low bits in byte 3, high bits in byte 0), it is big endian. For example, 0x11223344 in hex: little endian is 44 33 22 11, big endian is 11 22 33 44.

Historically, some computers used big endian. This is why many network protocols like TCP/IP use big endian. Today, little endian is dominant. A little endian CPU must convert big endian data, which adds work. New data formats and protocols mostly use little endian. Big endian still has a special use (sorting) that will appear later.

Signed and Unsigned Integers

You may notice that binary.LittleEndian only has methods using uint64, and none for int64. This is because signed and unsigned integers can be converted directly:

cell.I64 = int64(binary.LittleEndian.Uint64(data[0:8]))

You need to understand how negative numbers are encoded. Using int64 as an example:

Positive numbers [0, 2⁶³) use the same encoding as uint64.
Negative numbers [−2⁶³, 0) match [2⁶³, 2⁶⁴) in uint64.

Converting between uint64 and int64 means doing nothing. The effect is:

The uint64 range is split in half. int64 positive numbers and 0 map to the first half. Negative numbers map to the second half.
For small positive numbers, uint64 and int64 look the same in bits.

The difference between signed and unsigned integers is only how the CPU interprets the bits. Since there is no real conversion, we can freely use type casts.

Sign Bit

The highest bit of a signed integer shows whether it is negative. Many people naively assume computers store numbers as sign + absolute value. That method was only used by some old machines. Modern computers use the method above, called two’s complement. If sign + absolute value were used, there would be both +0 and -0, which exists only in floating point numbers, not integers.

ⓘ

CodeCrafters.io has similar courses in many programming languages, including build your own Redis, SQLite, Docker, etc. It’s worth checking out.

◄ prev contents next ►