
0101: KV Interface

Memory vs. disk

Software called databases comes in many forms. Some are pure in-memory, such as Redis and Memcached. Some are disk-based, such as MySQL and SQLite.

In-memory databases are limited by RAM size, which is why traditional databases are disk-based. But even disk-based databases often include in-memory data structures, so the two kinds are closely related. We can start from an in-memory KV and build the project by adding code step by step.

A database is an application of data structures. Whether data is stored in memory or on disk makes no difference in principle, but many issues appear in implementations. You rarely see these topics in books, but you can learn them in a project.

Key-value vs. table

Databases can be classified by their interfaces: key-value (KV), relational, and other special types. KV is like map or dict in programming languages, with operations like get, set, del. Relational databases use tables (rows and columns), usually operated with SQL.

In terms of features, relational databases seem to do more. However, more complex relational databases are built on simpler KV systems, often called storage engines. For example, LevelDB and RocksDB can be used as standalone KV stores, and also have SQL DBs built on top. So database implementation starts from KV.
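To make the layering concrete, here is a minimal sketch of how a table row could be stored in a KV store. The key layout (table name, a zero byte, then the primary key), the helper names `rowKey` and `rowVal`, and the `users` table are all assumptions made for illustration; this is not the encoding of any real database.

```go
package main

import (
	"fmt"
	"strings"
)

// rowKey builds a KV key from a table name and a primary key.
// Hypothetical layout: "<table>\x00<primary key>".
func rowKey(table string, pk string) string {
	return table + "\x00" + pk
}

// rowVal packs the non-key columns into one value.
// Hypothetical layout: columns joined by "\x00". A real encoding
// would need length prefixes or escaping to hold arbitrary bytes.
func rowVal(cols ...string) []byte {
	return []byte(strings.Join(cols, "\x00"))
}

func main() {
	kv := map[string][]byte{} // stand-in for the KV store

	// Roughly: INSERT INTO users (id, name, city) VALUES ('42', 'bob', 'paris')
	kv[rowKey("users", "42")] = rowVal("bob", "paris")

	// Roughly: SELECT name, city FROM users WHERE id = '42'
	if val, ok := kv[rowKey("users", "42")]; ok {
		cols := strings.Split(string(val), "\x00")
		fmt.Println(cols[0], cols[1]) // prints "bob paris"
	}
}
```

With a layout like this, a lookup by primary key becomes a single KV get, which is why a relational layer can sit on top of a KV storage engine.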

OLAP vs. OLTP

Databases can be divided by usage into 2 types: Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). These words are just names; “analytical” and “transaction” have no precise meaning. OLAP focuses on analyzing large amounts of data, while OLTP focuses on returning results in real time using indexes. For example:

OLAP: to count active users, Bob runs a SQL query and COUNT()s matching users.
OLTP: to show the user count in real time in an app. Because there are many users, counting them on every request is too slow, so an extra counter is maintained.

OLTP needs real-time results, so each operation has an upper bound on resource use (CPU, IO, memory). If you know data structures, this bound is usually O(log N). An index uses extra information to support real-time queries, trading space for time. In applications, “index” does not always mean a database index feature. In the example above, the app maintains its own extra index for counting.
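The counting example can be sketched in Go as follows. The `Store` and `User` types and their method names are hypothetical, invented for this illustration. `CountScan` plays the role of the OLAP query that visits every user, while `CountIndexed` reads the extra counter that the app keeps up to date on every write.

```go
package main

import "fmt"

// User is a hypothetical record type for this example.
type User struct {
	Name   string
	Active bool
}

// Store keeps the users plus an extra counter that acts as an
// application-level index for "how many users are active?".
type Store struct {
	users       map[string]User
	activeCount int
}

func NewStore() *Store {
	return &Store{users: map[string]User{}}
}

// Put inserts or replaces a user and keeps the counter in sync.
func (s *Store) Put(u User) {
	if old, ok := s.users[u.Name]; ok && old.Active {
		s.activeCount-- // the replaced record no longer counts
	}
	if u.Active {
		s.activeCount++
	}
	s.users[u.Name] = u
}

// CountScan is the OLAP-style answer: visit every user, O(N).
func (s *Store) CountScan() int {
	n := 0
	for _, u := range s.users {
		if u.Active {
			n++
		}
	}
	return n
}

// CountIndexed is the OLTP-style answer: read the counter, O(1).
func (s *Store) CountIndexed() int {
	return s.activeCount
}

func main() {
	s := NewStore()
	s.Put(User{Name: "alice", Active: true})
	s.Put(User{Name: "bob", Active: true})
	s.Put(User{Name: "bob", Active: false}) // bob becomes inactive
	fmt.Println(s.CountScan(), s.CountIndexed()) // prints "1 1"
}
```

The counter costs a little extra space and extra work on every write, which is the space-for-time trade described above.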

OLAP workloads may consume large amounts of resources, so OLAP data is often stored separately from OLTP data. Traditional databases like MySQL and PostgreSQL (PG) are OLTP databases. After the rise of “big data”, databases specialized for OLAP became popular.

Although both are called relational DBs, OLAP usually stores data by column, while OLTP stores by row. Their underlying tech and use cases differ. This project only considers OLTP. Do not mix them up when learning.
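The difference between the two layouts can be pictured with in-memory Go types. This is only an analogy under assumed names (`UserRow`, `UserColumns`, `sumAges`); real databases apply the same idea to their on-disk formats.

```go
package main

import "fmt"

// UserRow is the row layout: all fields of one record sit together.
type UserRow struct {
	ID  int
	Age int
}

// UserColumns is the column layout: all values of one field sit together.
type UserColumns struct {
	IDs  []int
	Ages []int
}

// sumAges is an analytical query. With the column layout it touches
// only the Ages slice and never reads the other fields.
func sumAges(c UserColumns) int {
	total := 0
	for _, a := range c.Ages {
		total += a
	}
	return total
}

func main() {
	// Row layout: one lookup returns the whole record (OLTP-style access).
	rows := map[int]UserRow{
		1: {ID: 1, Age: 30},
		2: {ID: 2, Age: 40},
	}
	fmt.Println(rows[2].Age) // prints 40

	// Column layout: one field is scanned across all records (OLAP-style access).
	cols := UserColumns{IDs: []int{1, 2}, Ages: []int{30, 40}}
	fmt.Println(sumAges(cols)) // prints 70
}
```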

In-memory KV

An in-memory KV starts from the simplest map:

type KV struct {
    mem map[string][]byte
}

func (kv *KV) Open() error {
    kv.mem = map[string][]byte{} // empty
    return nil
}

func (kv *KV) Close() error { return nil }

func (kv *KV) Get(key []byte) (val []byte, ok bool, err error)
func (kv *KV) Set(key []byte, val []byte) (updated bool, err error)
func (kv *KV) Del(key []byte) (deleted bool, err error)

Keys and values are []byte, so they can hold any binary data.
Since Go maps cannot use []byte as keys, string is used.
Disk IO will be added later, so these interfaces return error.
Set and Del must report whether the database state changed.

Enter the db_project/0101 directory. Implement Get, Set, and Del. Run tests:

go test .
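If you get stuck, one possible way to fill in the three methods is sketched below. It is not necessarily what the project's tests expect: it assumes that "state changed" for Set means the key was added or its value differs from the stored one, and it copies the value so that later changes to the caller's slice do not affect the stored data.

```go
package main

import (
	"bytes"
	"fmt"
)

type KV struct {
	mem map[string][]byte
}

func (kv *KV) Open() error {
	kv.mem = map[string][]byte{} // empty
	return nil
}

func (kv *KV) Close() error { return nil }

// Get returns the value for key; ok is false if the key is absent.
func (kv *KV) Get(key []byte) (val []byte, ok bool, err error) {
	val, ok = kv.mem[string(key)]
	return val, ok, nil
}

// Set stores val under key. updated is true if the key is new or
// its value changed (assumed meaning of "state changed").
func (kv *KV) Set(key []byte, val []byte) (updated bool, err error) {
	prev, exist := kv.mem[string(key)]
	updated = !exist || !bytes.Equal(prev, val)
	// Copy val so the map does not alias the caller's slice.
	kv.mem[string(key)] = append([]byte(nil), val...)
	return updated, nil
}

// Del removes key; deleted is true if the key existed.
func (kv *KV) Del(key []byte) (deleted bool, err error) {
	_, deleted = kv.mem[string(key)]
	delete(kv.mem, string(key))
	return deleted, nil
}

func main() {
	kv := &KV{}
	if err := kv.Open(); err != nil {
		panic(err)
	}
	defer kv.Close()

	updated, _ := kv.Set([]byte("k"), []byte("v"))
	fmt.Println(updated) // true: new key
	updated, _ = kv.Set([]byte("k"), []byte("v"))
	fmt.Println(updated) // false: same value, nothing changed
	val, ok, _ := kv.Get([]byte("k"))
	fmt.Println(string(val), ok) // v true
	deleted, _ := kv.Del([]byte("k"))
	fmt.Println(deleted) // true: key existed
	deleted, _ = kv.Del([]byte("k"))
	fmt.Println(deleted) // false: key already gone
}
```

The `string(key)` conversions are what bridge the []byte interface and the string-keyed map.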

ACID

Atomicity, Consistency, Isolation, and Durability, the so-called ACID, are often used to describe databases, as if they were 4 well-defined DB properties. Many people find them hard to understand. This is because they are vague ideas without clear definitions, and they often mean different things.

Atomicity literally means indivisible. For example, writing n bytes to a file:

With concurrent reads, a reader may see half-written data.
If power fails mid-write, after recovery only half the data may exist.

Both can be described as lack of atomicity, and databases must address them. But one is about concurrency, the other about durability. They are unrelated problems.

Consistency is used to describe internal database logic, business logic, or distributed systems, without even a vague definition.

Isolation refers to whether a transaction is affected by other concurrent transactions.

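Durability means that once the DB reports success, the data can be trusted not to disappear. In implementations, durability cannot be considered alone, because atomicity is also involved: a write that survives a power failure only in half-written form is not durable in any useful sense.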

Database behavior is complex, and ACID does not fully describe it. In practice, you must learn the specific behavior of each database. When implementing a database yourself, you will directly face real problems databases must solve, and their solutions.
