RocksDB

RocksDB is a popular open source embedded key-value database used by several other well-known systems. It is built on the LevelDB code developed by Google. More and more open source as well as commercial systems have started using RocksDB because of its high performance, flexibility, and tuning features. Many other databases have also improved or customized RocksDB for their own needs.

I am going to write posts that can help in reviewing the RocksDB code. The posts I have already published are listed below.

RocksDB Environment

How To Debug RocksDB Source Code

RocksDB Put Operation

RocksDB Get Operation

RocksDB Java Native Memory Growth 

Please let me know if you would like me to cover any particular area of RocksDB. Also let me know if any updates are needed to the above posts.

RocksDB Put Operation

 

The RocksDB Put operation writes a key-value record to the specified DB, creating it or overwriting an existing value for that key. The RocksDB code is very flexible and has several levels of abstraction. If you want to understand the code path, you can start reading from the functions named in this post.

DB Put Operation
RocksDB stores recently inserted data in the MemTable and periodically flushes it to disk-based SSTables (sst files). However, MemTable data can be lost when a power failure occurs or the application crashes, so the changes must be made durable. RocksDB uses a write-ahead log (WAL) file for this: every write is stored in the MemTable and also appended to the WAL. The entry function for the Put operation is DBImpl::Put() in db/db_impl/db_impl_write.cc.
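The idea of logging every write so that a lost MemTable can be rebuilt can be sketched with a toy model. This is not RocksDB code: ToyStore, Crash(), and Recover() are hypothetical names, and the "WAL" here is just an in-memory vector standing in for a durable file.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: an in-memory "WAL" and "MemTable", not RocksDB code.
class ToyStore {
 public:
  // Log the write first, then apply it to the MemTable.
  void Put(const std::string& key, const std::string& value) {
    wal_.emplace_back(key, value);  // stands in for the durable WAL append
    memtable_[key] = value;         // volatile in-memory state
  }

  // Simulate a crash: the MemTable is lost, the WAL survives.
  void Crash() { memtable_.clear(); }

  // Rebuild the MemTable by replaying the WAL in write order.
  void Recover() {
    for (const auto& rec : wal_) memtable_[rec.first] = rec.second;
  }

  // Returns true and fills *value if the key is in the MemTable.
  bool Get(const std::string& key, std::string* value) const {
    assert(value != nullptr);
    auto it = memtable_.find(key);
    if (it == memtable_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::vector<std::pair<std::string, std::string>> wal_;
  std::map<std::string, std::string> memtable_;
};
```

Calling Put() twice on the same key, then Crash() and Recover(), brings back the latest value because the log is replayed in the order the writes were made.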

Write Batch

RocksDB creates a write batch even for a single write. The write batch holds a sequence number, the count of records in it, and the set of key-value records. A checksum is added when the batch is written to the WAL as a log record. The write batch code is in db/write_batch.cc.
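To make the layout more concrete, here is a minimal sketch of a batch buffer with a fixed header (8-byte sequence number, 4-byte record count) followed by the records. ToyWriteBatch is a hypothetical class, and storing lengths as fixed 4-byte values is a simplification I chose for brevity, so do not read this as the exact encoding used in db/write_batch.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified batch buffer: 8-byte sequence number, 4-byte record count,
// then the records. Lengths are fixed 4-byte values here to keep the sketch
// short; that is an assumption, not RocksDB's exact encoding.
class ToyWriteBatch {
 public:
  static constexpr size_t kHeaderSize = 12;

  ToyWriteBatch() : rep_(kHeaderSize, '\0') {}

  // Append one "put" record and bump the count stored in the header.
  void Put(const std::string& key, const std::string& value) {
    rep_.push_back(static_cast<char>(kTypeValue));
    AppendString(key);
    AppendString(value);
    EncodeFixed(8, Count() + 1, 4);
  }

  void SetSequence(uint64_t seq) { EncodeFixed(0, seq, 8); }
  uint64_t Sequence() const { return DecodeFixed(0, 8); }
  uint32_t Count() const { return static_cast<uint32_t>(DecodeFixed(8, 4)); }
  const std::string& Data() const { return rep_; }

 private:
  enum : unsigned char { kTypeValue = 1 };  // illustrative record tag

  void AppendString(const std::string& s) {
    const size_t pos = rep_.size();
    rep_.resize(pos + 4);
    EncodeFixed(pos, s.size(), 4);
    rep_.append(s);
  }

  // Little-endian fixed-width encode at a byte offset.
  void EncodeFixed(size_t pos, uint64_t v, size_t bytes) {
    assert(pos + bytes <= rep_.size());
    for (size_t i = 0; i < bytes; ++i)
      rep_[pos + i] = static_cast<char>((v >> (8 * i)) & 0xff);
  }

  // Little-endian fixed-width decode at a byte offset.
  uint64_t DecodeFixed(size_t pos, size_t bytes) const {
    uint64_t v = 0;
    for (size_t i = 0; i < bytes; ++i)
      v |= static_cast<uint64_t>(static_cast<unsigned char>(rep_[pos + i]))
           << (8 * i);
    return v;
  }

  std::string rep_;
};
```

Keeping everything in one contiguous buffer means the same bytes can be handed to the log writer and later replayed into the MemTable.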
 
WAL Commit
 
This code is in DBImpl::WriteToWAL() in db/db_impl/db_impl_write.cc. It serializes the write batch into the on-disk storage format and appends it to the current WAL file. The lower-level work is done by log::Writer::EmitPhysicalRecord() in db/log_writer.cc and WritableFileWriter::Append() in file/writable_file_writer.cc.
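As a rough illustration of appending serialized batches to a log, the sketch below frames each payload with a small header (4-byte checksum, 2-byte length, 1-byte type) and verifies it on read. The function names are hypothetical, the checksum is FNV-1a used as a dependency-free placeholder (RocksDB uses CRC32C), and the real log format also splits records across fixed-size blocks, which is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Placeholder checksum (FNV-1a). RocksDB uses CRC32C; this stand-in only
// keeps the sketch free of dependencies.
inline uint32_t ToyChecksum(const std::string& data) {
  uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Append one framed record: 4-byte checksum, 2-byte length, 1-byte type,
// then the payload. Payloads are limited to 65535 bytes in this sketch.
inline void AppendRecord(std::string* log, const std::string& payload) {
  assert(log != nullptr && payload.size() <= 0xffff);
  const uint32_t crc = ToyChecksum(payload);
  for (int i = 0; i < 4; ++i)
    log->push_back(static_cast<char>((crc >> (8 * i)) & 0xff));
  log->push_back(static_cast<char>(payload.size() & 0xff));
  log->push_back(static_cast<char>((payload.size() >> 8) & 0xff));
  log->push_back(static_cast<char>(1));  // illustrative "full record" tag
  log->append(payload);
}

// Read the record starting at *pos and advance *pos past it.
// Returns false on truncation or checksum mismatch.
inline bool ReadRecord(const std::string& log, size_t* pos,
                       std::string* payload) {
  const size_t kHeader = 7;
  if (*pos + kHeader > log.size()) return false;
  uint32_t crc = 0;
  for (int i = 0; i < 4; ++i)
    crc |= static_cast<uint32_t>(static_cast<unsigned char>(log[*pos + i]))
           << (8 * i);
  const size_t len =
      static_cast<size_t>(static_cast<unsigned char>(log[*pos + 4])) |
      (static_cast<size_t>(static_cast<unsigned char>(log[*pos + 5])) << 8);
  if (*pos + kHeader + len > log.size()) return false;
  *payload = log.substr(*pos + kHeader, len);
  if (ToyChecksum(*payload) != crc) return false;
  *pos += kHeader + len;
  return true;
}
```

During recovery a reader would call ReadRecord() in a loop and stop at the first record that fails verification, which is how a torn write at the tail of the log can be detected.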

MemTable Write

The MemTable insert code is in the WriteBatchInternal::InsertInto() function. There are several configuration options that control how the MemTable write is performed; the code that handles them is in DBImpl::WriteImpl().

RocksDB Get Operation

The RocksDB Get operation retrieves the value of the record for a given key. The RocksDB code is very flexible and has several levels of abstraction. If you want to understand the code path, you can start reading from the functions named in this post.

DB Get Operation
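RocksDB stores recently inserted data in the MemTable and periodically flushes it to disk-based SSTables (sst files), so a lookup has to search both the MemTable and the SSTables for the given key.
For a database opened in read-only mode, the entry function for the Get operation is DBImplReadOnly::Get() in db/db_impl/db_impl_readonly.cc. A regular read-write database enters through DBImpl::Get() in db/db_impl/db_impl.cc.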
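The search order can be sketched with a toy read path that checks the MemTable first and then the on-disk tables from newest to oldest. ToyGet and the Table alias are hypothetical, and plain std::map objects stand in for both the MemTable and the sst files.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Table = std::map<std::string, std::string>;

// Toy read path: check the MemTable first, then the tables from newest to
// oldest. The first hit wins because newer data shadows older data.
inline bool ToyGet(const Table& memtable,
                   const std::vector<Table>& sst_newest_first,
                   const std::string& key, std::string* value) {
  assert(value != nullptr);
  auto it = memtable.find(key);
  if (it != memtable.end()) {
    *value = it->second;
    return true;
  }
  for (const Table& sst : sst_newest_first) {
    auto hit = sst.find(key);
    if (hit != sst.end()) {
      *value = hit->second;
      return true;
    }
  }
  return false;  // not present anywhere
}
```

The ordering matters: if the same key exists in the MemTable and in an older table, the lookup must stop at the MemTable so that the most recent write is the one returned.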

MemTable lookup
 
First it searches for the given key in the MemTable and returns the value if it is found there. The MemTable is managed as a skip list, and the search is done on it. The main function for this is MemTable::GetFromTable() in db/memtable.cc.
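A minimal sketch of such an ordered lookup is shown below. It assumes entries are keyed by (user key, sequence number) and ordered so that the newest version of a key comes first; ToyMemTable is a hypothetical class, and std::map stands in for the skip list.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// (user key, sequence number). std::map stands in for the skip list here.
using InternalKey = std::pair<std::string, uint64_t>;

// Order by user key ascending, then by sequence number descending, so the
// newest version of a key is reached first.
struct InternalKeyLess {
  bool operator()(const InternalKey& a, const InternalKey& b) const {
    if (a.first != b.first) return a.first < b.first;
    return a.second > b.second;
  }
};

class ToyMemTable {
 public:
  void Add(uint64_t seq, const std::string& key, const std::string& value) {
    entries_[InternalKey(key, seq)] = value;
  }

  // Find the newest version of `key` whose sequence number is
  // <= snapshot_seq. Returns false if no such version exists.
  bool Get(const std::string& key, uint64_t snapshot_seq,
           std::string* value) const {
    assert(value != nullptr);
    auto it = entries_.lower_bound(InternalKey(key, snapshot_seq));
    if (it == entries_.end() || it->first.first != key) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<InternalKey, std::string, InternalKeyLess> entries_;
};
```

With this ordering a single seek lands on the right version, so a reader holding an older snapshot sequence number does not see writes made after it.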

Table Lookup

If the record is not found in the MemTable, it searches for the key through the table cache. This code is in TableCache::Get() in db/table_cache.cc.
 
If it is not found in the table cache, it searches the files at each level, starting from level 0. If you use BlockBasedTable, the default option, the entry point is BlockBasedTable::Get() in table/block_based/block_based_table_reader.cc. It searches the block index to locate the given key, using the BlockIter::Seek() function. Keys are compared using the comparator function, and the corresponding value is returned.
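The index-then-block search can be sketched as two binary searches. This is a simplified model, not the actual block format: ToyTable, BuildTable, and TableGet are hypothetical names, the index here stores the last key of each block, and std::string's default ordering plays the role of the comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (key, value)
using Block = std::vector<Entry>;                   // entries sorted by key

struct ToyTable {
  std::vector<std::string> index;  // index[i] = last key stored in blocks[i]
  std::vector<Block> blocks;
};

// Split sorted entries into blocks of `per_block` entries and build the index.
inline ToyTable BuildTable(const std::vector<Entry>& sorted, size_t per_block) {
  assert(per_block > 0);
  ToyTable t;
  for (size_t i = 0; i < sorted.size(); i += per_block) {
    const size_t end = std::min(sorted.size(), i + per_block);
    t.blocks.emplace_back(sorted.begin() + i, sorted.begin() + end);
    t.index.push_back(sorted[end - 1].first);
  }
  return t;
}

// Seek in the index for the first block whose last key is >= key, then seek
// inside that block and compare the key found with the one requested.
inline bool TableGet(const ToyTable& t, const std::string& key,
                     std::string* value) {
  auto idx = std::lower_bound(t.index.begin(), t.index.end(), key);
  if (idx == t.index.end()) return false;  // key is past the last block
  const Block& block = t.blocks[idx - t.index.begin()];
  auto it = std::lower_bound(
      block.begin(), block.end(), key,
      [](const Entry& e, const std::string& k) { return e.first < k; });
  if (it == block.end() || it->first != key) return false;
  *value = it->second;
  return true;
}
```

The point of the index is that only one data block has to be examined per table, instead of scanning the whole file.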


RocksDB Environment

RocksDB and LevelDB have an abstraction called Env (environment), which provides an interface to operating-system-specific functions. The same kind of abstraction exists in the traditional BerkeleyDB as well. It cleanly separates the database code from OS-specific functionality by encapsulating it.

The Env object provides an interface to system time, the file system API, the thread API, synchronization primitives, and the shared library management API. The RocksDB distribution comes with a few supported environments, such as Posix, HDFS, and Windows. You can also customize part of the environment by providing a new implementation. For example, you can replace the FileSystem calls used by the Env object with a memory-based implementation, or you can mirror the writes to another environment.

To implement a new environment, for example a cloud environment, you need to override the virtual functions in the Env class and provide an implementation based on the cloud APIs.
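The pattern of overriding virtual functions can be sketched with a trimmed-down interface. ToyEnv, InMemoryEnv, and SaveWithTimestamp are hypothetical names used only for illustration; the real rocksdb::Env has many more virtual functions and different signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, trimmed-down interface in the spirit of Env. The real
// rocksdb::Env has many more virtual functions and different signatures.
class ToyEnv {
 public:
  virtual ~ToyEnv() = default;
  virtual uint64_t NowMicros() = 0;
  virtual bool WriteFile(const std::string& name, const std::string& data) = 0;
  virtual bool ReadFile(const std::string& name, std::string* data) = 0;
};

// Memory-backed implementation: files live in a map, time is a fake clock
// that advances by one on every call.
class InMemoryEnv : public ToyEnv {
 public:
  uint64_t NowMicros() override { return clock_micros_++; }

  bool WriteFile(const std::string& name, const std::string& data) override {
    files_[name] = data;
    return true;
  }

  bool ReadFile(const std::string& name, std::string* data) override {
    auto it = files_.find(name);
    if (it == files_.end()) return false;
    *data = it->second;
    return true;
  }

 private:
  uint64_t clock_micros_ = 0;
  std::map<std::string, std::string> files_;
};

// "Database" code depends only on the interface, not on the OS.
inline bool SaveWithTimestamp(ToyEnv* env, const std::string& name,
                              const std::string& body) {
  assert(env != nullptr);
  return env->WriteFile(name, std::to_string(env->NowMicros()) + ":" + body);
}
```

Because SaveWithTimestamp() only sees the interface, the same code can run against a memory-backed environment in tests and an OS-backed one in production.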

Posix Environment

It is the default environment when you create a new database. It uses various POSIX calls for providing OS-level services. The table below lists the APIs used.

Service      System Calls
Time         clock_gettime(), gettimeofday()
File         fopen(), fclose(), open(), close(), fcntl(), stat(), fsync(), statvfs(), rename(), link(), unlink(), access(), pread(), pwrite()
Directory    mkdir(), rmdir(), getcwd(), opendir() and readdir()
Threads      pthread_create(), pthread_self(), pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_join()
Libraries    dlopen(), dlclose(), dlsym() and dlerror()
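To give a feel for what thin wrappers over a few of these calls might look like, here is a small sketch using gettimeofday(), clock_gettime(), and access(). The wrapper names are hypothetical and this is not the code from the Posix environment itself; it assumes a POSIX system.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

#include <sys/time.h>  // gettimeofday
#include <time.h>      // clock_gettime
#include <unistd.h>    // access

// Wall-clock time in microseconds since the epoch.
inline uint64_t PosixNowMicros() {
  struct timeval tv;
  const int rc = gettimeofday(&tv, nullptr);
  assert(rc == 0);
  (void)rc;  // silence unused warning when asserts are compiled out
  return static_cast<uint64_t>(tv.tv_sec) * 1000000u +
         static_cast<uint64_t>(tv.tv_usec);
}

// Monotonic time in nanoseconds; suitable for measuring intervals.
inline uint64_t PosixNowNanos() {
  struct timespec ts;
  const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
  assert(rc == 0);
  (void)rc;
  return static_cast<uint64_t>(ts.tv_sec) * 1000000000u +
         static_cast<uint64_t>(ts.tv_nsec);
}

// True if the path exists, regardless of its type or permissions.
inline bool PosixFileExists(const std::string& path) {
  return access(path.c_str(), F_OK) == 0;
}
```

An environment for another platform would keep the same function contracts and swap the bodies for that platform's calls.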

 

If you want to understand the RocksDB source code, you can debug it using the information in another blog post, How To Debug RocksDB Source Code.

How To Debug RocksDB Source Code

RocksDB is a popular open source embedded key-value database used by several other well-known systems. It is a derivative of LevelDB, which was developed by Google. More and more open source as well as commercial systems have started using RocksDB because of its high performance, flexibility, and tuning features.

I wanted to read the source code, but the code base has grown over the years. I tried to search for some of the function implementations, and they are not easy to find because of the many layers of abstraction. The RocksDB site provides very good information on the architecture, internal mechanisms, configuration options, and API. It also shows a sample program in Getting Started, and I wanted to use this program to debug the RocksDB code.

First we need to build RocksDB with debug information enabled in the library. For this you can download the source code from GitHub.


git clone https://github.com/facebook/rocksdb.git

In order to compile the code, you need to install or download the dependent libraries. The best place to get this information is INSTALL.md, which has instructions for each supported OS type. Please follow them and install all the required libraries. The latest code needs GCC version 7.0 or greater. I am using CentOS 7.8, which ships GCC version 4.8.5, so I had to get the GCC source code, compile it, and install it to get GCC version 7.3.0. The GCC compilation took a while. Once GCC was installed successfully, I compiled the RocksDB code.

 Here is my sample program based on the Getting Started guide.

 

$ cat Test.cc
#include <cassert>
#include <iostream>
#include <string>

#include "rocksdb/db.h"

int main(int argc, char** argv) {
    // Open (or create) the database under /tmp/testdb.
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/testdb", &db);
    assert(s.ok());
    std::string value = "Hello World";
    std::string key = "key";
    std::string value1;

    // Write the record, read it back, and print it.
    if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key, value);
    if (s.ok()) s = db->Get(rocksdb::ReadOptions(), key, &value1);
    if (s.ok()) std::cout << value1 << std::endl;
    else std::cerr << s.ToString() << std::endl;

    delete db;
    return s.ok() ? 0 : 1;
}

To compile you need to include all the dependent libraries. 

g++ -v -I./include -I/usr/include -g -o Test Test.cc \
-std=gnu++17 -lzstd -lpthread -lsnappy -lbz2 -llz4 \
-lz -ldl ./librocksdb_debug.a

I used the -v flag to see more information when the compiler reports errors, and the -g flag to include debug information for the test program. Once the sample code compiles successfully, you can start debugging it using gdb.

The dynamic libraries linked to the program are:

$ ldd Test
    linux-vdso.so.1 =>  (0x00007ffd52ffb000)
    libzstd.so.1 => /usr/local/lib/libzstd.so.1 (0x00007f26d5a7a000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f26d585e000)
    libsnappy.so.1 => /usr/lib64/libsnappy.so.1 (0x00007f26d5658000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f26d5448000)
    liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007f26d5239000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f26d5023000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f26d4e1f000)
    libstdc++.so.6 => /usr/local/lib64/libstdc++.so.6 (0x00007f26d4a9d000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f26d479b000)
    libgcc_s.so.1 => /usr/local/lib64/libgcc_s.so.1 (0x00007f26d4584000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f26d41b6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f26d5cef000)

Please let me know if you encounter any issues. I spent a lot of time figuring out how to debug this simple program, so I hope this helps. Enjoy debugging the code. I have also written another blog article on the RocksDB environment, which abstracts the operating system specifics from the DB code.