Cassandra Memtable


Memtables store the most recent mutations applied to Column families. They function as a write-back cache, providing fast writes and fast reads of recently written data. Inside a Memtable, mutations are kept in sorted order using a skip list data structure. Each Column family is associated with its own Memtable. This post describes the on-heap Memtable type.
 
Memtable Size

There are two settings that limit the Memtable size, depending on whether the Memtable is stored in heap space or off-heap space. Both are specified in MB. By default these settings are commented out in cassandra.yaml.

# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048

If the Memtable size is not configured, Cassandra assigns 1/4 of the max heap size allocated to the Cassandra process. For example, if the max heap size is 64GB, then 16GB is set as the Memtable size.
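The default described above is simple arithmetic, sketched here for illustration only. The function name `default_memtable_space_mb` is hypothetical and is not part of Cassandra.

```python
def default_memtable_space_mb(max_heap_mb: int) -> int:
    """Return the default Memtable space in MB: one quarter of the max heap."""
    return max_heap_mb // 4


# A 64 GB max heap gives a 16 GB (16384 MB) Memtable limit.
print(default_memtable_space_mb(64 * 1024))
```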

Memtable Allocation Types

The memtable_allocation_type setting in cassandra.yaml specifies how Memtable data is stored. The supported allocation types are:

unslabbed_heap_buffers
heap_buffers
offheap_buffers
offheap_objects

By default, Cassandra is configured to store Memtable data in heap space. It uses a Slab allocator to reduce heap fragmentation.

memtable_allocation_type: heap_buffers

 

The figure above shows the association between the different components of a Memtable. Memtables are created during Cassandra startup. Based on the Memtable storage type configured in cassandra.yaml, a pool is allocated either on heap only or both on heap and off heap. All Memtables share the same Memtable Pool, which controls overall Memtable storage usage. Each Memtable has its own Allocator, which serves its allocation requests. Allocator objects hold a reference to the parent Memtable Pool for the configured limits.

A Memtable contains a reference to its Column Family, the Memtable Pool, the positions in the Commit Log segments where its data is persisted, and the mutations. The Commit Log Lower Bound is set when the Memtable is created and points to the position in the active Commit Log segment from which this Memtable's mutations are stored. The Commit Log Upper Bound is set to the position of the last mutation in the Commit Log. When a Memtable is flushed, these two bounds are used to discard the Commit Log segments that store the mutations in this Memtable. The bounds can refer to the same Commit Log segment or to different segments. A single Commit Log segment can contain mutations from multiple Memtables, and a single Memtable can persist mutations into multiple Commit Log segments. Please review the Cassandra Commit Log post for additional details.

The Min timestamp field stores the smallest timestamp of all partitions stored in this Memtable. Live Data Size stores the size of all mutations applied to this Memtable. Partitions contain the actual Column Family mutations applied in this Memtable.  The Current Operations metric tracks the number of columns updated in the stored mutations.
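To make the fields above concrete, here is a minimal sketch of how they might be updated on each write. The class `MemtableSketch`, its method `apply`, and the sizes and commit log positions used below are all hypothetical illustrations, not Cassandra's actual classes or values. The reference to the Memtable Pool is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (commit log segment id, offset within that segment); a made-up representation
CommitLogPosition = Tuple[int, int]


@dataclass
class MemtableSketch:
    column_family: str
    commit_log_lower_bound: CommitLogPosition          # fixed at creation
    commit_log_upper_bound: Optional[CommitLogPosition] = None
    min_timestamp: Optional[int] = None
    live_data_size: int = 0                            # bytes of applied mutations
    current_operations: int = 0                        # number of columns updated
    partitions: Dict[int, Dict[str, object]] = field(default_factory=dict)

    def apply(self, key: int, columns: Dict[str, object], timestamp: int,
              size: int, position: CommitLogPosition) -> None:
        """Record one mutation and update the bookkeeping fields."""
        self.partitions.setdefault(key, {}).update(columns)
        self.live_data_size += size
        self.current_operations += len(columns)
        if self.min_timestamp is None or timestamp < self.min_timestamp:
            self.min_timestamp = timestamp
        # The upper bound follows the last mutation written to the commit log.
        self.commit_log_upper_bound = position


memtable = MemtableSketch("flowerskeyspace.iris", commit_log_lower_bound=(1, 0))
memtable.apply(3, {"sepalwidth": 3.3}, 1622218525529221, 4, (1, 64))
memtable.apply(5, {"sepalwidth": 6.5}, 1622222969042456, 4, (1, 128))
print(memtable.min_timestamp, memtable.live_data_size, memtable.current_operations)
```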

 

Slab Allocator

The Slab allocator combines small memory allocations (less than 128KB) into larger regions of 1MB to avoid heap fragmentation. Allocations bigger than 128KB are allocated directly from the JVM heap. The basic idea is that when the allocations in a slab region have similar lifetimes, more of the JVM's old-generation heap space can be reclaimed at once. For example, if a column of integer type is updated, it needs 4 bytes of heap space. The space is allocated in the current slab region, the next-free-offset pointer is bumped by 4 bytes, and the allocation count is incremented. If an application makes many allocations larger than 128KB, those allocations bypass the slab, spread across the heap, and cause fragmentation, and eventually the JVM needs to defragment the heap.

Each Memtable has its own Slab allocator. Multiple Memtables are managed by a single Slab Pool, which holds the configured Memtable size settings and ensures that the overall data stored in all Memtables stays below the threshold.
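The bump-pointer behaviour described above can be sketched as follows. This is a simplified, single-threaded illustration under assumed names (`Region`, `SlabAllocator`), not Cassandra's implementation. It also omits the Slab Pool accounting, and it treats a request of exactly 128KB as a slab allocation, which is an assumption.

```python
from typing import List, Optional

REGION_SIZE = 1024 * 1024      # each slab region is 1 MB
MAX_SLAB_ALLOC = 128 * 1024    # larger requests bypass the slab


class Region:
    """A fixed-size buffer handed out by bumping a free-offset pointer."""

    def __init__(self, size: int = REGION_SIZE) -> None:
        self.buffer = bytearray(size)
        self.next_free = 0      # next free offset in the buffer
        self.alloc_count = 0    # number of allocations served

    def allocate(self, size: int) -> Optional[memoryview]:
        if self.next_free + size > len(self.buffer):
            return None         # not enough room left in this region
        start = self.next_free
        self.next_free += size
        self.alloc_count += 1
        return memoryview(self.buffer)[start:start + size]


class SlabAllocator:
    def __init__(self) -> None:
        self.current = Region()
        self.regions: List[Region] = [self.current]

    def allocate(self, size: int) -> memoryview:
        if size > MAX_SLAB_ALLOC:
            # Large request: allocate directly, outside any slab region.
            return memoryview(bytearray(size))
        chunk = self.current.allocate(size)
        if chunk is None:
            # Current region is full: start a fresh one and retry.
            self.current = Region()
            self.regions.append(self.current)
            chunk = self.current.allocate(size)
        return chunk


allocator = SlabAllocator()
int_slot = allocator.allocate(4)   # e.g. a 4-byte integer column value
print(allocator.current.next_free, allocator.current.alloc_count)
```

Because every small allocation for a Memtable lands in that Memtable's regions, the regions can be released together once the Memtable has been flushed.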


 


Partitions

Partitions in a Memtable are managed using a skip list data structure. Each partition stores the mutations applied to a particular row. A mutation can be an insert, update, or delete operation applied to a row, and can include zero or more columns.

cqlsh:flowerskeyspace> describe iris;

CREATE TABLE flowerskeyspace.iris (
    id int PRIMARY KEY,
    class text,
    petallength float,
    petalwidth float,
    sepallength float,
    sepalwidth float
)

cqlsh:flowerskeyspace> update iris set sepalwidth=6.4 where id=4;

cqlsh:flowerskeyspace> update iris set sepalwidth=6.5 where id=5;

cqlsh:flowerskeyspace> update iris set sepalwidth=3.3 where id=3;


In the above example, there are three updates to three different rows.
These are stored as three partitions in the Memtable:

partitions size = 3

[0] "DecoratedKey(-7509452495886106294, 00000005)" -> "[flowerskeyspace.iris]
key=5 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
    Row[info=[ts=-9223372036854775808] ]:  | [sepalwidth=6.5 ts=1622222969042456]"

[1] "DecoratedKey(-2729420104000364805, 00000004)" -> "[flowerskeyspace.iris]
key=4 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
    Row[info=[ts=-9223372036854775808] ]:  | [sepalwidth=6.5 ts=1622222995765139]"

[2] "DecoratedKey(9010454139840013625, 00000003)" -> "[flowerskeyspace.iris]
key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
    Row[info=[ts=-9223372036854775808] ]:  | [sepalwidth=3.3 ts=1622218525529221]"

If we update another column of the row with id = 3, the result will be
merged with the previous mutation:

cqlsh:flowerskeyspace> update iris set petalwidth=1.5 where id=3;

The partition with index 2 will be updated:

[2] "DecoratedKey(9010454139840013625, 00000003)" -> "[flowerskeyspace.iris]
key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [petalwidth sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [petalwidth=1.5 ts=1622232283101650],
[sepalwidth=3.3 ts=1622218525529221]" 
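The merge shown above can be sketched in a few lines. This is a simplification under stated assumptions: a plain dictionary stands in for the skip list keyed by DecoratedKey, the helper `apply_update` is hypothetical, and a timestamp tie simply goes to the later call.

```python
from typing import Dict, Tuple

# column name -> (value, write timestamp in microseconds)
Row = Dict[str, Tuple[float, int]]


def apply_update(partitions: Dict[int, Row], key: int, column: str,
                 value: float, ts: int) -> None:
    """Merge one column update into the partition for `key`."""
    row = partitions.setdefault(key, {})
    current = row.get(column)
    # Keep the cell with the newest timestamp; other columns are untouched.
    if current is None or ts >= current[1]:
        row[column] = (value, ts)


partitions: Dict[int, Row] = {}
apply_update(partitions, 3, "sepalwidth", 3.3, 1622218525529221)
apply_update(partitions, 3, "petalwidth", 1.5, 1622232283101650)

print(len(partitions))         # 1: both updates landed in the same partition
print(sorted(partitions[3]))   # ['petalwidth', 'sepalwidth']
```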

 

 

Memtable Flush

Each Column family has its own Memtable. All live mutations to that Column family are appended to the commit log and also applied to the Memtable. During a read, data is fetched from both the Memtables and the SSTables, and the results are merged. A Memtable is flushed to disk to create an SSTable when the Memtable heap threshold is exceeded, when the Memtable expires, or when the commit log threshold is reached.

The expiry is controlled by a flush period, which is set as a schema property (memtable_flush_period_in_ms) on the Column family definition. By default the flush period is set to zero, which disables periodic flushing.

Whether the Memtable heap space has been exceeded is determined by the configured Memtable heap size limit and an internally calculated Memtable cleanup threshold. The cleanup threshold is derived from memtable_flush_writers, which defaults to 2 for a single Cassandra data directory, so memtable_cleanup_threshold is 0.33.

memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)

If the configured heap space limit is 2GB, then the Memtable size threshold is 2GB * 0.33333334, which is around 683 MB. When the combined size of all unflushed Memtables exceeds 683 MB, the Memtable with the largest live data size is flushed.
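The threshold arithmetic and the choice of which Memtable to flush can be sketched as below. The function names and the table sizes are hypothetical and only illustrate the rule described above; they are not Cassandra APIs.

```python
from typing import Dict, Optional


def cleanup_threshold(flush_writers: int) -> float:
    """memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)."""
    return 1.0 / (flush_writers + 1)


def memtable_to_flush(live_sizes_mb: Dict[str, float], heap_limit_mb: float,
                      flush_writers: int = 2) -> Optional[str]:
    """Return the table whose Memtable should be flushed, or None."""
    limit_mb = heap_limit_mb * cleanup_threshold(flush_writers)
    if sum(live_sizes_mb.values()) <= limit_mb:
        return None
    # Over the threshold: pick the Memtable with the largest live data size.
    return max(live_sizes_mb, key=live_sizes_mb.get)


# With a 2 GB limit the threshold is about 683 MB, so 400 + 300 MB triggers a flush.
print(round(2048 * cleanup_threshold(2)))
print(memtable_to_flush({"iris": 400.0, "users": 300.0}, heap_limit_mb=2048))
```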

When the total space used by commit logs exceeds the configured commitlog_total_space_in_mb, Cassandra selects the Memtables that are associated with the oldest segments and flushes them. If the oldest commit log segment contains data for different Column families, then the Memtables of those Column families will be flushed. By default, commitlog_total_space_in_mb is set to the smaller of 8192 MB and 1/4 of the total space of the commit log directory.
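The selection of Memtables by oldest commit log segment can be sketched as follows. This is an illustrative model only: the `Segment` representation, the function `tables_to_flush`, and the segment sizes are hypothetical, and the real selection logic in Cassandra may differ in detail.

```python
from typing import List, Set, Tuple

# (segment size in MB, Column families with unflushed data in that segment)
Segment = Tuple[int, Set[str]]


def tables_to_flush(segments: List[Segment], limit_mb: int) -> Set[str]:
    """Walk segments oldest first until total usage would fit the limit."""
    total_mb = sum(size for size, _ in segments)
    to_flush: Set[str] = set()
    for size, dirty_tables in segments:
        if total_mb <= limit_mb:
            break
        # Flushing these Memtables lets this segment be discarded.
        to_flush |= dirty_tables
        total_mb -= size
    return to_flush


segments = [
    (32, {"iris", "users"}),   # oldest segment
    (32, {"iris"}),
    (32, {"orders"}),          # newest segment
]
print(sorted(tables_to_flush(segments, limit_mb=64)))   # ['iris', 'users']
```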

Other events, such as repair, nodetool drain, or nodetool flush, can also cause Memtables to be flushed to disk. A Column family can have more than one Memtable while a flush is pending on its previous Memtable.

Cassandra Memtable Metrics

Cassandra provides various metrics for Memtable as part of Column family metrics.

Metric                     Description
memtableColumnsCount       Number of columns present in the Memtable
memtableOnHeapSize         The amount of data stored in the Memtable which is allocated in heap memory. This also includes overhead associated with Columns
memtableOffHeapSize        The amount of data stored in off heap space by this Memtable. This also includes overhead associated with Columns
memtableLiveDataSize       The amount of live data size
AllMemtableOnHeapSize      The amount of heap memory used by all the Memtables
AllMemtableOffHeapSize     The amount of data stored in off heap space by all Memtables
AllMemtableLiveDataSize    The amount of live data stored in all Memtables
memtableSwitchCount        Number of times flush has resulted in the Memtable being switched out

Commit Log provides durability to the mutations in a Memtable. Get more details in the Commit Log post.

6 comments:

  1. Hello Sir,
    This blog is really helpful to understand memtable. I am trying to read the memtable to access actual data. The data is stored in B-tree form. One component is data size, but it is not actually data, right ? Columnfamilystore will have Keyspace and Table Names details. By observing memtable metrics and the given figures, I am not able to find where actual data is stored ? (Memtable and B-tree connection). Whether it is memtable pool or partitions? Kindly help.

    Replies
    1. The actual data in a Memtable is stored in the partitions field. The data includes Row mutations. A mutation can be an insert, update, or delete on a specific row, and it includes one or more columns.

      For example, the following table has the primary key id:
      CREATE TABLE flowerskeyspace.iris (
      id int PRIMARY KEY,
      class text,
      petallength float,
      petalwidth float,
      sepallength float,
      sepalwidth float
      )
      cqlsh:flowerskeyspace> select * from iris where id=3;

      id | class | petallength | petalwidth | sepallength | sepalwidth
      ----+-----------------+-------------+------------+-------------+------------
      3 | Iris-versicolor | 4.7 | 1.4 | 7 | 3.3

      (1 rows)

      cqlsh:flowerskeyspace> update iris set sepalwidth=3.5 where id=3;

      In the above update sepalwidth column is changed to 3.5 for row with key id=3

      This creates a new B-Tree partition
      BTree [flowerskeyspace.iris] key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [sepalwidth]]
      Row[info=[ts=-9223372036854775808] ]: | [sepalwidth=3.5 ts=1622218525529221]

  2. Thank you Sir for detailed explanation. It is really helpful for me.

  3. My question is - what happens when an update/upsert comes in for a row that's already in memtable. Are there 2 partitions in the memtable or does the upsert happen in memory, whereas the commitlog shows two mutations?

    Replies
    1. In the Memtable, only the latest value is stored. The new value overwrites the old value. The commitlog contains two mutations because it is an append-only file. A Memtable cannot store duplicate values for the same key.

    2. Thank you for confirming. That is what I was expecting. I appreciate your help.
