There are two settings that limit the Memtable size, depending on whether Memtable data is stored in heap space or off-heap space. Both are specified in MB. By default these settings are commented out in cassandra.yaml.
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
If the Memtable size is not configured, Cassandra assigns one quarter of the maximum heap size allocated to the Cassandra process. For example, if the maximum heap size is 64 GB, then 16 GB is set as the Memtable size.
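The fallback rule above can be sketched in a few lines. This is only an illustration of the arithmetic, not Cassandra's code; the function name and parameters are hypothetical.

```python
from typing import Optional


def effective_memtable_space_mb(max_heap_mb: int,
                                configured_mb: Optional[int] = None) -> int:
    """Return the Memtable space limit in MB (hypothetical helper).

    Uses the configured value when one is present; otherwise falls back
    to one quarter of the maximum heap, as described in the text.
    """
    if configured_mb is not None:
        return configured_mb
    return max_heap_mb // 4


# 64 GB heap with nothing configured, then with an explicit 2048 MB setting.
print(effective_memtable_space_mb(64 * 1024))
print(effective_memtable_space_mb(64 * 1024, 2048))
```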
Memtable Allocation Types
The memtable_allocation_type setting in cassandra.yaml specifies how Memtable data is stored. The supported allocation types are:
unslabbed_heap_buffers
heap_buffers
offheap_buffers
offheap_objects
By default Cassandra is configured to store Memtable data in heap space. It uses a slab allocator to reduce heap fragmentation.
memtable_allocation_type: heap_buffers
The figure above shows the association between the different components of a Memtable. Memtables are created during Cassandra startup. Based on the Memtable storage type configured in cassandra.yaml, a pool is allocated either on heap only or on both heap and off heap. All Memtables share the same Memtable Pool, which controls overall Memtable storage usage. Each Memtable has its own Allocator, which serves its allocation requests. Allocator objects hold a reference to the parent Memtable Pool for the configured limits.
A Memtable contains a reference to its Column Family, the Memtable Pool, the positions in the Commit Log segments where its data is persisted, and the mutations. The Commit Log Lower Bound is set when the Memtable is created and points to the position in the active Commit Log segment at which this Memtable's mutations start. The Commit Log Upper Bound is set to the position of the last mutation in the Commit Log. When a Memtable is flushed, these two bounds are used to discard the Commit Log segments that store the mutations in this Memtable. The bounds can refer to the same Commit Log segment or to different segments. A single Commit Log segment can contain mutations from multiple Memtables, and a single Memtable can persist mutations into multiple Commit Log segments. Please review Cassandra Commit Log for additional details.
The Min timestamp field stores the smallest timestamp of all partitions stored in this Memtable. Live Data Size stores the size of all mutations applied to this Memtable. Partitions contain the actual Column Family mutations applied in this Memtable. The Current Operations metric tracks the number of columns updated in the stored mutations.
Slab Allocator
The slab allocator combines small memory allocations (less than 128 KB) into a larger region of 1 MB to avoid heap fragmentation. Allocations bigger than 128 KB are allocated directly from the JVM heap. The basic idea is that when the allocations in a slab region have similar lifetimes, more space in the old generation of the JVM heap can be reclaimed. For example, if a column of integer type is updated, it needs 4 bytes of heap space. The space is allocated in the current slab region, the next-free-offset pointer is bumped by 4 bytes, and the allocation count is incremented. If an application makes allocations larger than 128 KB, those allocations are spread all over the heap and cause heap fragmentation, and eventually the JVM needs to defragment the heap space.
Each Memtable has its own slab allocator. Multiple Memtables are managed by a single Slab Pool, which holds the configured Memtable size settings and ensures that the overall data stored in all Memtables stays below the threshold.
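The bump-pointer behaviour described in this section can be sketched as follows. This is a simplified model, not Cassandra's allocator: the class names are hypothetical, it ignores thread safety and the shared pool limits, and it treats a request of exactly 128 KB as a slab allocation, which is an assumption.

```python
REGION_SIZE = 1024 * 1024      # 1 MB slab region, per the text
MAX_SLAB_ALLOC = 128 * 1024    # larger requests bypass the slab


class SlabRegion:
    """One contiguous region with a bump pointer."""

    def __init__(self, size: int = REGION_SIZE) -> None:
        self.buffer = bytearray(size)
        self.next_free = 0      # offset of the next free byte
        self.alloc_count = 0    # number of allocations served

    def allocate(self, size: int):
        """Return the start offset, or None if the region cannot fit it."""
        if self.next_free + size > len(self.buffer):
            return None
        offset = self.next_free
        self.next_free += size
        self.alloc_count += 1
        return offset


class SlabAllocator:
    """Hypothetical per-Memtable allocator built on SlabRegion."""

    def __init__(self) -> None:
        self.regions = [SlabRegion()]
        self.direct_allocations = []   # oversized buffers, outside any slab

    def allocate(self, size: int):
        """Return (buffer, offset) for a request of `size` bytes."""
        if size > MAX_SLAB_ALLOC:
            buf = bytearray(size)      # allocated directly, not from a slab
            self.direct_allocations.append(buf)
            return buf, 0
        region = self.regions[-1]
        offset = region.allocate(size)
        if offset is None:             # current region is full: start a new one
            region = SlabRegion()
            self.regions.append(region)
            offset = region.allocate(size)
        return region.buffer, offset


allocator = SlabAllocator()
_, first = allocator.allocate(4)    # e.g. a 4-byte integer column
_, second = allocator.allocate(4)
print(first, second, allocator.regions[0].alloc_count)
```

Because every small allocation for a Memtable lands in the same few regions, the whole region becomes garbage together once that Memtable is flushed.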
Partitions
Partitions in a Memtable are managed using a Skip List data structure. Each partition stores the mutations applied to a particular row. A mutation can be an insert, update or delete operation applied to a row, and it includes zero or more columns.
cqlsh:flowerskeyspace> describe iris;
CREATE TABLE flowerskeyspace.iris (
id int PRIMARY KEY,
class text,
petallength float,
petalwidth float,
sepallength float,
sepalwidth float
)
cqlsh:flowerskeyspace> update iris set sepalwidth=6.4 where id=4;
cqlsh:flowerskeyspace> update iris set sepalwidth=6.5 where id=5;
cqlsh:flowerskeyspace> update iris set sepalwidth=3.3 where id=3;
In the above example, there are three updates to three different rows.
These are stored as three partitions in the Memtable
partitions size = 3
[0] "DecoratedKey(-7509452495886106294, 00000005)" -> "[flowerskeyspace.iris]
key=5 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [sepalwidth=6.5 ts=1622222969042456]"
[1] "DecoratedKey(-2729420104000364805, 00000004)" -> "[flowerskeyspace.iris]
key=4 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [sepalwidth=6.5 ts=1622222995765139]"
[2] "DecoratedKey(9010454139840013625, 00000003)" -> "[flowerskeyspace.iris]
key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [sepalwidth=3.3 ts=1622218525529221]"
If we update another column of the row with id = 3, the result is
merged with the previous mutation
cqlsh:flowerskeyspace> update iris set petalwidth=1.5 where id=3;
The partition with index 2 will be updated
[2] "DecoratedKey(9010454139840013625, 00000003)" -> "[flowerskeyspace.iris]
key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
columns=[[] | [petalwidth sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [petalwidth=1.5 ts=1622232283101650],
[sepalwidth=3.3 ts=1622218525529221]"
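The merge behaviour shown in this example can be modelled with a small sketch. It is an assumption-laden simplification: `MemtableSketch` is a hypothetical class, a plain dictionary sorted on demand stands in for the skip list, the tokens are copied from the dump above rather than computed, and timestamp ties are resolved in favour of the later call.

```python
class MemtableSketch:
    """Hypothetical stand-in for the partitions map of a Memtable."""

    def __init__(self) -> None:
        # token -> (partition key, {column: (value, write timestamp)})
        self.partitions = {}

    def apply(self, token, key, column, value, ts) -> None:
        """Merge one column update into the partition for `key`."""
        _, cells = self.partitions.setdefault(token, (key, {}))
        current = cells.get(column)
        # Keep the cell with the newest timestamp (last write wins).
        if current is None or ts >= current[1]:
            cells[column] = (value, ts)

    def ordered(self):
        """Return partitions in token order, as a skip list would iterate."""
        return [self.partitions[t] for t in sorted(self.partitions)]


memtable = MemtableSketch()
memtable.apply(9010454139840013625, 3, "sepalwidth", 3.3, 1622218525529221)
memtable.apply(-7509452495886106294, 5, "sepalwidth", 6.5, 1622222969042456)
# A second update to the row with id=3 merges into the existing partition.
memtable.apply(9010454139840013625, 3, "petalwidth", 1.5, 1622232283101650)

for key, cells in memtable.ordered():
    print(key, sorted(cells))
```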
Memtable Flush
Each Column family has its own Memtable. All live mutations to that Column family are appended to the commit log and also applied to the Memtable. During a read, data is fetched from both the Memtables and the SSTables, and the results are merged. A Memtable is flushed to disk to create an SSTable when the Memtable heap threshold is exceeded, when the Memtable expires, or when the commit log threshold is reached.
The expiry is controlled by a flush period, which is set as a schema property (memtable_flush_period_in_ms) on the Column family definition. By default the flush period is zero, which disables periodic flushing.
Whether the Memtable heap space is exceeded is determined by the configured Memtable heap size limit and an internally calculated Memtable cleanup threshold. The cleanup threshold is calculated from memtable_flush_writers, which defaults to 2 for a single Cassandra data directory, so memtable_cleanup_threshold is 0.33.
memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)
If the configured heap space limit is 2 GB, the Memtable size threshold is 2 GB * 0.33333334, which is around 683 MB. When the combined size of all unflushed Memtables exceeds 683 MB, the Memtable with the largest live data size is flushed.
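The threshold arithmetic and the choice of which Memtable to flush can be sketched as below. The function names and the example table sizes are hypothetical; only the formula and the "largest live data size" rule come from the text.

```python
def cleanup_threshold(flush_writers: int = 2) -> float:
    """memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1)."""
    return 1.0 / (flush_writers + 1)


def flush_trigger_mb(limit_mb: float, flush_writers: int = 2) -> float:
    """Total unflushed Memtable size (MB) above which a flush starts."""
    return limit_mb * cleanup_threshold(flush_writers)


def pick_memtable_to_flush(live_sizes_mb: dict, limit_mb: float,
                           flush_writers: int = 2):
    """Return the table with the largest Memtable if the total exceeds
    the trigger, otherwise None."""
    if sum(live_sizes_mb.values()) <= flush_trigger_mb(limit_mb, flush_writers):
        return None
    return max(live_sizes_mb, key=live_sizes_mb.get)


print(round(flush_trigger_mb(2048)))                          # 2 GB limit
print(pick_memtable_to_flush({"iris": 400, "users": 300}, 2048))
print(pick_memtable_to_flush({"iris": 300, "users": 300}, 2048))
```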
When the total space used by commit logs exceeds the configured commitlog_total_space_in_mb, Cassandra selects the Memtables that are associated with the oldest segments and flushes them. If the oldest commit log segment contains data for different Column families, then the Memtables of those Column families are flushed. By default commitlog_total_space_in_mb is set to the smaller of 8192 MB and one quarter of the total space of the commit log volume.
Other operations, such as repair, nodetool drain and nodetool flush, can also cause Memtables to be flushed to disk. A Column family can have more than one Memtable when a flush is pending on its previous Memtable.
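A rough sketch of the selection described in this paragraph follows. It is a simplification under stated assumptions: the function name, the segment representation and the 32 MB segment sizes are illustrative, and it assumes a segment is reclaimed as soon as the tables listed for it are flushed.

```python
def tables_to_flush(segments, total_space_limit_mb):
    """Pick the Column families whose Memtables must be flushed.

    `segments` is a list of (size_mb, tables_with_unflushed_data),
    ordered oldest first. Segments are walked from the oldest until the
    remaining commit log usage fits within the limit.
    """
    used = sum(size for size, _ in segments)
    to_flush = set()
    for size, dirty_tables in segments:
        if used <= total_space_limit_mb:
            break
        to_flush |= dirty_tables   # every table with data in this segment
        used -= size               # the segment can be discarded afterwards
    return to_flush


segments = [
    (32, {"iris", "users"}),   # oldest segment holds data for two tables
    (32, {"iris"}),
    (32, {"orders"}),
]
print(sorted(tables_to_flush(segments, total_space_limit_mb=64)))
```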
Cassandra Memtable Metrics
Cassandra provides various metrics for Memtable as part of Column family metrics.
Metric | Description |
---|---|
memtableColumnsCount | Number of columns present in the Memtable |
memtableOnHeapSize | The amount of data stored in the Memtable which is allocated in heap memory. This also includes overhead associated with Columns |
memtableOffHeapSize | The amount of data stored in off heap space by this Memtable. This also includes overhead associated with Columns |
memtableLiveDataSize | The amount of live data stored in the Memtable |
AllMemtableOnHeapSize | The amount of heap memory used by all the Memtables |
AllMemtableOffHeapSize | The amount of data stored in off heap space by all Memtables |
AllMemtableLiveDataSize | The amount of live data stored in all Memtables |
memtableSwitchCount | Number of times flush has resulted in the Memtable being switched out |
Commit Log provides durability to the mutations in a Memtable. Get more details on Commit Log at
Hello Sir,
This blog is really helpful to understand memtable. I am trying to read the memtable to access actual data. The data is stored in B-tree form. One component is data size, but it is not actually data, right? Columnfamilystore will have Keyspace and Table Names details. By observing memtable metrics and the given figures, I am not able to find where actual data is stored (Memtable and B-tree connection). Whether it is memtable pool or partitions? Kindly help.
The actual data in a Memtable is stored in the partitions field. The data consists of row mutations. A mutation can be an insert, update or delete on a specific row, and it includes one or more columns.
For example, the following table has id as its primary key:
CREATE TABLE flowerskeyspace.iris (
id int PRIMARY KEY,
class text,
petallength float,
petalwidth float,
sepallength float,
sepalwidth float
)
cqlsh:flowerskeyspace> select * from iris where id=3;
id | class | petallength | petalwidth | sepallength | sepalwidth
----+-----------------+-------------+------------+-------------+------------
3 | Iris-versicolor | 4.7 | 1.4 | 7 | 3.3
(1 rows)
cqlsh:flowerskeyspace> update iris set sepalwidth=3.5 where id=3;
In the above update, the sepalwidth column is changed to 3.5 for the row with key id=3.
This creates a new B-Tree partition:
BTree [flowerskeyspace.iris] key=3 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [sepalwidth]]
Row[info=[ts=-9223372036854775808] ]: | [sepalwidth=3.5 ts=1622218525529221]
Thank you Sir for detailed explanation. It is really helpful for me.
My question is - what happens when an update/upsert comes in for a row that's already in memtable. Are there 2 partitions in the memtable or does the upsert happen in memory, whereas the commitlog shows two mutations?
In the Memtable only the latest value is stored. The new updated value overwrites the old value. The commit log contains two mutations because it is an append-only file. A Memtable cannot store duplicate values for the same key.
Thank you for confirming. That is what I was expecting. I appreciate your help.