Cassandra 3.x SSTable Storage Format

Cassandra SSTable storage format is changed in 3.0 to support higher level CQL structure directly at Storage engine level. Older format of SSTable was designed to support a very simple model of storing basic key/value pairs which was adequate to support Thrift API. In 2.* and earlier versions of Cassandra, Thrift was predominantly used to define the database schema in the older versions of 2.*. But thrift's column family definition is not sufficient to create tables to store structured application data. CQL was later introduced in Cassandra 2.*, to enable applications to define structured data using SQL like semantics and query the data using primary key. But the simple SSTable storage format available in 2.* was not suitable to support the relational characteristics of the row keys and columns. To query CQL data using primary key, redundant information is required to be stored along with CQL columns in SSTable to associate them with the primary key. This caused lot of duplicate data in the SSTables and inefficiencies in querying the data.

In Cassandra 3.*, the SSTable format is redesigned to support CQL data types and the relational mappings, naturally at Storage engine level to reduce the SSTable size and to support executing CQL queries efficiently. Basically the new storage engine recognizes rows as first class citizens whereas in the old format, the rows are represented as a sequence of cells with no row semantics in it. The row is the primary building block for the new storage engine.


In the new format, SSTable stores a set of Partitions and each partition stores a sequence of rows or range tombstone markers. The partitions are ordered by the partition keys and the rows or markers in a partition are indexed by their clustering keys.  The Flags field in the partition determines the type of row following it. The flags are stored as one byte and if extended flags is set in the first byte then another byte is added to indicate the extended flags.

Rest of the article uses an example CQL table with partitioning key and clustering keyto describe the details.

CREATE TABLE flowerskeyspace.irisplot (
    petallength float,
    sepallength float,
    id int,
    color text,
    PRIMARY KEY (petallength, sepallength, id)
)


insert into irisplot(petallength, sepallength, id, color) values (6,6.3,5,'blue');
insert into irisplot(petallength, sepallength, id, color) values (5.1,5.8,6,'blue');
insert into irisplot(petallength, sepallength, id, color) values (1.4,5.1,1,'red');
insert into irisplot(petallength, sepallength, id, color) values (1.4,4.9,2,'red');
insert into irisplot(petallength, sepallength, id, color) values (4.5,6.4,4,'green');
insert into irisplot(petallength, sepallength, id, color) values (4.7,7,3,'green');




In the above irisplot table the primary key is composed of three CQL columns (petallength, sepallength, id). The first part of the primary key petallength is automatically chosen as Partitioning key. The partitioning key controls which node is responsible to store each CQL row. Rest of the CQL columns in the primary key are called Clustering key (sepallength, id) which is used to select a specific row columns. 


Our example table irisplot has single secondary CQL column "color" and it is prefixes with the clustering key value. If there are multiple secondary CQL columns defined then the data representing the clustering key will be repeated for each one of the secondary CQL columns in Cassandra 2.*.

JSON dump of SSTable partition in Cassandra 2.*:
{"key": "1.4",
 "cells": [["4.9:2:","",1582054358578646],
           ["4.9:2:color","red",1582054358578646],
           ["5.1:1:","",1582054337118746],
           ["5.1:1:color","red",1582054337118746]]}

JSON dump of SSTable in Cassandra 3.*:[
  {
    "partition" : {
      "key" : [ "4.7" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "clustering" : [ 7.0, 3 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
        "cells" : [
          { "name" : "color", "value" : "green" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "1.4" ],
      "position" : 41
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 59,
        "clustering" : [ 4.9, 2 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
        "cells" : [
          { "name" : "color", "value" : "red" }
        ]
      },
      {
        "type" : "row",
        "position" : 79,
        "clustering" : [ 5.1, 1 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
        "cells" : [
          { "name" : "color", "value" : "red" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "5.1" ],
      "position" : 100
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 118,
        "clustering" : [ 5.8, 6 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.943769Z" },
        "cells" : [
          { "name" : "color", "value" : "blue" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "6.0" ],
      "position" : 140
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 158,
        "clustering" : [ 6.3, 5 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.921656Z" },
        "cells" : [
          { "name" : "color", "value" : "blue" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "4.5" ],
      "position" : 178
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 196,
        "clustering" : [ 6.4, 4 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.950560Z" },
        "cells" : [
          { "name" : "color", "value" : "green" }
        ]
      }
    ]
  }

In the new storage engine, still the partitioning key is used to distribute the data to nodes in the cluster but the clustering key is used to index a row in a partition.
For example in the above table partition with key [ "1.4" ] contains two rows with different clustering keys [ 4.9, 2 ] and [ 5.1, 1 ].
{
    "partition" : {
      "key" : [ "1.4" ],
      "position" : 41
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 59,
        "clustering" : [ 4.9, 2 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
        "cells" : [
          { "name" : "color", "value" : "red" }
        ]
      },
      {
        "type" : "row",
        "position" : 79,
        "clustering" : [ 5.1, 1 ],
        "liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
        "cells" : [
          { "name" : "color", "value" : "red" }
        ]
      }
    ]
  }

The SSTable size in Cassandra 3.* is smaller compared to Cassandra 2.*

$ ls -al cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
-rw-r--r--  1 bharat  staff  161 Mar  3 14:26 cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db

$ ls -al cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db
-rw-r--r--  1 bharat  staff  286 Feb 18 11:34 cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db

SSTable hex dump

Single Partition Details



The storage format also includes the details of CQL table schema in Statistics.db file
$ ./tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db

KeyType: org.apache.cassandra.db.marshal.FloatType
ClusteringTypes: [org.apache.cassandra.db.marshal.FloatType, org.apache.cassandra.db.marshal.Int32Type]
StaticColumns: {}
RegularColumns: {color:org.apache.cassandra.db.marshal.UTF8Type}

Varint Encoding


For encoding data compactly, the new storage format uses varint type to store integers. it is similar concept as in Google's Protocol Buffer encoding but the Cassandra implementation is different.
 In varint encoding format, integers are stored using variable number of bytes instead of using four bytes for 32 bit integer or eight bytes for 64 bit integer. The first byte's most significant bit will indicate whether an integer is stored in a single byte or in multiple bytes. If first byte's most significant bit is set then the integer is stored in extra bytes otherwise its value is stored in the same byte. This encoding scheme can take one to nine bytes to represent integers. The number of leading bits set in the first byte from the most significant bit will indicate the number of bytes needed to store an integer.  A zero bit separates the length indicated bytes from the bits that store actual value of integer. Python code to decode the varint data is available at unpack_vint()

In the above example Partition, the row timestamp is encoded as varint. The timestamp is a 64 bit integer but it s compactly stored in three bytes using varint encoding.

 Delta Encoding


Also you can notice that the timestamp is not an absolute value. The row timestamp is stored as delta value from the encoding minimum timestamp in Statistics table.

$ tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Statistics.db
EncodingStats minTimestamp: 1583274357921656

To get the absolute value of Row timestamp we need to ad the integer value in the varint encoding ('110100111010111110011') to the encoding minimum timestamp 1583274357921656

We can use Python to convert it back to absolute value

int('110100111010111110011', 2) =  1734131
Row timestamp = 1583274357921656 + 1734131 = 1583274359655787

$ python -c "import datetime,pytz; print datetime.datetime.fromtimestamp(1583274359655787/1000000.0, pytz.utc)"
2020-03-03 22:25:59.655787+00:00

We can validate using the timestamp from JSON dump of SSTable
 "type" : "row",
 "position" : 18,
 "clustering" : [ 7.0, 3 ],
 "liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
 "cells" : [ { "name" : "color", "value" : "green" ]

Columns


There are two types of columns Simple and Complex. Simple column is encoded as a single Cell and Complex column is encoded as a sequence of Cells preceded by deletion time and number of columns in it.



For further details on storage engine changes please review the references.


References

putting-some-structure-storage-engine

Storange engine change Patch description

Cassandra CQL Storage Format



In Cassandra 2.* version, a CQL table is stored using the same storage format that is used for storing thrift based column family. Cassandra stores extra information as part of column names to detect CQL clustering keys and other CQL columns.

CQL Row format includes its Partitioning key, Clustering key followed by a sequence of CQL columns. Each CQL column is prefixed with clustering key value. A clustering key can be defined as a combination of multiple CQL columns.

A new term is introduced to represent CQL columns called Cell. A Cell consists of cell name and cell value. A Cell name is a composite type which is a sequence of individual components.  A component is encoded as three parts. First part is the value length, second part is the value and the last part is a byte representing the end of component. The end of component byte is set to 0 always for CQL columns. Each CQL column which is part of clustering key will be stored as a separate component in the cell in the order that is defined in schema.


Example1: CQL Table schema with only partitioning key and no clustering key

CREATE KEYSPACE flowerskeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy' , 'replication_factor' : 1 };
  CREATE TABLE flowerskeyspace.iris (
    id int PRIMARY KEY,
    class text,
    petallength float,
    petalwidth float,
    sepallength float,
    sepalwidth float
)



CQL Table Rows:

insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (2,4.9,3.0,1.4,0.2,'Iris-setosa');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (1,5.1,3.5,1.4,0.2,'Iris-setosa');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (3,7.0,3.2,4.7,1.4,'Iris-versicolor');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (4,6.4,3.2,4.5,1.5,'Iris-versicolor');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (5,6.3,3.3,6.0,2.5,'Iris-virginica');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (6,5.8,2.7,5.1,1.9,'Iris-virginica');




CQL data in Json format:

{"key": "5",
 "cells": [["","",1581757206044154],
           ["class","Iris-virginica",1581757206044154],
           ["petallength","6.0",1581757206044154],
           ["petalwidth","2.5",1581757206044154],
           ["sepallength","6.3",1581757206044154],
           ["sepalwidth","3.3",1581757206044154]]},


Uncompressed SSTable Data:



CQL Table Row Storage Format:





Example2: CQL Table schema with partitioning key and clustering key

CREATE TABLE flowerskeyspace.irisplot (
    petallength float,
    sepallength float,
    id int,
    color text,
    PRIMARY KEY (petallength, sepallength, id)
)


insert into irisplot(petallength, sepallength, id, color) values (6,6.3,5,'blue');
insert into irisplot(petallength, sepallength, id, color) values (5.1,5.8,6,'blue');
insert into irisplot(petallength, sepallength, id, color) values (1.4,5.1,1,'red');
insert into irisplot(petallength, sepallength, id, color) values (1.4,4.9,2,'red');
insert into irisplot(petallength, sepallength, id, color) values (4.5,6.4,4,'green');
insert into irisplot(petallength, sepallength, id, color) values (4.7,7,3,'green');





JSON data:

[
{"key": "4.7",
 "cells": [["7.0:3:","",1582054414657067],
           ["7.0:3:color","green",1582054414657067]]},
{"key": "1.4",
 "cells": [["4.9:2:","",1582054358578646],
           ["4.9:2:color","red",1582054358578646],
           ["5.1:1:","",1582054337118746],
           ["5.1:1:color","red",1582054337118746]]},
{"key": "5.1",
 "cells": [["5.8:6:","",1582054298177996],
           ["5.8:6:color","blue",1582054298177996]]},
{"key": "6.0",
 "cells": [["6.3:5:","",1582054268167535],
           ["6.3:5:color","blue",1582054268167535]]},
{"key": "4.5",
 "cells": [["6.4:4:","",1582054399453891],
           ["6.4:4:color","green",1582054399453891]]}
]

Uncompressed SSTable data:


For every CQL Row an empty component is added as a marker to allow inserting rows with NULL secondary values. For the empty component, two bytes are used to store the length of the name which is set to 0 and followed by a one byte end of component (EOC) field set to 0. This empty component marker (00 00 00) will be at the beginning of row if there is no clustering key is defined or at the end of clustering key if there is one. For example in the irisplot table we add a row with only clustering key


JSON data:

[
{"key": "4.0",
 "cells": [["7.0:3:","",1582057689702366]]}
]

SSTable uncompressed data:

0000000 00 04 40 80 00 00 7f ff ff ff 80 00 00 00 00 00
0000010 00 00 00 11 00 04 40 e0 00 00 00 00 04 00 00 00
0000020 03 00 00 00 00 00 00 05 9e df 82 9b df de 00 00
0000030 00 00 00 00



Row Tombstone

When a row is deleted, only the row key with the deletion info which consists of 8 byte markedForDeleteAt timestamp and 4 byte localDeletionTime are stored.


JSON data:
[
{"key": "6.0",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1582065526802267,"localDeletionTime":1582065526}},
 "cells": []}
]

Uncompressed SSTable Data:

0000000 00 04 40 c0 00 00 5e 4c 67 76 00 05 9e e1 55 bc
0000010 87 5b 00 00                       

1582065526802267 = 0x00059ee155bc875b
1582065526 = 0x 5e4c6776

Partitioning key and Clustering key

Basically in Cassandra 2.*, the CQL data model was fitted to the existing thrift data storage engine which is designed for storing Thrift column families. This caused redundant storage of clustering key data for each secondary columns. For ex: in the above irisplot table the primary key is composed of three CQL columns (petallength, sepallength, id). The first part of the primary key petallength is automatically chosen as Partitioning key. The partitioning key controls which node is responsible to store each CQL row. Rest of the CQL columns in the primary key are called Clustering key (sepallength, id)

In the above two sample rows both have same petallength of 1.4 cms. These two rows are stored in the same partition but have different clustering keys. The first row's clustering key is "4.9:2:" and the second row's clustering key is "5.1:1:"


{"key": "1.4",
 "cells": [["4.9:2:","",1582054358578646],
           ["4.9:2:color","red",1582054358578646],
           ["5.1:1:","",1582054337118746],
           ["5.1:1:color","red",1582054337118746]]}

 Our example table irisplot has single secondary CQL column "color" and it is prefixes with the clustering key value. If there are multiple secondary CQL columns defined then the data representing the clustering key will be repeated for each one of the secondary CQL columns. In Cassandra 3.* the storage format is changed to have the CQL semantics at the storage level to store CQL data efficiently and also to simplify the CQL queries.


To understand the internal storage format please use the python script Cassandra tools