Cassandra SSTable storage format is changed in 3.0 to support higher level CQL structure directly at Storage engine level. Older format of SSTable was designed to support a very simple model of storing basic key/value pairs which was adequate to support Thrift API. In 2.* and earlier versions of Cassandra, Thrift was predominantly used to define the database schema in the older versions of 2.*. But thrift's column family definition is not sufficient to create tables to store structured application data. CQL was later introduced in Cassandra 2.*, to enable applications to define structured data using SQL like semantics and query the data using primary key. But the simple SSTable storage format available in 2.* was not suitable to support the relational characteristics of the row keys and columns. To query CQL data using primary key, redundant information is required to be stored along with CQL columns in SSTable to associate them with the primary key. This caused lot of duplicate data in the SSTables and inefficiencies in querying the data.
In Cassandra 3.*, the SSTable format is redesigned to support CQL data types and the relational mappings, naturally at Storage engine level to reduce the SSTable size and to support executing CQL queries efficiently. Basically the new storage engine recognizes rows as first class citizens whereas in the old format, the rows are represented as a sequence of cells with no row semantics in it. The row is the primary building block for the new storage engine.
In the new format, SSTable stores a set of Partitions and each partition stores a sequence of rows or range tombstone markers. The partitions are ordered by the partition keys and the rows or markers in a partition are indexed by their clustering keys. The Flags field in the partition determines the type of row following it. The flags are stored as one byte and if extended flags is set in the first byte then another byte is added to indicate the extended flags.
Rest of the article uses an example CQL table with partitioning key and clustering keyto describe the details.
CREATE TABLE flowerskeyspace.irisplot (
petallength float,
sepallength float,
id int,
color text,
PRIMARY KEY (petallength, sepallength, id)
)
insert into irisplot(petallength, sepallength, id, color) values (6,6.3,5,'blue');
insert into irisplot(petallength, sepallength, id, color) values (5.1,5.8,6,'blue');
insert into irisplot(petallength, sepallength, id, color) values (1.4,5.1,1,'red');
insert into irisplot(petallength, sepallength, id, color) values (1.4,4.9,2,'red');
insert into irisplot(petallength, sepallength, id, color) values (4.5,6.4,4,'green');
insert into irisplot(petallength, sepallength, id, color) values (4.7,7,3,'green');
In the above irisplot table the primary key is composed of three CQL columns (petallength, sepallength, id). The first part of the primary key petallength is automatically chosen as Partitioning key. The partitioning key controls which node is responsible to store each CQL row. Rest of the CQL columns in the primary key are called Clustering key (sepallength, id) which is used to select a specific row columns.
Our example table irisplot has single secondary CQL column "color" and it is prefixes with the clustering key value. If there are multiple secondary CQL columns defined then the data representing the clustering key will be repeated for each one of the secondary CQL columns in Cassandra 2.*.
JSON dump of SSTable partition in Cassandra 2.*:
{"key": "1.4",
"cells": [["4.9:2:","",1582054358578646],
["4.9:2:color","red",1582054358578646],
["5.1:1:","",1582054337118746],
["5.1:1:color","red",1582054337118746]]}
JSON dump of SSTable in Cassandra 3.*:[
{
"partition" : {
"key" : [ "4.7" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 18,
"clustering" : [ 7.0, 3 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
"cells" : [
{ "name" : "color", "value" : "green" }
]
}
]
},
{
"partition" : {
"key" : [ "1.4" ],
"position" : 41
},
"rows" : [
{
"type" : "row",
"position" : 59,
"clustering" : [ 4.9, 2 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
},
{
"type" : "row",
"position" : 79,
"clustering" : [ 5.1, 1 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
}
]
},
{
"partition" : {
"key" : [ "5.1" ],
"position" : 100
},
"rows" : [
{
"type" : "row",
"position" : 118,
"clustering" : [ 5.8, 6 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.943769Z" },
"cells" : [
{ "name" : "color", "value" : "blue" }
]
}
]
},
{
"partition" : {
"key" : [ "6.0" ],
"position" : 140
},
"rows" : [
{
"type" : "row",
"position" : 158,
"clustering" : [ 6.3, 5 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.921656Z" },
"cells" : [
{ "name" : "color", "value" : "blue" }
]
}
]
},
{
"partition" : {
"key" : [ "4.5" ],
"position" : 178
},
"rows" : [
{
"type" : "row",
"position" : 196,
"clustering" : [ 6.4, 4 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.950560Z" },
"cells" : [
{ "name" : "color", "value" : "green" }
]
}
]
}
In the new storage engine, still the partitioning key is used to distribute the data to nodes in the cluster but the clustering key is used to index a row in a partition.
For example in the above table partition with key [ "1.4" ] contains two rows with different clustering keys [ 4.9, 2 ] and [ 5.1, 1 ].
{
"partition" : {
"key" : [ "1.4" ],
"position" : 41
},
"rows" : [
{
"type" : "row",
"position" : 59,
"clustering" : [ 4.9, 2 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
},
{
"type" : "row",
"position" : 79,
"clustering" : [ 5.1, 1 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
}
]
}
The SSTable size in Cassandra 3.* is smaller compared to Cassandra 2.*
$ ls -al cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
-rw-r--r-- 1 bharat staff 161 Mar 3 14:26 cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
$ ls -al cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db
-rw-r--r-- 1 bharat staff 286 Feb 18 11:34 cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db
SSTable hex dump
Single Partition Details
The storage format also includes the details of CQL table schema in Statistics.db file
$ ./tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
KeyType: org.apache.cassandra.db.marshal.FloatType
ClusteringTypes: [org.apache.cassandra.db.marshal.FloatType, org.apache.cassandra.db.marshal.Int32Type]
StaticColumns: {}
RegularColumns: {color:org.apache.cassandra.db.marshal.UTF8Type}
For encoding data compactly, the new storage format uses varint type to store integers. it is similar concept as in Google's Protocol Buffer encoding but the Cassandra implementation is different.
In varint encoding format, integers are stored using variable number of bytes instead of using four bytes for 32 bit integer or eight bytes for 64 bit integer. The first byte's most significant bit will indicate whether an integer is stored in a single byte or in multiple bytes. If first byte's most significant bit is set then the integer is stored in extra bytes otherwise its value is stored in the same byte. This encoding scheme can take one to nine bytes to represent integers. The number of leading bits set in the first byte from the most significant bit will indicate the number of bytes needed to store an integer. A zero bit separates the length indicated bytes from the bits that store actual value of integer. Python code to decode the varint data is available at unpack_vint()
In the above example Partition, the row timestamp is encoded as varint. The timestamp is a 64 bit integer but it s compactly stored in three bytes using varint encoding.
Also you can notice that the timestamp is not an absolute value. The row timestamp is stored as delta value from the encoding minimum timestamp in Statistics table.
$ tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Statistics.db
EncodingStats minTimestamp: 1583274357921656
To get the absolute value of Row timestamp we need to ad the integer value in the varint encoding ('110100111010111110011') to the encoding minimum timestamp 1583274357921656
We can use Python to convert it back to absolute value
int('110100111010111110011', 2) = 1734131
Row timestamp = 1583274357921656 + 1734131 = 1583274359655787
$ python -c "import datetime,pytz; print datetime.datetime.fromtimestamp(1583274359655787/1000000.0, pytz.utc)"
2020-03-03 22:25:59.655787+00:00
We can validate using the timestamp from JSON dump of SSTable
"type" : "row",
"position" : 18,
"clustering" : [ 7.0, 3 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
"cells" : [ { "name" : "color", "value" : "green" ]
There are two types of columns Simple and Complex. Simple column is encoded as a single Cell and Complex column is encoded as a sequence of Cells preceded by deletion time and number of columns in it.
For further details on storage engine changes please review the references.
Storange engine change Patch description
In Cassandra 3.*, the SSTable format is redesigned to support CQL data types and the relational mappings, naturally at Storage engine level to reduce the SSTable size and to support executing CQL queries efficiently. Basically the new storage engine recognizes rows as first class citizens whereas in the old format, the rows are represented as a sequence of cells with no row semantics in it. The row is the primary building block for the new storage engine.
In the new format, SSTable stores a set of Partitions and each partition stores a sequence of rows or range tombstone markers. The partitions are ordered by the partition keys and the rows or markers in a partition are indexed by their clustering keys. The Flags field in the partition determines the type of row following it. The flags are stored as one byte and if extended flags is set in the first byte then another byte is added to indicate the extended flags.
Rest of the article uses an example CQL table with partitioning key and clustering keyto describe the details.
CREATE TABLE flowerskeyspace.irisplot (
petallength float,
sepallength float,
id int,
color text,
PRIMARY KEY (petallength, sepallength, id)
)
insert into irisplot(petallength, sepallength, id, color) values (6,6.3,5,'blue');
insert into irisplot(petallength, sepallength, id, color) values (5.1,5.8,6,'blue');
insert into irisplot(petallength, sepallength, id, color) values (1.4,5.1,1,'red');
insert into irisplot(petallength, sepallength, id, color) values (1.4,4.9,2,'red');
insert into irisplot(petallength, sepallength, id, color) values (4.5,6.4,4,'green');
insert into irisplot(petallength, sepallength, id, color) values (4.7,7,3,'green');
In the above irisplot table the primary key is composed of three CQL columns (petallength, sepallength, id). The first part of the primary key petallength is automatically chosen as Partitioning key. The partitioning key controls which node is responsible to store each CQL row. Rest of the CQL columns in the primary key are called Clustering key (sepallength, id) which is used to select a specific row columns.
Our example table irisplot has single secondary CQL column "color" and it is prefixes with the clustering key value. If there are multiple secondary CQL columns defined then the data representing the clustering key will be repeated for each one of the secondary CQL columns in Cassandra 2.*.
JSON dump of SSTable partition in Cassandra 2.*:
{"key": "1.4",
"cells": [["4.9:2:","",1582054358578646],
["4.9:2:color","red",1582054358578646],
["5.1:1:","",1582054337118746],
["5.1:1:color","red",1582054337118746]]}
JSON dump of SSTable in Cassandra 3.*:[
{
"partition" : {
"key" : [ "4.7" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 18,
"clustering" : [ 7.0, 3 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
"cells" : [
{ "name" : "color", "value" : "green" }
]
}
]
},
{
"partition" : {
"key" : [ "1.4" ],
"position" : 41
},
"rows" : [
{
"type" : "row",
"position" : 59,
"clustering" : [ 4.9, 2 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
},
{
"type" : "row",
"position" : 79,
"clustering" : [ 5.1, 1 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
}
]
},
{
"partition" : {
"key" : [ "5.1" ],
"position" : 100
},
"rows" : [
{
"type" : "row",
"position" : 118,
"clustering" : [ 5.8, 6 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.943769Z" },
"cells" : [
{ "name" : "color", "value" : "blue" }
]
}
]
},
{
"partition" : {
"key" : [ "6.0" ],
"position" : 140
},
"rows" : [
{
"type" : "row",
"position" : 158,
"clustering" : [ 6.3, 5 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.921656Z" },
"cells" : [
{ "name" : "color", "value" : "blue" }
]
}
]
},
{
"partition" : {
"key" : [ "4.5" ],
"position" : 178
},
"rows" : [
{
"type" : "row",
"position" : 196,
"clustering" : [ 6.4, 4 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.950560Z" },
"cells" : [
{ "name" : "color", "value" : "green" }
]
}
]
}
In the new storage engine, still the partitioning key is used to distribute the data to nodes in the cluster but the clustering key is used to index a row in a partition.
For example in the above table partition with key [ "1.4" ] contains two rows with different clustering keys [ 4.9, 2 ] and [ 5.1, 1 ].
{
"partition" : {
"key" : [ "1.4" ],
"position" : 41
},
"rows" : [
{
"type" : "row",
"position" : 59,
"clustering" : [ 4.9, 2 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.948204Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
},
{
"type" : "row",
"position" : 79,
"clustering" : [ 5.1, 1 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:57.946200Z" },
"cells" : [
{ "name" : "color", "value" : "red" }
]
}
]
}
The SSTable size in Cassandra 3.* is smaller compared to Cassandra 2.*
$ ls -al cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
-rw-r--r-- 1 bharat staff 161 Mar 3 14:26 cassandra-3.11/data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
$ ls -al cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db
-rw-r--r-- 1 bharat staff 286 Feb 18 11:34 cassandra-2.2/data/data/flowerskeyspace/irisplot-dbf35720528411eabbb8b16d9d604ffd/lb-1-big-Data.db
SSTable hex dump
The storage format also includes the details of CQL table schema in Statistics.db file
$ ./tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Data.db
KeyType: org.apache.cassandra.db.marshal.FloatType
ClusteringTypes: [org.apache.cassandra.db.marshal.FloatType, org.apache.cassandra.db.marshal.Int32Type]
StaticColumns: {}
RegularColumns: {color:org.apache.cassandra.db.marshal.UTF8Type}
Varint Encoding
For encoding data compactly, the new storage format uses varint type to store integers. it is similar concept as in Google's Protocol Buffer encoding but the Cassandra implementation is different.
In varint encoding format, integers are stored using variable number of bytes instead of using four bytes for 32 bit integer or eight bytes for 64 bit integer. The first byte's most significant bit will indicate whether an integer is stored in a single byte or in multiple bytes. If first byte's most significant bit is set then the integer is stored in extra bytes otherwise its value is stored in the same byte. This encoding scheme can take one to nine bytes to represent integers. The number of leading bits set in the first byte from the most significant bit will indicate the number of bytes needed to store an integer. A zero bit separates the length indicated bytes from the bits that store actual value of integer. Python code to decode the varint data is available at unpack_vint()
In the above example Partition, the row timestamp is encoded as varint. The timestamp is a 64 bit integer but it s compactly stored in three bytes using varint encoding.
Delta Encoding
Also you can notice that the timestamp is not an absolute value. The row timestamp is stored as delta value from the encoding minimum timestamp in Statistics table.
$ tools/bin/sstablemetadata data/data/flowerskeyspace/irisplot-ecde46205d9d11eaa60a47b84e452526/md-1-big-Statistics.db
EncodingStats minTimestamp: 1583274357921656
To get the absolute value of Row timestamp we need to ad the integer value in the varint encoding ('110100111010111110011') to the encoding minimum timestamp 1583274357921656
We can use Python to convert it back to absolute value
int('110100111010111110011', 2) = 1734131
Row timestamp = 1583274357921656 + 1734131 = 1583274359655787
$ python -c "import datetime,pytz; print datetime.datetime.fromtimestamp(1583274359655787/1000000.0, pytz.utc)"
2020-03-03 22:25:59.655787+00:00
We can validate using the timestamp from JSON dump of SSTable
"type" : "row",
"position" : 18,
"clustering" : [ 7.0, 3 ],
"liveness_info" : { "tstamp" : "2020-03-03T22:25:59.655787Z" },
"cells" : [ { "name" : "color", "value" : "green" ]
Columns
There are two types of columns Simple and Complex. Simple column is encoded as a single Cell and Complex column is encoded as a sequence of Cells preceded by deletion time and number of columns in it.
For further details on storage engine changes please review the references.
References
putting-some-structure-storage-engineStorange engine change Patch description