Key Concepts: Cassandra CQL Storage Format

In Cassandra 2.x, a CQL table is stored in the same storage format that is used for Thrift-based column families. Cassandra stores extra information as part of the column names to identify CQL clustering keys and other CQL columns.

A CQL row consists of its partitioning key and clustering key, followed by a sequence of CQL columns. Each CQL column is prefixed with the clustering key value. A clustering key can be defined as a combination of multiple CQL columns.

CQL columns are represented by a new term, the Cell. A Cell consists of a cell name and a cell value. A cell name is a composite type, that is, a sequence of individual components. Each component is encoded in three parts: the value length (two bytes), the value itself, and a single byte marking the end of the component. The end-of-component byte is always set to 0 for CQL columns. Each CQL column that is part of the clustering key is stored as a separate component of the cell name, in the order defined in the schema.
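The component encoding described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's own code: the helper names `encode_component` and `encode_cell_name` are made up for this post, and the big-endian byte order is an assumption taken from the hex dumps shown later on this page.

```python
import struct

def encode_component(value: bytes, eoc: int = 0) -> bytes:
    """Encode one component: 2-byte length, the value, 1-byte end-of-component."""
    return struct.pack(">H", len(value)) + value + bytes([eoc])

def encode_cell_name(components) -> bytes:
    """Concatenate the encoded components in schema order (hypothetical helper)."""
    return b"".join(encode_component(c) for c in components)

# Clustering key (sepallength=7.0, id=3) followed by the column name "color".
name = encode_cell_name([
    struct.pack(">f", 7.0),   # float clustering column
    struct.pack(">i", 3),     # int clustering column
    b"color",                 # CQL column name
])
print(name.hex(" "))
```

Each clustering value costs three extra bytes (length plus end-of-component byte) on top of the value itself.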

Example 1: CQL Table schema with only partitioning key and no clustering key

CREATE KEYSPACE flowerskeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy' , 'replication_factor' : 1 };
CREATE TABLE flowerskeyspace.iris (
    id int PRIMARY KEY,
    class text,
    petallength float,
    petalwidth float,
    sepallength float,
    sepalwidth float
);

CQL Table Rows:

insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (2,4.9,3.0,1.4,0.2,'Iris-setosa');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (1,5.1,3.5,1.4,0.2,'Iris-setosa');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (3,7.0,3.2,4.7,1.4,'Iris-versicolor');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (4,6.4,3.2,4.5,1.5,'Iris-versicolor');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (5,6.3,3.3,6.0,2.5,'Iris-virginica');
insert into iris(id, sepallength, sepalwidth, petallength, petalwidth, class) values (6,5.8,2.7,5.1,1.9,'Iris-virginica');

CQL data in JSON format:

{"key": "5",
"cells": [["","",1581757206044154],
           ["class","Iris-virginica",1581757206044154],
           ["petallength","6.0",1581757206044154],
           ["petalwidth","2.5",1581757206044154],
           ["sepallength","6.3",1581757206044154],
           ["sepalwidth","3.3",1581757206044154]]},

Uncompressed SSTable Data:

CQL Table Row Storage Format:

Example 2: CQL Table schema with partitioning key and clustering key

CREATE TABLE flowerskeyspace.irisplot (
    petallength float,
    sepallength float,
    id int,
    color text,
    PRIMARY KEY (petallength, sepallength, id)
);

insert into irisplot(petallength, sepallength, id, color) values (6,6.3,5,'blue');
insert into irisplot(petallength, sepallength, id, color) values (5.1,5.8,6,'blue');
insert into irisplot(petallength, sepallength, id, color) values (1.4,5.1,1,'red');
insert into irisplot(petallength, sepallength, id, color) values (1.4,4.9,2,'red');
insert into irisplot(petallength, sepallength, id, color) values (4.5,6.4,4,'green');
insert into irisplot(petallength, sepallength, id, color) values (4.7,7,3,'green');

JSON data:

[
{"key": "4.7",
"cells": [["7.0:3:","",1582054414657067],
           ["7.0:3:color","green",1582054414657067]]},
{"key": "1.4",
"cells": [["4.9:2:","",1582054358578646],
           ["4.9:2:color","red",1582054358578646],
           ["5.1:1:","",1582054337118746],
           ["5.1:1:color","red",1582054337118746]]},
{"key": "5.1",
"cells": [["5.8:6:","",1582054298177996],
           ["5.8:6:color","blue",1582054298177996]]},
{"key": "6.0",
"cells": [["6.3:5:","",1582054268167535],
           ["6.3:5:color","blue",1582054268167535]]},
{"key": "4.5",
"cells": [["6.4:4:","",1582054399453891],
           ["6.4:4:color","green",1582054399453891]]}
]
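How the cell names in this JSON come out of the inserted rows can be sketched as follows. This is a simplified model, not the real write path: timestamps are left out, the function name `to_partitions` is made up for illustration, and the order of partitions in the real dump depends on the partitioner, which is not reproduced here.

```python
from collections import defaultdict

# (petallength, sepallength, id, color), same values as the inserts above.
rows = [
    (6.0, 6.3, 5, "blue"),
    (5.1, 5.8, 6, "blue"),
    (1.4, 5.1, 1, "red"),
    (1.4, 4.9, 2, "red"),
    (4.5, 6.4, 4, "green"),
    (4.7, 7.0, 3, "green"),
]

def to_partitions(rows):
    """Group rows by partitioning key; emit cells in clustering key order."""
    partitions = defaultdict(list)
    for petallength, sepallength, id_, color in sorted(rows, key=lambda r: (r[1], r[2])):
        prefix = f"{sepallength}:{id_}:"            # clustering key components
        cells = partitions[str(petallength)]
        cells.append([prefix, ""])                  # row marker cell
        cells.append([prefix + "color", color])     # one cell per secondary column
    return dict(partitions)

for key, cells in to_partitions(rows).items():
    print(key, cells)
```

The partition with key "1.4" ends up holding two CQL rows, ordered by clustering key.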

Uncompressed SSTable data:

For every CQL row, an empty component is added as a marker so that rows can be inserted with NULL values in all secondary (non-key) columns. The empty component uses two bytes for the length of the name, set to 0, followed by a one-byte end-of-component (EOC) field, also set to 0. This empty component marker (00 00 00) appears at the beginning of the row if no clustering key is defined, or at the end of the clustering key if there is one. For example, in the irisplot table we add a row that has only key columns and no value for color:

JSON data:

[
{"key": "4.0",
"cells": [["7.0:3:","",1582057689702366]]}
]

SSTable uncompressed data:

0000000 00 04 40 80 00 00 7f ff ff ff 80 00 00 00 00 00
0000010 00 00 00 11 00 04 40 e0 00 00 00 00 04 00 00 00
0000020 03 00 00 00 00 00 00 05 9e df 82 9b df de 00 00
0000030 00 00 00 00
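The dump above can be walked through field by field with a short Python sketch. The layout of the key, deletion info and cell name follows the description in this post; the cell flags byte, the 4-byte value length and the two-byte end-of-row marker are assumptions based on my reading of the 2.x on-disk format. The helper names `take` and `split_components` are invented for this example.

```python
import struct

# Hex dump copied from the text (52 bytes).
dump = bytes.fromhex(
    "00 04 40 80 00 00 7f ff ff ff 80 00 00 00 00 00 "
    "00 00 00 11 00 04 40 e0 00 00 00 00 04 00 00 00 "
    "03 00 00 00 00 00 00 05 9e df 82 9b df de 00 00 "
    "00 00 00 00"
)

def take(fmt, buf, pos):
    """Unpack one big-endian field at pos; return (value, new position)."""
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)

def split_components(name):
    """Split a composite cell name into (value, end-of-component byte) pairs."""
    parts, i = [], 0
    while i < len(name):
        (n,) = struct.unpack_from(">H", name, i)
        parts.append((name[i + 2:i + 2 + n], name[i + 2 + n]))
        i += 2 + n + 1
    return parts

pos = 0
key_len, pos = take(">H", dump, pos)
partition_key, pos = take(">f", dump, pos)        # key_len is 4: a single float
local_deletion_time, pos = take(">i", dump, pos)
marked_for_delete_at, pos = take(">q", dump, pos)
name_len, pos = take(">H", dump, pos)
components = split_components(dump[pos:pos + name_len])
pos += name_len
flags, pos = take(">B", dump, pos)                # assumed: cell flags byte
timestamp, pos = take(">q", dump, pos)
value_len, pos = take(">i", dump, pos)            # assumed: 4-byte value length
pos += value_len
end_of_row, pos = take(">H", dump, pos)           # assumed: zero-length name ends the row

sepallength = struct.unpack(">f", components[0][0])[0]
row_id = struct.unpack(">i", components[1][0])[0]
print(partition_key, sepallength, row_id, components[2], timestamp)
```

Read this way, the partitioning key is 4.0, the clustering key is (7.0, 3), the third component is the empty marker, and the timestamp is the 1582057689702366 seen in the JSON. Because the row is live, the deletion info holds the maximum int and minimum long values.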

Row Tombstone

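When a row is deleted, only the row key and its deletion info are stored. The deletion info consists of an 8-byte markedForDeleteAt timestamp and a 4-byte localDeletionTime. In the dump below, the 4-byte localDeletionTime comes first, followed by the 8-byte markedForDeleteAt.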

JSON data:
[
{"key": "6.0",
"metadata": {"deletionInfo": {"markedForDeleteAt":1582065526802267,"localDeletionTime":1582065526}},
"cells": []}
]

Uncompressed SSTable Data:

0000000 00 04 40 c0 00 00 5e 4c 67 76 00 05 9e e1 55 bc
0000010 87 5b 00 00

1582065526802267 = 0x00059ee155bc875b
1582065526 = 0x5e4c6776
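The same values can be read back from the tombstone dump with a small Python sketch. The big-endian layout matches the hex values above; treating the trailing `00 00` as an end-of-row marker is an assumption on my part.

```python
import struct

# Tombstone dump copied from the text (20 bytes).
dump = bytes.fromhex("00 04 40 c0 00 00 5e 4c 67 76 00 05 9e e1 55 bc 87 5b 00 00")

(key_len,) = struct.unpack_from(">H", dump, 0)    # 2-byte key length
(key,) = struct.unpack_from(">f", dump, 2)        # the key is a single float here
# 4-byte localDeletionTime (seconds), then 8-byte markedForDeleteAt (microseconds).
local_deletion_time, marked_for_delete_at = struct.unpack_from(">iq", dump, 2 + key_len)
trailer = dump[2 + key_len + 12:]                 # assumed end-of-row marker

print(key, local_deletion_time, marked_for_delete_at, trailer.hex())
```

The first timestamp is in seconds and the second in microseconds, which is why the two share the same leading digits.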

Partitioning key and Clustering key

In Cassandra 2.x, the CQL data model was fitted onto the existing storage engine, which was designed for storing Thrift column families. As a result, the clustering key data is stored redundantly for each secondary (non-key) column. For example, in the irisplot table above, the primary key is composed of three CQL columns (petallength, sepallength, id). The first part of the primary key, petallength, is automatically chosen as the partitioning key, which controls which node is responsible for storing each CQL row. The remaining CQL columns in the primary key (sepallength, id) form the clustering key.

The two sample rows shown below both have a petallength of 1.4 cm. They are stored in the same partition but have different clustering keys: the first row's clustering key is "4.9:2:" and the second row's clustering key is "5.1:1:".

{"key": "1.4",
"cells": [["4.9:2:","",1582054358578646],
           ["4.9:2:color","red",1582054358578646],
           ["5.1:1:","",1582054337118746],
           ["5.1:1:color","red",1582054337118746]]}

Our example table irisplot has a single secondary CQL column, "color", and its cell name is prefixed with the clustering key value. If multiple secondary CQL columns are defined, the data representing the clustering key is repeated for each one of them. In Cassandra 3.x, the storage format was changed to carry CQL semantics at the storage level, which stores CQL data more efficiently and also simplifies CQL queries.
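A rough way to put a number on this repetition is sketched below. It counts only the bytes of the clustering key prefix in the cell names of one CQL row, using the component layout described earlier (two length bytes, the value, one end-of-component byte). The function names are hypothetical, and the five-column case is an invented comparison, not a table from this post.

```python
def component_size(value_len: int) -> int:
    """Bytes used by one component: 2-byte length + value + 1-byte EOC."""
    return 2 + value_len + 1

def clustering_prefix_bytes(clustering_value_sizes, n_secondary_columns):
    """Bytes spent repeating the clustering key prefix within one CQL row."""
    prefix = sum(component_size(n) for n in clustering_value_sizes)
    cells = n_secondary_columns + 1   # one extra cell for the row marker
    return prefix * cells

# irisplot: sepallength (4-byte float) and id (4-byte int), one column "color".
irisplot_bytes = clustering_prefix_bytes([4, 4], 1)
# Hypothetical table with the same clustering key and five secondary columns.
wider_bytes = clustering_prefix_bytes([4, 4], 5)
print(irisplot_bytes, wider_bytes)
```

The cost grows linearly with the number of secondary columns, since every cell carries its own copy of the prefix.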

To understand the internal storage format, please use the Python script Cassandra tools.
