Apache Cassandra 4.0 Feature Preview

whoami

  • Andy Tolbert
  • Software Engineer @ DataStax on Drivers Team
  • Work on Java and Node.js Drivers
  • Rochester, MN
  • @tolbertam

Motivation

  • This talk may be premature..no imminent C* 4.0 Release Date
  • C* 4.0 entered Feature Freeze in September
  • Good to orient yourself with what's coming..

Resources

Agenda

  • New Features
  • Evaluation of ZGC
  • Enhancements
  • Q&A

New Features

Audit Logging

  • Provides visibility to how system is accessed and used
  • Valued by enterprises for especially when it comes to regulation compliance
  • Logs all login attempts in addition to all commands executed over client native protocol

Configuration

  • audit_logging section of cassandra.yaml
  • Can be enabled and disabled on the fly using nodetool enable/disableauditlog
  • Comes with 2 implementations: SLF4J and binary format
  • Is pluggable, so you can write your own IAuditLogger
  • Offers plenty of filtering options

Logger-based output

  • Configurable using SLF4J
  • Logs to system.log by default

user:cassandra|host:127.0.0.1:7000|source:/127.0.0.1|port:44826|
  timestamp:1543713221085|type:LOGIN_SUCCESS|category:AUTH|
  operation:LOGIN SUCCESSFUL

user:cassandra|host:127.0.0.1:7000|source:/127.0.0.1|port:44826|
  timestamp:1543713221184|type:SELECT|category:QUERY|
  ks:simple|scope:tbl|operation:select * from simple.tbl;

user:null|host:127.0.0.1:7000|source:/127.0.0.1|port:44828|
  timestamp:1543713227014|type:LOGIN_ERROR|category:AUTH|
  operation:LOGIN FAILURE; Provided username baduser and/or 
  password are incorrect

System Views

  • Tables that provide information about local node
  • Provides more convienient access to metrics than JMX
  • Uses new Virtual Tables feature
  • Exposed through system_views keyspace

system_views.settings

  • Show live settings configuration

cqlsh> select * from system_views.settings;

 name                                        | value
---------------------------------------------+--------
                      start_native_transport |   true
                                storage_port |   7000
                      stream_entire_sstables |   true
 stream_throughput_outbound_megabits_per_sec |    200
              streaming_connections_per_host |      1
         streaming_keep_alive_period_in_secs |    300
                 tombstone_failure_threshold | 100000
                    tombstone_warn_threshold |   1000
                         tracetype_query_ttl |  86400
                        tracetype_repair_ttl | 604800

system_views.thread_pools

  • Show thread pool usage statistics
  • Similar output to nodetool tpstats

cqlsh> select * from system_views.thread_pools;

 name                         | active_tasks | active_tasks_limit | blocked_tasks | blocked_tasks_all_time | completed_tasks | pending_tasks
------------------------------+--------------+--------------------+---------------+------------------------+-----------------+---------------
             AntiEntropyStage |            0 |                  1 |             0 |                      0 |               0 |             0
         CacheCleanupExecutor |            0 |                  1 |             0 |                      0 |               0 |             0
           CompactionExecutor |            1 |                  2 |             0 |                      0 |             236 |             0
         CounterMutationStage |            0 |                 32 |             0 |                      0 |               0 |             0
                  GossipStage |            0 |                  1 |             0 |                      0 |            1655 |             0
              HintsDispatcher |            0 |                  2 |             0 |                      0 |               0 |             0
        InternalResponseStage |            0 |                  4 |             0 |                      0 |               1 |             0
          MemtableFlushWriter |            1 |                  2 |             0 |                      0 |              61 |             0
            MemtablePostFlush |            1 |                  1 |             0 |                      0 |             105 |             0
        MemtableReclaimMemory |            0 |                  1 |             0 |                      0 |              61 |             0
               MigrationStage |            0 |                  1 |             0 |                      0 |              22 |             0
                    MiscStage |            0 |                  1 |             0 |                      0 |               0 |             0
                MutationStage |            4 |                 32 |             0 |                      0 |         1674970 |             1
    Native-Transport-Requests |           20 |                128 |             0 |                  12977 |         2622028 |             0
       PendingRangeCalculator |            0 |                  1 |             0 |                      0 |               3 |             0
 PerDiskMemtableFlushWriter_0 |            1 |                  2 |             0 |                      0 |              61 |             0
              ReadRepairStage |            0 |                  4 |             0 |                      0 |               0 |             0
                    ReadStage |            1 |                 32 |             0 |                      0 |          487835 |             1
                  Repair-Task |            0 |         2147483647 |             0 |                      0 |               0 |             0
         RequestResponseStage |            0 |                  4 |             0 |                      0 |               7 |             0
                      Sampler |            0 |                  1 |             0 |                      0 |               0 |             0
     SecondaryIndexManagement |            0 |                  1 |             0 |                      0 |               0 |             0
           ValidationExecutor |            0 |         2147483647 |             0 |                      0 |               0 |             0
            ViewBuildExecutor |            0 |                  1 |             0 |                      0 |               0 |             0
            ViewMutationStage |            0 |                 32 |             0 |                      0 |               0 |             0

system_views.caches

  • Show cache statistics
  • Similar output to nodetool info

cqlsh> select * from system_views.caches;

 name     | capacity_bytes | entry_count | hit_count | hit_ratio | recent_hit_rate_per_second | recent_request_rate_per_second | request_count | size_bytes
----------+----------------+-------------+-----------+-----------+----------------------------+--------------------------------+---------------+------------
   chunks |       95420416 |          19 |      1021 |  0.965028 |                         19 |                             22 |          1058 |     311296
 counters |       12582912 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0
     keys |       25165824 |      249751 |    297971 |  0.547795 |                        268 |                            484 |        543946 |   21978104
     rows |              0 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0

system_views.clients

  • Show client connections

cqlsh> select * from system_views.clients;

 address   | port  | connection_stage | driver_name                                  | driver_version | hostname  | protocol_version | request_count | ssl_cipher_suite | ssl_enabled | ssl_protocol | username
-----------+-------+------------------+----------------------------------------------+----------------+-----------+------------------+---------------+------------------+-------------+--------------+-----------
 127.0.0.1 | 43522 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           423 |             null |       False |         null | anonymous
 127.0.0.1 | 43524 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           419 |             null |       False |         null | anonymous
 127.0.0.1 | 43530 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           418 |             null |       False |         null | anonymous
 127.0.0.1 | 43536 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           419 |             null |       False |         null | anonymous
 127.0.0.1 | 43540 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           419 |             null |       False |         null | anonymous
 127.0.0.1 | 43546 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           418 |             null |       False |         null | anonymous
 127.0.0.1 | 43552 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           419 |             null |       False |         null | anonymous
 127.0.0.1 | 43556 |            ready | DataStax Node.js Driver for Apache Cassandra |          4.0.0 | localhost |                4 |           419 |             null |       False |         null | anonymous

system_views.sstable_tasks

  • Shows active compactions
  • Similar output to nodetool compactionstats

cqlsh> select * from system_views.sstable_tasks;

 keyspace_name | table_name | task_id                              | kind       | progress | total    | unit
---------------+------------+--------------------------------------+------------+----------+----------+-------
     keyspace1 |  standard1 | 07efc380-f5cb-11e8-a574-a73cf9e584bb | compaction | 26840071 | 78425364 | bytes

More to come…

  • CASSANDRA-14670 adds 10 more system views for exposing metrics available from JMX.

Full Query Log (fqltool)

  • Enables recording queries executed on a node to file
  • recordings can than be replayed in different environments
  • and then compared..

nodetool enablefullquerylog

  • Used to turn on full query logging
  • Configured to be safe to use in production
  • Dumps log in binary format (same format as BinAuditLogger)

$ nodetool enablefullquerylog --path /tmp/simple
$ cqlsh -e "INSERT INTO simple.tbl (k, v) values (2, 3)"
$ cqlsh -e "SELECT v from simple.tbl where k = 2"
$ nodetool disablefullquerylog

fqltool dump

  • Shows the queries executed
  • Can also follow logs to see queries as they happen

$ tools/bin/fqltool dump /tmp/simple
...
Type: single-query
Query start time: 1543694938788
Protocol version: 4
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1543694938
Query: INSERT INTO simple.tbl (k, v) values (2, 3);
Values: 
...
Type: single-query
Query start time: 1543694960336
Protocol version: 4
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1543694960
Query: SELECT v from simple.tbl where k = 2;
Values: 

fqltool replay

  • Replays query log against target node and records results

$ tools/bin/fqltool replay 
  --keyspace simple 
  --results /tmp/results1 \
  --store-queries /tmp/queries \
  --target 127.0.0.1 /tmp/simple

$ bin/cqlsh -e "drop table simple.tbl"

$ tools/bin/fqltool replay --keyspace simple \
  --results /tmp/results2 \
  --target 127.0.0.1 /tmp/simple

fqltool compare

  • Compare the results of replaying query log in separate instances

$ tools/bin/fqltool compare --queries /tmp/queries \
  /tmp/results1/127.0.0.1 /tmp/results2/127.0.0.1

Query against /tmp/results2/127.0.0.1 failure:
Query = INSERT INTO simple.tbl (k, v) values (2, 3);, Values = 
Message: table tbl does not exist
...
Query against /tmp/results2/127.0.0.1 failure:
Query = SELECT v from simple.tbl where k = 2;, Values = 
Message: table tbl does not exist
MISMATCH:
Query = SELECT v from simple.tbl where k = 2;, Values = 
Results:
/tmp/results1/127.0.0.1: 00000003
/tmp/results2/127.0.0.1: null

Experimental Features

Experimental Features

  • New features can now be phased in as 'experimental'
  • Good way to make features available for evaluation and prove out
  • Are disabled by default, but can be enabled in cassandra.yaml
  • Materialized views are now labeled experimental (but enabled by default)

Transient Replication

  • Users desire high availability with QUORUM consistency
  • Replication Factor 3 with QUORUM can only tolerate 1 node being down
  • Need RF 5 to tolerate 2 nodes
  • ..but the cost of replication data 5 times is expensive!
  • "With Transient Replication, you can go from 3 replicas to 5 replicas, two of which are transient, without adding any hardware." [1]

Transient Replication

  • Increase availability without committing to full replicas
  • Transient replicas only store data when not enough normal replicas are available, or if normal replicas are slow
  • All nodes can be transient replicas for some ranges
  • 
    CREATE KEYSPACE tr WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'datacenter1': '5/2'
    };

Transient Replication

  • For reads, if enough replicas are available for CL and data is available on a normal replica, request can succeed
  • Data is removed from transient replicas on successful incremental repair to normal replicas
  • In the particular case of 5/2, you save 40% disk space over RF 5

Java 11 Support

  • Java 11 is now supported experimentally
  • This impacts the build process a little bit
  • Java Flight Recorder is now freely available
  • Access to new GCs that focus on low pause times

Experimental Garbage Collectors

Two new and exciting GCs for Cassandra

Both are concurrent compacting GCs that focus on reducing pause times to sub milliseconds

Shenandoah

  • Non-generational concurrent compactor
  • Heap is divided into 2048 even-sized regions
  • Works with JDK8+, but requires custom build

ZGC

  • Aim is to support multiple terabyte heaps with max 10ms pause time, with only 15% throughput reduction
  • Also a non-generational concurrent compactor
  • Heap is divided into regions (ZPages) of varying sizes (2MB, 32MB, X*2MB)
  • Experimental in JDK 11+

Testing ZGC

  • 3 node i3.4xlarge C* cluster, 1 c5.9xlarge stress client
  • C* Stress write iot workload (98 byte rows)
  • Ran at fixed rate of 20k ops/sec (enough to generate roughly ~30% cpu load)
  • Also include CMS and G1 using recommended settings

GC Tuning Parameters

  • G1 params:
    -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=16 -XX:ConcGCThreads=16 -Xms32765M -Xmx32765M -Xmn1600M
  • CMS params (default):
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -Xms8192M -Xmx8192M -Xmn1600M
  • ZGC params:
    -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseLargePages -Xms32765M -Xmx32765M

Results

GC mean 95p 99p 999p
CMS 3.0ms 2.7ms 76.6ms 187.4ms
G1 5.9ms 0.7ms 192.0ms 349.7ms
ZGC 0.9ms 0.5ms 3.5ms 149.0ms

Enhancements

Zero Copy Streaming

  • Existing streaming implementation creates a lot of garbage by serializing partitions into memory
  • New implementaton uses ZeroCopy to stream over the entire SSTable without reading into memory
  • Nice performance improvement, streaming is now bound by networking and file I/O
  • Enabled by default, can disable, control throttling (def: 25 MB/s) in cassandra.yaml

Incremental Repair Improvements

This could take a bit of time to go over, if we have time later and anyone is interested we can come back to this

Incremental Repair

Incremental repair has a number of shortcomings:

  • compacted sstables between validation and anticompaction are marked unrepaired
  • can really only safely run one session at at time between replicas that share data
  • Doesn't work well with Leveled Compaction
  • anticompaction is expensive
  • Once you use incremental repair, you have to keep at it

General advisement is to use subrange repairs instead*.

Incremental Repair in 4.0

  • Process is now transactional
  • Anticompaction now performed at beginning in prepare phase (and not marked repaired)
  • SSTables involved are locked until repair completes and can only be compacted with other sstables involved in that session
  • If SSTable is part of ongoing compaction, will try to cancel wait up to a minute to see if it completes or cancels
  • SSTables 'repairedAt' metadata is updated when repair completes and are unlocked for future compaction with other sstables

More Repair Work in 4.0

Hybrid Speculative Retry

  • Speculative Retry can now combine constant and percentile based speculative retry
  • Useful for cases where a slow node causes percentile response time to increase
  • i.e.: ALTER TABLE x WITH speculative_retry='MIN(99P,50ms)' = speculate on the minimum of 99p response time and 50ms

nodetool

nodetool profileload

cqlsh> nodetool profileload

  Frequency of reads by partition:
  Table        Partition Count +/-
  basic.wide   row1      75424 0 
  basic.cas    p1        656   0
  system.paxos 7031      550   0 

  Frequency of writes by partition:
  Table        Partition Count +/-
  system.paxos 7031      585   0 
  basic.cas    p1        112   0 

  Frequency of cas contentions by partition:
  Table     Partition Count +/-
  basic.cas p1        76    0 

  Max mutation size by partition:
  Table      Partition Bytes
  basic.wide row0      1056
  basic.wide row7      1056

  Longest read query times:
  Query                                                Microseconds
  SELECT * FROM basic.wide WHERE key = row1 LIMIT 5000 25681       
  SELECT * FROM basic.wide WHERE key = row1 LIMIT 5000 16131

Upgrading

  • Capable of upgrading from any C* 3.0 and 3.x version
  • CASSANDRA-14197: SSTables can be configured to be automatically upgraded (automatic_sstable_upgrade)

Other Changes

Summary

  • Features that have been added are more operational then developer facing
  • Experimental Features allow phasing in new functionality while signaling more work needs to be done
  • Plenty of useful and needed enhancements to existing components
  • Improved tooling to assist operators

Thank You