When we founded WitFoo five years ago, we wanted to analyze data in SIEM and other data stacks to provide craft knowledge that would stabilize communications within cybersecurity teams and between those teams and their organizations. A few months into that journey we realized there were fundamental problems in how existing SIEM and log aggregators collected and stored data which prompted us to add big data processing to the scope of our venture.
Elastic is no longer Apache Licensed
Recent news of Elastic abandoning the Apache 2.0 licensing model has sent devastating shockwaves through the data community. If you are not familiar with the impact, VM Brasseure summarizes the impact and implications well in Elasticsearch and Kibana are now business risks. This move puts development teams deciding between 1) increasing product pricing to pay for a commercial Elastic license (because of increased COGS) 2) staffing several FTE to smash ongoing security vulnerabilities 3) forking the last Apache 2.0 licensed version of Elastic into a new OSS project or 4) dropping Elastic for another more friendly platform. In this entry I am going to give some insight into navigating the 4th option.
WitFoo Precinct 4.0 on Elastic
In September 2017, we released WitFoo Precinct 4.0 built on Elasticsearch. Our hope was that we could build our analytics on top of the work Elastic had done. We met with teams at Elastic to weigh the merits of licensing X-Pack and other features in our platform. We also tested the platform on 10 separate production networks.
Reasons we dropped Elastic
At WitFoo, we build our product using an approach we call Metric Driven Development. We generate detailed metrics in all components of our product to inform our testing in development and production. By mid-October 2017, the metrics we were seeing exposed problems that made it clear our platform was going to require a major architecture pivot. The core problems we could not overcome with Elastic were:
- Non-linear scale – Every node we added also added disk and memory overhead. This meant that our customers got less value from every Elastic node added to the cluster.
- Memory Requirements – Built on Lucene, the use of memory in the ELK stack gets extremely expensive as clusters grow beyond 10 TB.
- High Disk IOPS – Disk thrashing was an issue when we needed to perform batch retrospective analysis on large data sets. Customer deployments on slower, magnetic disks were completely paralyzed, making our product only useful to those that could afford fast solid-state drives or could pay a premium for fast cloud IOPS rates.
- Variable Configuration – Configuring Elastic required detailed understanding of a customer’s data rates, data formats and the underlying hardware. The variability here required significant professional services on each customer deployment which violated one of our key business directives to deliver turn-key functionality.
- Non-linear Parsing Scale – Using Logstash to setup our adaptive parsing engine incurred significant overhead, both on the dev (too many quirks and verbose language) and performance side (high cpu usage for small loads).
The short of it was, Elastic was too complicated to manage and required expensive hardware to be useful. This would not work with our business directives of cost-contained pricing, turn-key operation and ability to run in any cloud, VM or physical environment.
The Half Measure of Precinct 5.0
Those familiar with the story of WitFoo know that the second half of 2017 was a dark and heavy part of our history. When Ryan, our Chief Data Researcher, and I were looking at a forklift of the architecture, we were low on energy and time to do it. WitFoo Precinct 5.0 was built on Oracle MySQL NDB in hopes we could limp along on a venerable, reliable platform. We ran into endless Brewer (CAP) theorem problems (mostly around synchronization) that quickly informed us we couldn’t camp there long.
Enter Apache Cassandra
In 2019, the dev team got together in Grand Rapids, Michigan and mapped out what it would take to get terabyte throughput and petabyte retention in a streaming platform. After going through our headaches with Elastic, MySQL, Hadoop and NDB we circled back to some early experimentation with Cassandra. The hype in the Cassandra Community claimed it could address all our trauma triggers plus deliver the following:
- Data center awareness
- Memory contained operation
- Load balanced writes and reads
- An Apache Foundation license (no COGS)
The Cassandra Way
We were a small, exhausted development team and extremely tempted to shove our existing code into Cassandra. We took a deep breath and decided that if we were going to do our 5th forklift, we were going to lean into Cassandra philosophies. We were going to adapt our code to Cassandra best-practices and do our best to use OOTB (out of the box) configurations. We were going to re-think our code to take advantage of Cassandra’s strengths, which required structuring things around Cassandra’s design.
JSON is the Schema
As old data guys, CREATE TABLE… has filled tomes of fun memories. The old days of defining foreign keys and establishing precision in column definitions were screaming to be part of our conversation. One of the core realizations in Cassandra (and NoSQL more broadly) is JSON format carries a ton of benefits. You can handle data transformation in the processing, API and UI levels. Version control changes of schema changes don’t require expensive data table maintenance.
JOINS are Dead
JSON objects can be created with all relevant relationship data in its structure. If we need an object representing different relationships, we write a different JSON object. There is no longer a need to establish cross-table relationships that slow queries to a crawl.
Compression is Faster??
JSON compression algorithms have been fine-tuned over the decades and perform well with the type of data we process. Cassandra natively supports compression. Compressed data writes and reads = reduced disk IOPS. Yep, compression with JSON in Cassandra not only saves disk space, it makes processing faster.
Most of our Cassandra tables have three columns 1) partition (string) 2) created_at (time_uuid) and 3) JSON. If you are coming from Elastic, a partition is closely related to an index and our created_at is a record id. When we perform batch processing in Cassandra, we can query
select * from table where partition = xxx; to get all rows in the partition or
select * from table where created_at = yyy and partition = xxx; to get a specific record. These types of queries are extremely fast in Cassandra.
TTL – The Free Delete
When we insert rows into Cassandra, we add a time to live (TTL) to the row. This moves data deletion and garbage collection as a task for Cassandra to organically address. We have logic in our processing code to set those.
Avoiding Data Hotspots
Big Data platforms can suffer from “hot spots” when some indexes/partitions/buckets have much more data than its peers. It can choke the data engine and overwhelm the heap of the code processing that partition. To keep queries clean, we count how many inserts we make to a partition then stop at the max. Using our highest volume data type (a normalized record called WitFoo Artifact), we write 1,000 rows to a partition in the table. We set the TTL for rows in this table based on retention settings provided by the user.
This approach not only makes it easy for Cassandra to compact, garbage collect, and replicate, it also makes it easy for our batch processing engines to load and process these partitions. The threads request a complete partition from Cassandra. Because we set the maximum size of the partition, we do not have to worry about the memory heap of the thread being overloaded (causing OOM.)
Index of Data
As data is being processed, our code is building an index of what data is in each partition. That data is persisted as JSON. When we need to find specific rows, we load the index JSON objects and merge them to produce a list of partitions that have the records we need. We then fetch those partitions from Cassandra (which is crazy fast) and filter in our processing thread the records that match our criteria.
We code all of our processing in Scala and use the Datastax Java Driver for Cassandra to interact with our data clusters. Each node has a daemon running which coordinates the addition and deletion of new data, streaming/processing, and management/ui nodes. A detailed overview of our platform can be found on the Demo page of our website and architecture training can be found on the WitFoo Community (free log-in required.)
Moving through six data architectures to get to Apache Cassandra was a painful process that we wouldn’t wish on an enemy. When we made that commitment to making the change and to embrace the Cassandra philosophies, we found lasting and growing satisfaction in the move. In addition to being a fantastic big-data platform, it also has the merit of being an Apache Software Foundation project which keeps us from night terrors of licensing bait and switches like Elastic just dropped on its “partners.” It’s our hope that you find some inspiration in our lessons and avoid some of the scars we picked up. Humankind is at a pivotal point in history, that requires us to work together to find new innovative ways to process the petabytes of data we create every day, and the WitFoo R&D team has found a friend in Cassandra through our endeavors to meet big-data needs in cybersecurity operations.