AtlasDB is a massively scalable datastore and transactional layer that can be placed on top of any key-value store to give it ACID properties. This is the first of several blog posts that will introduce AtlasDB and describe how we built it at Palantir.
In 2010, Daniel Peng and Frank Dabek of Google published a paper entitled “Large-scale Incremental Processing Using Distributed Transactions and Notifications”. The paper describes a system in use at Google named Percolator, which sits on top of BigTable, Google’s distributed key-value store.
Google needed transactions with ACID properties. They also needed a highly fault-tolerant system to deal with the potential failure of key parts of their system under load at massive scales. This drove them to push the accounting data for the transactions into BigTable as well, as BigTable already handled both replication of its data and fault tolerance.
Unfortunately, due to the number of writes involved in the transaction accounting, they saw an approximately 75% performance hit when using transactions. Percolator was built to enable incremental updates to the search indexes for the Google search engine, which was previously a periodic batch process. In this case, the extra reliability afforded by using BigTable to track the transactions was the important factor; even though there was a significant performance hit (over using raw BigTable), the overall performance was high enough to meet its design criteria.
Meanwhile, at Palantir, we were hitting a similar obstacle. The interactive analytic core of Palantir Gotham, which was originally built with a traditional RDBMS as its backing store, was hitting the limits of economical and easy scaling. We needed to move to a distributed backing store for scalability, but we needed transactions to enable our higher-level Revisioning Database to work correctly.
Percolator presented interesting possibilities, but with a 75% performance hit, the latency would be too long for our users. We shoot for a maximum ten second wait time when doing interactive analysis—anything longer is an unacceptable interruption to our users’ investigative flow. Studying the Percolator protocol, our engineers saw some places where design constraints could be relaxed to lower the latency of each operation.
And so, the idea for AtlasDB was born. Now it was just a matter of building it.
We take a human-centered design approach to building software. Instead of asking what technological problem we want to attack in isolation, we ask, “What features and infrastructure would a human analyst need to do his or her work?” To answer this question, we use a holistic understanding of how low-level data integration, scalable data servers, API layers, and an entire suite of user interface tools, when properly integrated, create an efficient, frictionless user experience for non-technical subject matter experts working with large-scale, disparate data to solve real-world problems. It’s an over-arching user experience (UX) play that decomposes into a lot of hard technical problems—similar to building something as complex and seamless as an iPhone.
When components already exist that serve our needs, we are happy to use them—our products use several of the high-quality open source datastores, map-reduce frameworks, and search engines. But we build new things whenever we identify a capability gap. Some examples of this from our past:
In 2005, when we first started building Palantir Gotham, there wasn’t really a viable alternative to the RDBMS. The Revisioning Database, the Palantir Gotham persistent datastore, was originally implemented as a special schema inside a SQL database. The SQL RDBMS performed well for our users up to about a terabyte of data. But as the size and scope of Palantir Gotham-based analytic workflows pushed the database to its limits, there were only two available options if we stuck with an RDBMS:
We didn’t like either of these options, so we began considering non-RDBMS-based solutions. We started with a federated approach that let us address much larger scales of source data without scaling the core. We developed Palantir Phoenix, a petabyte-scale datastore that can run map-reduce and other batch-oriented processes to filter and classify data that needs to be reviewed by human analysts. By federating search and storage of massive-scale data out to Palantir Phoenix, and importing relevant results into Palantir Gotham on the fly, we could guarantee analysts would still have all the data they need at their fingertips without storing everything in the Palantir Gotham Revisioning Database.
For example, a cyber security workflow may include network flow data, proxy logs, malware scanning results, inventory data, and authentication and VPN logs. The vast majority of the data in these data sets is not of interest to cyber security analysts—it overwhelmingly represents legitimate traffic. But when something bad happens, such as malware being detected on a user’s laptop, analysts can pull the relevant and related narrow slices of data from Phoenix into Palantir Gotham to determine the extent and severity of the intrusion. Using our Dynamic Ontology, data is mapped into the Palantir object model and managed in three separate subsystems:
While the data that ends up in Palantir Gotham can be much smaller than the total data size of an instance, it can still get pretty big. Moreover, all of the housekeeping Palantir Gotham does around the source of the data (e.g., the revisions and security information) requires us to store 2-5x more information than the size of the initial imported source data alone.
It soon became clear that we were going to need to replace the datastore for the Revisioning Database. An obvious place to look for economical scalability was the class of datastores dubbed ‘NoSQL’.
NoSQL datastores use collections of commodity-class machines working in concert, enabling engineers to build a distributed system capable of scaling up to large data volumes and high performance with a smooth price curve—add more machines, get more capacity. When we first built the Revisioning Database in 2005, NoSQL systems offered intriguing potential as an approach but were still plagued with performance, scale, and most importantly, data loss and corruption problems. In the intervening years, these early problems have largely been engineered away.
Today, these systems underlie much of the modern web and are developed and used by companies like Google, Facebook, and Amazon. Many of them use a key-value (K/V) model, wherein a short, unique identifier (the key) is used to access an individual storage cell (the value). The storage cell may hold a simple value or a larger, complex data structure.
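The key-value model and its smooth scaling curve can be sketched in a few lines. This is an illustrative toy, not AtlasDB or any real NoSQL store: keys are routed to one of several "nodes" by hashing, so adding nodes adds capacity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not AtlasDB code): a key-value store spread across
// several "nodes", where each key is routed to one node by hashing it.
// Adding nodes adds capacity—the smooth price curve described above.
public class ShardedKeyValueStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public ShardedKeyValueStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // Route each key to a node by its hash.
    private Map<String, String> nodeFor(String key) {
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public void put(String key, String value) {
        nodeFor(key).put(key, value);
    }

    public String get(String key) {
        return nodeFor(key).get(key); // null if the key is absent
    }

    public static void main(String[] args) {
        ShardedKeyValueStore store = new ShardedKeyValueStore(4);
        store.put("user:42:name", "Ada");
        System.out.println(store.get("user:42:name")); // prints "Ada"
    }
}
```

A real distributed store replaces the in-process maps with machines and the modulo hash with a partitioning scheme that tolerates nodes joining and leaving, but the addressing idea is the same.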
While NoSQL stores have great scaling properties, they make weaker guarantees about the consistency of the system. Since each node of the system runs independently and any given value could live on any node, there is no way to know whether a read spanning more than one node is consistent, i.e., reflects a single point in time. For many uses (e.g., updating fields on a social network profile page), this property (called eventual consistency) is not a problem.
Unfortunately, a system like Palantir Gotham stores not just individual values but sets of related values that need to be read and written consistently, such as many index entries along with a primary value. Without consistent reads, any operation that uses values from multiple keys can never be guaranteed to be correct.
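The hazard is easy to demonstrate even in a single process. In this sketch (key names are invented for illustration), a primary value and its index entry must change together, but a plain key-value store applies them one write at a time, so a reader who lands between the two writes observes a state that never logically existed:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the multi-key consistency hazard: a primary value
// plus an index entry must change together, but a bare key-value store
// updates them one write at a time.
public class TornReadDemo {
    public static void main(String[] args) {
        Map<String, String> kv = new HashMap<>();

        // Initial consistent state: object 1 is named "Alice",
        // and the name index points back at object 1.
        kv.put("object:1:name", "Alice");
        kv.put("index:name:Alice", "object:1");

        // A rename needs two writes. The first write lands...
        kv.put("object:1:name", "Bob");

        // ...and a reader arrives before the index is updated.
        String name = kv.get("object:1:name");         // "Bob"
        String indexed = kv.get("index:name:" + name); // null: no index entry yet

        System.out.println("name = " + name + ", index entry = " + indexed);

        // Only after the second write is the state consistent again.
        kv.remove("index:name:Alice");
        kv.put("index:name:Bob", "object:1");
    }
}
```

In a distributed store the window between the two writes can span network hops and replicas, which makes the inconsistent interval longer and harder to reason about, not shorter.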
Fortunately, implementing transactions solves this problem by providing four guarantees, collectively referred to as ACID:

Atomicity: either all of a transaction’s writes become visible, or none of them do.

Consistency: each transaction moves the datastore from one valid state to another.

Isolation: concurrent transactions do not see each other’s partial effects.

Durability: once a transaction commits, its writes survive subsequent failures.
Aside from the formal guarantees provided by transactions, there is a very practical consideration: without transactions, programmers have to reason very carefully about consistency and write a lot of code to try to manage it. As a result, development proceeds slowly and the code is much more fragile in the face of future changes. Pushing the consistency (and by extension, correctness) logic down into a transaction layer is usually the right separation of concerns.
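That separation of concerns can be sketched with a toy transaction layer. The API below is hypothetical (it is not AtlasDB's interface): application code simply states the related writes, and the layer buffers them and applies them all or not at all.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// A sketch of why a transaction layer simplifies application code: the
// caller describes related writes, and the layer applies them all-or-nothing.
// The API names here are hypothetical, not AtlasDB's.
public class TinyTransactionLayer {
    private final Map<String, String> store = new HashMap<>();

    // A transaction buffers writes; nothing touches the store until commit.
    public class Transaction {
        final Map<String, String> writes = new HashMap<>();
        public String get(String key) {
            return writes.containsKey(key) ? writes.get(key) : store.get(key);
        }
        public void put(String key, String value) {
            writes.put(key, value);
        }
    }

    public void runTransaction(Consumer<Transaction> task) {
        Transaction tx = new Transaction();
        task.accept(tx);         // if this throws, nothing is written
        store.putAll(tx.writes); // otherwise all buffered writes land together
    }

    public String get(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        TinyTransactionLayer db = new TinyTransactionLayer();
        // Application code just states the related writes; the layer keeps
        // the primary value and its index entry consistent.
        db.runTransaction(tx -> {
            tx.put("object:1:name", "Alice");
            tx.put("index:name:Alice", "object:1");
        });
        System.out.println(db.get("object:1:name")); // prints "Alice"
    }
}
```

A real transaction layer must also handle concurrency, conflicts, and durability, but the shape of the application code stays this simple—which is the practical payoff described above.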
The design of AtlasDB departs from Percolator in a few key aspects. Moving key locking out of the main datastore and into a dedicated lock server reduced the write overhead, increasing performance. Further improvements were gained by allowing the transaction accounting table to live in a datastore separate from the main scalable datastore. This decoupling allows the transaction data to live in a system that trades some scalability for higher write performance. Since the transaction accounting data is quite compact, this is a huge win for the performance of the overall system. (We’ll cover the protocol and architecture in depth in a later post.)
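To make the division of labor concrete, here is a deliberately simplified, single-process sketch of a commit path with those two components pulled out: locks come from a dedicated lock service rather than the datastore, and the compact start-to-commit record lives in its own fast transaction table. All names and details here are illustrative stand-ins, not the actual protocol.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified single-process sketch: locking lives in a separate "lock
// service", and the compact commit record lives in a separate "transaction
// table", keeping both kinds of traffic out of the main scalable datastore.
public class CommitSketch {
    static final Set<String> lockService = new HashSet<>();          // dedicated lock server
    static final Map<Long, Long> transactionTable = new HashMap<>(); // startTs -> commitTs
    static final Map<String, String> dataStore = new HashMap<>();    // main scalable K/V store
    static long clock = 0;                                           // toy timestamp source

    static void commit(long startTs, Map<String, String> writes) {
        // 1. Lock the keys being written via the lock service, keeping
        //    lock bookkeeping out of the main datastore.
        for (String key : writes.keySet()) {
            if (!lockService.add(key)) {
                throw new IllegalStateException("write conflict on " + key);
            }
        }
        try {
            // 2. Push the buffered values into the main datastore.
            dataStore.putAll(writes);
            // 3. Record the commit in the compact transaction table; this
            //    one small write is what makes the transaction visible.
            transactionTable.put(startTs, ++clock);
        } finally {
            // 4. Release the locks.
            lockService.removeAll(writes.keySet());
        }
    }

    public static void main(String[] args) {
        long startTs = ++clock;
        Map<String, String> writes = new HashMap<>();
        writes.put("object:1:name", "Alice");
        writes.put("index:name:Alice", "object:1");
        commit(startTs, writes);
        System.out.println(dataStore.get("object:1:name")); // prints "Alice"
    }
}
```

The point of the sketch is the traffic pattern: many bulky value writes go to the scalable store, while the per-transaction accounting is a single tiny write to a store tuned for write speed.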
The NoSQL-transaction revolution still required a few more developments to make the burgeoning AtlasDB as engineer- and user-friendly as possible. Along with the core changes to the transaction protocol and system architecture, our team set about designing a system that could be used with almost any key-value store. Rather than being tied to a particular key-value store, we decided to build an API layer that exposes transaction primitives. This API layer (effectively just a code library), along with a few lightweight network services, created a system that could be applied to any key-value store. Writing the driver for a new key-value store thus became a one-day task comprising, at most, a few hundred lines of code.
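The driver idea can be illustrated as follows; the interface name and methods below are hypothetical, not the actual AtlasDB API. The transaction layer is written against a narrow key-value surface, so supporting a new backing store means implementing only a handful of methods:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the pluggable-driver idea (names are hypothetical,
// not the real AtlasDB API): the transaction layer depends only on a narrow
// key-value interface, so a new backing store needs only a small driver.
public class DriverSketch {
    // The narrow surface a new key-value store must provide.
    interface KeyValueDriver {
        byte[] get(String key);
        void put(String key, byte[] value);
        void delete(String key);
    }

    // A complete "driver" for an in-memory store takes a few lines; a driver
    // for a real store is a few hundred, as noted above.
    static class InMemoryDriver implements KeyValueDriver {
        private final Map<String, byte[]> data = new HashMap<>();
        public byte[] get(String key) { return data.get(key); }
        public void put(String key, byte[] value) { data.put(key, value); }
        public void delete(String key) { data.remove(key); }
    }

    public static void main(String[] args) {
        KeyValueDriver driver = new InMemoryDriver();
        driver.put("k", "v".getBytes());
        System.out.println(new String(driver.get("k"))); // prints "v"
    }
}
```

Everything above this interface—timestamps, locking, transaction accounting—is shared code, which is what keeps each new driver small.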
This is a good idea for a few reasons:
AtlasDB is now the default datastore inside of new production instances of Palantir Gotham and is already being re-purposed for other use cases that need both the scalability and consistency that it offers. Stay tuned for part two of the AtlasDB series, where we’ll do a deep dive into the system architecture and transaction protocol that make it go.