AtlasDB: Transactions for Distributed Key-Value Stores (Part I)

AtlasDB is a massively scalable datastore and transactional layer that can be placed on top of any key-value store to give it ACID properties. This is the first of several blog posts that will introduce AtlasDB and describe how we built it at Palantir.

Building AtlasDB: the Inspiration

In 2010, Daniel Peng and Frank Dabek of Google published a paper entitled Large-scale Incremental Processing Using Distributed Transactions and Notifications. The paper describes a system in use at Google named Percolator, which sits on top of BigTable, Google’s distributed key-value store.

Google needed transactions with ACID properties. They also needed a highly fault-tolerant system that could cope with the failure of key components under massive load. This drove them to push the transaction accounting data into BigTable as well, since BigTable already handled both data replication and fault tolerance.

Unfortunately, due to the number of writes involved in the transaction accounting, they saw an approximately 75% performance hit when using transactions. Percolator was built to enable incremental updates to the search indexes for the Google search engine, which was previously a periodic batch process. In this case, the extra reliability afforded by using BigTable to track the transactions was the important factor; even though there was a significant performance hit (over using raw BigTable), the overall performance was high enough to meet its design criteria.

Meanwhile, at Palantir, we were hitting a similar obstacle. The interactive analytic core of Palantir Gotham, which was originally built with a traditional RDBMS as its backing store, was hitting the limits of economical and easy scaling. We needed to move to a distributed backing store for scalability, but we needed transactions to enable our higher-level Revisioning Database to work correctly.

Percolator presented interesting possibilities, but with a 75% performance hit, the latency would be too long for our users. We shoot for a maximum ten second wait time when doing interactive analysis—anything longer is an unacceptable interruption to our users’ investigative flow. Studying the Percolator protocol, our engineers saw some places where design constraints could be relaxed to lower the latency of each operation.

And so, the idea for AtlasDB was born. Now it was just a matter of building it.

Designing AtlasDB

Understanding Data and Development at Palantir

We take a human-centered design approach to building software. Instead of asking what technological problem we want to attack in isolation, we ask, “What features and infrastructure would a human analyst need to do his or her work?” To answer this question, we draw on a holistic understanding of how low-level data integration, scalable data servers, API layers, and an entire suite of user interface tools combine to create an efficient, frictionless user experience for non-technical subject matter experts working with large-scale, disparate data to solve real-world problems. It’s an over-arching user experience (UX) play that decomposes into a lot of hard technical problems—similar to building something as complex and seamless as an iPhone.

When components already exist that serve our needs, we are happy to use them—our products use several high-quality open source datastores, map-reduce frameworks, and search engines. But we build new things whenever we identify a capability gap. Some examples of this in the past:

  • The Revisioning Database, the heart of Palantir Gotham that enables Nexus Peering
  • Nexus Peering, a technology that allows a single Palantir Gotham instance to be distributed, or for multiple instances to securely and safely exchange data
  • The Maps application, which allows the viewing of geospatial imagery and also the visualization and analysis of objects with associated coordinates
  • Horizon, an in-memory database that drives querying over billions of objects in interactive time, used to back the Object Explorer application

Scaling, Round 1: Federation and Palantir Phoenix

In 2005, when we first started building Palantir Gotham, there wasn’t really a viable alternative to the RDBMS. The Revisioning Database, the Palantir Gotham persistent datastore, was originally an implementation of a special schema inside a SQL database. The SQL RDBMS performed well for our users up to about a terabyte of data. But as the size and scope of Palantir Gotham-based analytic workflows pushed the database to its limits, there were only two available options if we stuck with an RDBMS:

  • Get a larger, more powerful computer. This works, but the price of computer hardware and advanced database software needed to support that scale grows super-linearly (sometimes exponentially) with the size of the data, making this approach really expensive, really fast.
  • Move to multiple computers and a sharded database architecture. While this can work well for certain database schemas, our schema is not well-suited to this approach. Sharding can also add a lot of complexity to the application using it, leading to a more bug-prone and fragile code base.

We didn’t like either of these options, so we began considering non-RDBMS-based solutions. We started with a federated approach that let us address much larger scales of source data without scaling the core. We developed Palantir Phoenix, a petabyte-scale datastore that can run map-reduce and other batch-oriented processes to filter and classify data that needs to be reviewed by human analysts. By federating search and storage of massive-scale data out to Palantir Phoenix, and importing relevant results into Palantir Gotham on the fly, we could guarantee analysts would still have all the data they need at their fingertips without storing everything in the Palantir Gotham Revisioning Database.

For example, a cyber security workflow may include network flow data, proxy logs, malware scanning results, inventory data, and authentication and VPN logs. The vast majority of the data in these data sets is not of interest to cyber security analysts—it overwhelmingly represents legitimate traffic. But when something bad happens, such as malware being detected on a user’s laptop, analysts can pull the relevant and related narrow slices of data from Phoenix into Palantir Gotham to determine the extent and severity of the intrusion. Using our Dynamic Ontology, data is mapped into the Palantir object model and managed in three separate subsystems:

  • a search server for fast full-text and wildcard searching;
  • a Horizon server for top-down filtering and comparisons of large sets of objects;
  • the Revisioning Database for tracking changes to the data and allowing analysts to work independently while also sharing analytic insights with each other. (This is also where the metadata that enables Palantir’s privacy and security controls is stored.)

While the size of the data that ends up in Palantir Gotham can be much smaller than the total data size of an instance, it can still get pretty big. Moreover, the housekeeping Palantir Gotham does around the source data (e.g., the revisions and security information) requires us to store 2-5x the size of the initially imported source data.

Scaling, Round 2: NoSQL K/V stores

It soon became clear that we were going to need to replace the datastore for the Revisioning Database. An obvious place to look for economical scalability was a class of datastores dubbed ‘NoSQL’.

NoSQL datastores use collections of commodity-class machines working in concert, enabling engineers to build a distributed system capable of scaling up to large data volumes and high performance with a smooth price curve—add more machines, get more capacity. When we first built the Revisioning Database in 2005, NoSQL systems offered intriguing potential but were still plagued by problems with performance, scale, and, most importantly, data loss and corruption. In the intervening years, these early problems have largely been engineered away.

Today, these systems underlie much of the modern web and are developed and used by companies like Google, Facebook, and Amazon. Many of these use a key-value (K/V) model, wherein a short, unique identifier (the key) is used to access an individual storage cell (the value). The storage cell may hold a simple value or a larger, complex data structure.
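
To make the model concrete, here is a minimal sketch in Java of the contract such a store exposes. The names (KeyValueStore, InMemoryStore) are hypothetical, chosen for illustration only; the point is that the store promises nothing beyond single-cell reads and writes.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch of the contract a key-value store exposes: single-cell reads and writes. */
    interface KeyValueStore {
        Optional<String> get(String key);
        void put(String key, String value);
    }

    /** In-memory stand-in; a real store spreads these cells across many commodity machines. */
    final class InMemoryStore implements KeyValueStore {
        private final Map<String, String> cells = new ConcurrentHashMap<>();

        @Override
        public Optional<String> get(String key) {
            return Optional.ofNullable(cells.get(key));
        }

        @Override
        public void put(String key, String value) {
            cells.put(key, value);
        }
    }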

While NoSQL stores have great scaling properties, they don’t make strong guarantees about the consistency of the system. Since each node of the system runs independently and any given value could lie on any node, it’s impossible to know whether a read spanning more than one node is consistent, i.e., reflects the same set of writes. For many uses (e.g., updating fields on a social network profile page), this property (called eventual consistency) is not a problem.

Unfortunately, for a system like Palantir Gotham, we’re not just storing individual values but sets of related values that need to be read and written consistently, like many index entries along with a primary value. Without consistent reads, any operation that uses values from multiple keys can never be guaranteed to be correct.
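
To make the hazard concrete, here is a minimal sketch (reusing the hypothetical KeyValueStore above) of a primary value and its index entry updated as two independent puts; a reader that runs between the two puts sees an index entry pointing at data that no longer matches.

    /** A multi-key update with no transaction: a reader can observe a half-applied write. */
    final class IndexSkewExample {
        public static void main(String[] args) {
            KeyValueStore store = new InMemoryStore();

            // Initial state: one object plus an index entry that points back at it.
            store.put("object/42", "name=alice");
            store.put("index/name/alice", "object/42");

            // Rename the object: two separate puts with no atomicity between them.
            store.put("object/42", "name=bob");
            // A concurrent reader running at this point sees the new value but the old index.
            System.out.println("stale index entry: " + store.get("index/name/alice").orElse("(missing)"));
            store.put("index/name/bob", "object/42");  // second half of the update lands later
        }
    }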

Fortunately, implementing transactions can solve this problem by providing four guarantees, referred to as ACID:

  • Atomicity - every distinct operation of a transaction succeeds or the state is rolled back as if the transaction never happened; there is no way to partially complete the update of multiple fields
  • Consistency - data is in a consistent state at the beginning and at the end of a transaction
  • Isolation - the work taking place inside a transaction is invisible to any other operation taking place, so two transactions can be run concurrently without interfering
  • Durability - once a transaction commits successfully, the data must have been written to non-volatile storage in such a way that it won’t be lost in the event of a crash or power failure

Aside from the formal guarantees provided by transactions there is a very practical consideration: without transactions, programmers have to reason very carefully about consistency and write a lot of code to try to manage it. As a result, development proceeds slowly and the code is much more fragile to future changes. Pushing the consistency (and by extension, correctness) logic down into a transaction layer is usually the right separation of concerns.
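
As an illustration of what that separation of concerns looks like from the application programmer’s side, here is a minimal sketch using a hypothetical transaction interface (not AtlasDB’s actual API): the primary value and its index entries are written inside one transaction, so other readers see either all of the update or none of it.

    import java.util.Optional;
    import java.util.function.Consumer;

    /** Hypothetical transaction handle: writes staged through it become visible atomically at commit. */
    interface Transaction {
        Optional<String> get(String key);
        void put(String key, String value);
        void delete(String key);
    }

    /** Hypothetical transaction manager; a real one also handles conflicts, retries, and durability. */
    interface TransactionManager {
        void runTransaction(Consumer<Transaction> task);
    }

    final class AtomicRenameExample {
        /** The rename from the earlier sketch, now atomic: the object and its index can never disagree. */
        static void rename(TransactionManager txm, String objectKey, String oldName, String newName) {
            txm.runTransaction(tx -> {
                tx.put(objectKey, "name=" + newName);
                tx.delete("index/name/" + oldName);         // remove the stale index entry
                tx.put("index/name/" + newName, objectKey); // add the new one in the same transaction
            });
        }
    }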

Setting the Stage for AtlasDB: A Transactional API for Key-Value Stores

AtlasDB Architecture

The design of AtlasDB departs from Percolator in a few key aspects. By moving key locking out of the main datastore and into a dedicated lock server, we reduced the write overhead and increased performance. Further improvements came from allowing the transaction accounting table to live in a datastore separate from the main scalable datastore. This decoupling lets the transaction data live in a system that trades some scalability for higher write performance. Since the transaction accounting data is quite compact, this is a huge win for the performance of the overall system. (We’ll cover the protocol and architecture in depth in a later post.)
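
A rough sketch of that decoupling, using hypothetical interfaces rather than AtlasDB’s real ones: the bulk data, the compact transaction accounting records, and the key locks each live in a component chosen for that job.

    /** Hypothetical wiring of the three independently chosen components described above. */
    final class TransactionalStoreSetup {
        /** Dedicated lock server: key locking happens here, not as extra writes into the datastore. */
        interface LockService {
            AutoCloseable lock(String key) throws InterruptedException;
        }
        /** Compact commit records; small enough to live in a less scalable but write-fast store. */
        interface TransactionLog {
            void recordCommit(long transactionTimestamp);
            boolean isCommitted(long transactionTimestamp);
        }

        private final KeyValueStore dataStore;  // the large, scalable store holding the actual cells
        private final TransactionLog txLog;     // separate, faster store for transaction accounting
        private final LockService lockService;  // standalone service consulted when transactions commit

        TransactionalStoreSetup(KeyValueStore dataStore, TransactionLog txLog, LockService lockService) {
            this.dataStore = dataStore;
            this.txLog = txLog;
            this.lockService = lockService;
        }
    }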

The NoSQL-transaction revolution still required a few more developments to make the burgeoning AtlasDB as engineer- and user-friendly as possible. Along with the core changes to the transaction protocol and system architecture, our team set about designing a system that could be used with almost any key-value store. Rather than being tied to a particular key-value store, we decided to build an API layer that exposed transaction primitives. The API layer (effectively just a code library), along with a few lightweight network services, created a system that could be applied to any key-value store. Writing the driver for a new key-value store thus became a one-day task comprising, at most, a few hundred lines of code.
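
As an illustration of how small that driver surface can be, here is a hypothetical contract (the names are ours, not AtlasDB’s): implement a handful of table and cell operations against a given backend, and the transactional layer above never needs to change.

    /** Hypothetical per-backend driver contract; one implementation per key-value store. */
    interface KeyValueServiceDriver {
        void createTable(String table);
        void dropTable(String table);
        byte[] get(String table, byte[] row, byte[] column);
        void put(String table, byte[] row, byte[] column, byte[] value);
    }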

This is a good idea for a few reasons:

  1. Deployment flexibility - An application built on top of AtlasDB would always see the same API, allowing the key-value store to be switched out for different levels of scale (see the configuration sketch after this list). Palantir Gotham needs to run at scales as small as a single laptop, both as a development environment for creating customizations and enhancements and in order to support disconnected, self-contained operation for environments that don’t have any networking infrastructure. It also needs to operate at petabyte scale for large production systems. Key-value stores that run at massive scale are usually somewhat complex to set up and administer; the easy-to-set-up key-value stores don’t scale.
  2. Pace of change in NoSQL - Though much more advanced than it was a decade ago, the NoSQL world has not yet reached full maturity. The capabilities, scale, and performance of different offerings are still rapidly evolving. Keeping AtlasDB free from committing to a single key-value store is essential, given the uncertainty around what the best options will be even a year or two from now.
  3. Consistent API - The bulk of the team’s work in completing AtlasDB was not building the transactional layer itself, but porting our extensive, existing data fusion platform from SQL to a key-value model of data storage. By abstracting the API into the AtlasDB transactional layer, we avoid having to port the entire product to yet another API in the future—switching datastores is as easy as writing a new K/V driver.
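
Continuing the hypothetical driver contract above, here is a sketch of what that flexibility could look like in practice: the backend is chosen by configuration, and nothing above the driver changes.

    import java.util.Arrays;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Toy driver for the hypothetical contract above; fine for a single laptop, not for a cluster. */
    final class InMemoryDriver implements KeyValueServiceDriver {
        private final Map<String, byte[]> cells = new ConcurrentHashMap<>();

        private static String cellKey(String table, byte[] row, byte[] column) {
            return table + "/" + Arrays.toString(row) + "/" + Arrays.toString(column);
        }

        @Override public void createTable(String table) { /* nothing to allocate in memory */ }
        @Override public void dropTable(String table) { cells.keySet().removeIf(k -> k.startsWith(table + "/")); }
        @Override public byte[] get(String table, byte[] row, byte[] column) { return cells.get(cellKey(table, row, column)); }
        @Override public void put(String table, byte[] row, byte[] column, byte[] value) { cells.put(cellKey(table, row, column), value); }
    }

    /** Hypothetical factory: the application sees the same transactional API no matter which driver is chosen. */
    final class BackendFactory {
        static KeyValueServiceDriver fromConfig(String backend) {
            if ("in-memory".equals(backend)) {
                return new InMemoryDriver();  // laptop development or disconnected operation
            }
            // A driver wrapping a scalable distributed store would be registered here for production clusters.
            throw new IllegalArgumentException("no driver registered for backend: " + backend);
        }
    }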

AtlasDB today

AtlasDB is now the default datastore inside of new production instances of Palantir Gotham and is already being re-purposed for other use cases that need both the scalability and consistency that it offers. Stay tuned for part two of the AtlasDB series, where we’ll do a deep dive into the system architecture and transaction protocol that make it go.