Palantir Monitoring Server: where build beats buy

Graph of CPU usage over time
Distributed systems are complex. Getting them right is hard, and when things don’t go right, it can be difficult to understand what went wrong. In an environment like ours, a good monitoring system isn’t just nice to have; it’s a critical component necessary for understanding behavior and diagnosing problems.

We had three primary goals for the initial monitoring system: graphing of time-series data, alerting on event triggers, and notifications to users. Furthermore, as a product company, we had a design goal of a simple, intuitive (yet powerful and flexible) solution.

Before starting, we did a quick survey of existing open-source packages. Unfortunately, nothing we found quite fit our needs, given our specific requirements around security, protocols, licensing, and integration into our product. Given that, we made the decision to forge ahead and build our own; we try not to reinvent the wheel, but it seemed to make sense here.

For an in-depth look at the architecture of the Monitoring Server and components we used to build it, read on…

Architecture

At the highest level, a two-tiered architecture made the most sense. The back-end, standalone server component would be responsible for collecting, processing, and exposing data through an API. The front-end component would be web-based portlets integrated into our existing management interface.

The server architecture was designed to allow generic components to work together, with everything connected via Spring. While we started with JMX as our collection method for monitoring data, the architecture treats it as just one pluggable component, with multiple data backends supported. A Spring web services API allows the front-end portlets to query and manipulate the components at each level.
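
To make the pluggable-backend idea concrete, here is a minimal sketch of what that seam might look like. The names (CollectionBackend, JmxCollectionBackend, MetricSample) are hypothetical and not taken from our codebase; the point is only that the server core depends on an interface rather than on JMX itself.

    import java.util.List;

    // Hypothetical seam between the server core and its data collectors. The
    // JMX backend is just one implementation; other backends can be added and
    // selected in the Spring configuration without touching the core.
    public interface CollectionBackend {

        // A single observed value for a named metric (hypothetical type).
        final class MetricSample {
            public final String metricName;
            public final long timestampMillis;
            public final double value;

            public MetricSample(String metricName, long timestampMillis, double value) {
                this.metricName = metricName;
                this.timestampMillis = timestampMillis;
                this.value = value;
            }
        }

        // Collect current values for every metric this backend knows about.
        List<MetricSample> collect();
    }

A JmxCollectionBackend would implement this interface by reading MBean attributes; a backend for some other protocol would implement the same interface and slot in through configuration.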

Our first release ships only the JMX backend, so this is what the production architecture looks like for now:

Monitoring Server architecture diagram

Components

Any time you choose build instead of buy, there’s a lot of work to be done to get the full set of functionality you need. Fortunately, the Java platform has an extremely rich set of freely available projects and libraries, and we leveraged many of them for the back-end:

  • JMX: the core of our system, the Java Management Extensions (JMX) is a standard for managing and monitoring Java applications. We use JMX to instrument and monitor our own servers, and because it’s an adopted standard, we gain access to MBeans exposed by third-party components as well (see the instrumentation sketch after this list).
  • rrd4j: round-robin databases (RRDs) are an excellent storage format for time-series data, and rrd4j is a pure Java implementation of the legendary RRDTool. The round-robin format allows for a fixed-size file, since older data is overwritten as newer data arrives. The multi-resolution aspect of the files provides long historical views without a space premium. For example, an RRD can contain a high-resolution series for recent information and a low-resolution series for long-term data (see the sketch after this list).
  • HSQLDB: a lightweight, native Java, SQL database that can be run in-process. We use HSQLDB to store all non–time-series information, such as metadata about metrics we’re monitoring.
  • Quartz: an open source job scheduling system, we use Quartz primarily for scheduling Alerts. Alerts run periodically to check for a condition, and notify if triggered. Each Alert’s wait period is specified by the user, and fortunately, with Quartz it’s easy to schedule many Alerts at different frequencies.
  • Groovy: self-described as “an agile dynamic language for the Java Platform,” Groovy is integrated into our alerting system. Alerts can contain Groovy scriptlets, which give us the expressiveness to create Alerts such as “alert if a metric’s average value over the past 5 minutes is greater than X,” or “alert if the variation of a set of metrics’ values across all servers of type Y is greater than Z” (see the evaluation sketch after this list).
  • JavaMail: a full-featured email framework. It supports the SSL/TLS secure connection protocols that our clients require.
  • JAXB: a simple-to-use Java to XML API, JAXB allows us to convert XML into Java objects (and vice-versa). We use JAXB for parsing configuration files and persisting objects into HSQLDB.
  • Spring: a framework for developing enterprise Java applications, Spring is the foundation for our monitoring server.
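
To illustrate the JMX instrumentation mentioned in the list above, here is a minimal, self-contained example of exposing a counter through an MBean and registering it with the platform MBean server. The RequestStats names are made up for the example, but the javax.management calls are the standard API.

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // The management interface that JMX introspects; by convention it must be
    // named <ClassName>MBean for a standard MBean.
    interface RequestStatsMBean {
        long getRequestCount();
    }

    public class RequestStats implements RequestStatsMBean {

        private volatile long requestCount;

        public void recordRequest() {
            requestCount++;
        }

        @Override
        public long getRequestCount() {
            return requestCount;
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // The ObjectName is how JMX clients (including a monitoring
            // backend) address this bean.
            ObjectName name = new ObjectName("com.example:type=RequestStats");
            server.registerMBean(new RequestStats(), name);
            System.out.println("RequestStats MBean registered as " + name);
        }
    }

Once registered, the RequestCount attribute shows up in any JMX client (jconsole, for example), which is exactly the hook a JMX-based collection backend polls.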
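
The rrd4j entry above mentions fixed-size, multi-resolution files; here is a rough sketch of defining and updating a round-robin database with rrd4j. It assumes a 2.x/3.x-style API (exact constructors vary between versions), and the file path, data source, and archive settings are purely illustrative.

    import org.rrd4j.ConsolFun;
    import org.rrd4j.DsType;
    import org.rrd4j.core.RrdDb;
    import org.rrd4j.core.RrdDef;
    import org.rrd4j.core.Sample;

    public class RrdSketch {
        public static void main(String[] args) throws Exception {
            // One expected data point every 60 seconds.
            RrdDef def = new RrdDef("cpu.rrd", 60);
            // A GAUGE data source for instantaneous CPU usage (0-100),
            // with a 120-second heartbeat.
            def.addDatasource("cpu", DsType.GAUGE, 120, 0, 100);
            // Two archives at different resolutions: per-minute averages for a
            // day, hourly averages for roughly a year. This multi-resolution
            // layout is what keeps the file a fixed size.
            def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 1440);
            def.addArchive(ConsolFun.AVERAGE, 0.5, 60, 8760);

            RrdDb db = new RrdDb(def);
            try {
                Sample sample = db.createSample();
                sample.setTime(System.currentTimeMillis() / 1000);
                sample.setValue("cpu", 42.0);
                sample.update();
            } finally {
                db.close();
            }
        }
    }

Because the archives are allocated up front, the file never grows; older samples are consolidated into the coarser archive and eventually overwritten.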
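
The Groovy scriptlets described in the list above can be evaluated from Java; here is a minimal sketch using GroovyShell. The binding variable names (values, threshold) and the scriptlet itself are hypothetical, but the groovy.lang classes are the standard embedding mechanism.

    import java.util.Arrays;
    import java.util.List;

    import groovy.lang.Binding;
    import groovy.lang.GroovyShell;

    public class AlertScriptSketch {
        public static void main(String[] args) {
            // Recent samples for a metric, e.g. from the past 5 minutes.
            List<Double> values = Arrays.asList(72.0, 85.5, 91.0, 88.0, 95.5);

            // Expose data to the script through a binding; the scriptlet
            // below is the kind of expression a user would put in an Alert.
            Binding binding = new Binding();
            binding.setVariable("values", values);
            binding.setVariable("threshold", 80.0);

            GroovyShell shell = new GroovyShell(binding);
            Object triggered = shell.evaluate("values.sum() / values.size() > threshold");

            System.out.println("Alert triggered: " + triggered);
        }
    }

The same approach extends to the multi-metric examples above: bind a richer set of variables and let the scriptlet express whatever condition the user needs.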

We had never used a component framework before, and building an application with Spring’s Inversion of Control and Dependency Injection paradigms turned out to be a pleasant and educational experience. While it enforced discipline in using interfaces, it rewarded us with the ability to easily swap implementations of a component. For example, switching to an HSQLDB-based data store required only a single-line edit, and everything just worked. Seriously.
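
To illustrate the interface discipline and the swap described above, here is a small sketch with hypothetical names (MetricStore, AlertService); it is not our actual code. Our server was wired with Spring configuration files, but the sketch uses the Java-based configuration of newer Spring versions purely so it stays self-contained.

    import org.springframework.context.annotation.AnnotationConfigApplicationContext;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    // Hypothetical data-store seam: consumers depend only on this interface.
    interface MetricStore {
        String loadMetadata(String metricName);
    }

    // One implementation; an HSQLDB-backed store would implement the same
    // interface, and swapping between them is purely a configuration change.
    class InMemoryMetricStore implements MetricStore {
        @Override
        public String loadMetadata(String metricName) {
            return "metadata for " + metricName;
        }
    }

    class AlertService {
        private final MetricStore store;

        // Spring injects whichever MetricStore implementation is configured.
        AlertService(MetricStore store) {
            this.store = store;
        }

        String describe(String metricName) {
            return store.loadMetadata(metricName);
        }
    }

    @Configuration
    class MonitoringConfig {
        // Changing this one bean definition is the single-line swap described
        // above; AlertService is untouched.
        @Bean
        MetricStore metricStore() {
            return new InMemoryMetricStore();
        }

        @Bean
        AlertService alertService(MetricStore metricStore) {
            return new AlertService(metricStore);
        }
    }

    public class DiSketch {
        public static void main(String[] args) {
            AnnotationConfigApplicationContext ctx =
                    new AnnotationConfigApplicationContext(MonitoringConfig.class);
            System.out.println(ctx.getBean(AlertService.class).describe("cpu.usage"));
            ctx.close();
        }
    }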

We also leveraged Spring early in our development process: we pair-coded interfaces, created stub objects, and then wired everything up in Spring. Once our skeleton was in place, we independently worked on component implementations and swapped them in as they were completed. Later in the cycle, we used Spring in our unit tests to compose the application differently for specific tests, isolating the functionality under test and substituting dummy components elsewhere.

User Interface

By moving the user interface into the portlets, we were able to replace the fairly ugly native graphing capability that rrd4j provides with a more generic solution that looks good. For comparison, here’s an MRTG-style graph produced by rrd4j:

rrd4j sample graph

And here are some graphs from our Monitoring Server (note the portlet UI components for controlling the display of the graphs):

Graphs from Monitoring Server

While the difference is not that stark, our graphs are much easier on the eyes.

Monitoring Server: present and future

We recently released the monitoring system, and it’s already providing insights into our product’s behavior. We have more features planned: eventing, which will help us track system events such as a server restart or job completion; generating new time-series data from existing data (for example, a series of the rolling standard deviation of a metric, or the number of failure events in the past 24 hours); and Groovy scripting directly against the monitoring server. The last feature is particularly helpful when our engineering team can’t physically access a system due to security restrictions.

From an analysis perspective, we can now start to better understand our system’s behavior, which will help us identify problems before they occur and help steer our development energy going forward. Even the world’s best data analysis software needs a little analysis itself sometimes.