Distributed systems are complex. Getting them right is hard, and when things don’t go right, it can be difficult to understand what went wrong. In an environment like ours, a good monitoring system isn’t just nice to have; it’s a critical component necessary for understanding behavior and diagnosing problems.
We had three primary goals for the initial monitoring system: graphing of time-series data, alerting on event triggers, and notifications to users. Furthermore, as a product company, we had a design goal of a simple, intuitive (yet powerful and flexible) solution.
Before starting, we did a quick survey of existing open-source packages. Unfortunately, nothing we found quite fit our needs, given our specific requirements around security, protocols, licensing, and integrability into our product. So we decided to forge ahead and build our own; we try not to reinvent the wheel, but it seemed justified here.
For an in-depth look at the architecture of the Monitoring Server and components we used to build it, read on…
At the highest level, a two-tiered architecture made the most sense. The back-end, standalone server component would be responsible for collecting, processing, and exposing data through an API. The front-end component would be web-based portlets integrated into our existing management interface.
The server architecture was designed to allow generic components to work together, with everything connected up via Spring. While we started with JMX as our collection method for monitoring data, the architecture sees this as just one pluggable component, with multiple data backends supported. A Spring webservices API allows the front-end portlets to query and manipulate the components at each level.
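To make the pluggable-collector idea concrete, here is a minimal sketch of what such a component boundary might look like. The names (`MetricCollector`, `JmxCollector`, the metric key) are illustrative, not the actual interfaces from our server, and a real JMX backend would read attributes from MBeans over an `MBeanServerConnection` rather than sampling the local JVM:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical collection-backend interface: each backend knows how
// to produce a snapshot of named metric values.
interface MetricCollector {
    Map<String, Double> collect();
}

// A stand-in for a JMX-style backend. For illustration it samples the
// local JVM's heap usage instead of querying remote MBeans.
class JmxCollector implements MetricCollector {
    @Override
    public Map<String, Double> collect() {
        Map<String, Double> samples = new HashMap<>();
        Runtime rt = Runtime.getRuntime();
        samples.put("jvm.heap.used",
                (double) (rt.totalMemory() - rt.freeMemory()));
        return samples;
    }
}
```

Because the server only depends on the interface, adding another backend means writing one class and wiring it in, with no changes to the processing or API layers.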
Our first release ships only the JMX backend, so this is what the production architecture looks like for now:
Any time you choose build instead of buy, there’s a lot of work to be done to get the full set of functionality you need. Fortunately, the Java platform has an extremely rich set of freely available projects and libraries, and we leveraged many of them for the back-end:
We had never used a component framework before, and building an application with Spring's Inversion of Control and Dependency Injection paradigms turned out to be a pleasant and educational experience. While it enforced discipline in using interfaces, it rewarded us with the ability to easily swap implementations of a component. For example, switching to an HSQLDB-based data store required only a single-line edit, and everything just worked. Seriously.
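For readers unfamiliar with Spring wiring, the kind of single-line swap described above looks roughly like this. The bean ids and class names here are hypothetical, not our actual configuration:

```xml
<!-- Hypothetical Spring wiring. Swapping the data store is a matter
     of changing one class attribute: -->
<bean id="dataStore" class="com.example.monitoring.RrdDataStore"/>
<!-- becomes -->
<bean id="dataStore" class="com.example.monitoring.HsqldbDataStore"/>

<!-- The server depends only on the dataStore interface, so nothing
     else in the configuration needs to change. -->
<bean id="monitoringServer" class="com.example.monitoring.MonitoringServer">
    <property name="dataStore" ref="dataStore"/>
</bean>
```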
We also leveraged Spring early in our development process: we pair-coded interfaces, created stub objects, and then wired everything up in Spring. Once our skeleton was in place, we independently worked on component implementations and swapped them in as they were completed. Later in the cycle, we used Spring in our unit tests to compose our application differently for specific tests, isolating important functionality and using dummy components for non-relevant areas.
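A test context that composes the application differently might look like the following sketch; every id and class name here is illustrative. The idea is to reuse real components where they matter and substitute dummies elsewhere, such as a no-op notifier so tests never send real notifications:

```xml
<!-- Hypothetical test wiring: a real collector and alert engine,
     with a dummy notifier swapped in for isolation. -->
<bean id="collector" class="com.example.monitoring.JmxCollector"/>
<bean id="notifier" class="com.example.monitoring.test.DummyNotifier"/>
<bean id="alertEngine" class="com.example.monitoring.AlertEngine">
    <property name="collector" ref="collector"/>
    <property name="notifier" ref="notifier"/>
</bean>
```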
By moving the user interface into the portlets, we were able to replace the fairly ugly native graphs that rrd4j produces with a more generic solution that looks good. For comparison, here's an MRTG-style graph produced by rrd4j:
And here are some graphs from our Monitoring Server (note the portlet UI components for controlling display of the graphs):
The difference may seem minor, but our graphs are much easier on the eyes.
We recently released the monitoring system, and it's already providing insights into our product's behavior. We have more features planned: eventing, which will help us track system events such as a server restart or job completion; generating new time-series data from existing data (for example, a series of the rolling standard deviation of a metric, or the number of failure events in the past 24 hours); and Groovy scripting directly against the monitoring server. The last feature is particularly helpful when our engineering team can't physically access a system due to security restrictions.
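As a taste of the derived-series idea, here is a small sketch of one such transform: the rolling standard deviation of the last `window` samples. The class and method names are hypothetical, not part of our server:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative derived-series transform: feed samples in one at a
// time, get back the standard deviation over a sliding window.
class RollingStdDev {
    private final int window;
    private final Deque<Double> samples = new ArrayDeque<>();

    RollingStdDev(int window) {
        this.window = window;
    }

    // Add one sample and return the std dev of the current window.
    double add(double value) {
        samples.addLast(value);
        if (samples.size() > window) {
            samples.removeFirst();
        }
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= samples.size();
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        var /= samples.size(); // population variance over the window
        return Math.sqrt(var);
    }
}
```

Applied to an existing series, a transform like this yields a new time series that can be graphed and alerted on just like any raw metric.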
From an analysis perspective, we can now start to better understand our system’s behavior, which will help us identify problems before they occur and help steer our development energy going forward. Even the world’s best data analysis software needs a little analysis itself sometimes.