At Palantir, we write software that gets deployed at each client, integrated across their sensitive data sets, and maintained and administered by that client’s in-house admins. Most deployed enterprise software is run on a single beefy box: consider wikis, blogging systems, bug tracking systems, or practically any client/server or web client software used today. On the other hand, most enterprise software that runs as a distributed system is hosted: Salesforce.com, Google Apps, or any approach that sells software as a service. What’s fairly unusual about our software is that it’s deployed as a distributed system at each client.
Distributed systems are hard to build and hard to maintain. As long as that distributed system is built and maintained in-house, however, you have a number of advantages:
- The administrators are full-time product experts who are focused on the mission of keeping your system available and responsive.
- The development organization can build internal tools for the administrators that only have to be “good enough” and can step in if necessary.
- It’s easy to get feedback on how the system performs, because there are no sensitivity, privacy, or legal constraints.
- A single, large deployment allows you to optimize your hardware purchasing and amortize installation headaches across a large number of machines.
This is all great, of course, and if you can host and maintain your distributed system yourself, I’d highly recommend it. Sometimes, however, it’s just not possible. At Palantir, the client data we work with is so sensitive that even we cannot see it, except under very strictly controlled circumstances. It’s also so large that the bandwidth limitations of pushing it into a system hosted by us would be prohibitive.
So suppose that you have to deploy your distributed system in a customer datacenter with external parties maintaining the system. What do you need to consider? In this post, I’ll go into a number of key points that we have faced and addressed at Palantir.
Understand Your Administrators
Assume that your administrators are part-time, not product experts, and constantly distracted by their other responsibilities. They aren’t even experts in the technologies your system is based on: for example, they don’t really know much about databases and they are more comfortable with Windows than with Linux. Even if these assumptions aren’t all true in any particular case, there will be administrators who meet each of these assumptions.
Design For Manageability
This means building powerful management tools for your system that are web-based and also scriptable. Remember that your administrators are part-time, so usability is important: by the time your administrator touches the Foobar Configuration Widget the second time, he’s forgotten everything he learned a month ago when he did it the first time. You also want to build management tools that go all the way down the stack: using low-level tools for occasional jobs leads to mistakes, because those low-level tools tend to be far more powerful than necessary for your system.
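To make “scriptable” concrete, here’s a minimal sketch of the pattern: every operation the web console exposes is also a subcommand of a command-line tool, so routine jobs can be automated instead of clicked through. The `ServerManager` class and its subcommands are hypothetical stand-ins, not a real Palantir API.

```python
#!/usr/bin/env python3
"""Sketch of a scriptable management CLI mirroring a web console."""
import argparse

class ServerManager:
    """Hypothetical stand-in for the system's management API."""
    def status(self):
        return {"ingest": "running", "search": "running"}

    def restart(self, service):
        return f"restarted {service}"

def main(argv=None):
    parser = argparse.ArgumentParser(prog="admin")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("status", help="show the state of every service")
    restart = sub.add_parser("restart", help="restart a single service")
    restart.add_argument("service")

    args = parser.parse_args(argv)
    manager = ServerManager()
    if args.command == "status":
        for service, state in manager.status().items():
            print(f"{service}: {state}")
    elif args.command == "restart":
        print(manager.restart(args.service))

if __name__ == "__main__":
    main()
```

Because the CLI and the web console sit on the same management API, an administrator who forgets the web workflow can still crib from a saved script, and vice versa.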
Design In Monitoring And Notification
Visibility is one of the biggest reasons people want control – but unnecessary control leads to mistakes. Your administrator shouldn’t have to go to the command line and run top just to figure out whether the system is overloaded. Each metric that is being monitored needs to have historical data so that your system can distinguish baseline behavior from anomalies. Each metric displayed to an administrator needs to be displayed with context, whether that’s the mean and standard deviation of the metric, or whether it’s similar metrics on other servers. Anomalous behavior should trigger human action through a notification.
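The baseline-versus-anomaly comparison can be sketched in a few lines, assuming metrics arrive as numeric samples: compute the historical mean and standard deviation, and flag the latest reading only when it falls well outside that baseline. The three-sigma threshold here is an illustrative default, not a recommendation for every metric.

```python
"""Flag a metric reading that deviates from its historical baseline."""
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` is more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > threshold

# A load metric hovering around 0.30 suddenly spikes:
history = [0.28, 0.31, 0.30, 0.29, 0.32, 0.30]
print(is_anomalous(history, 0.95))  # True: far outside the baseline
print(is_anomalous(history, 0.31))  # False: ordinary fluctuation
```

The same numbers that drive the alert (mean, deviation) are exactly the context worth displaying next to the metric itself.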
Notifications also need to be carefully designed (as well as extensible by the administrator). Carefully distinguish actionable items from non-actionable items, and try to reduce ambiguity as to what action is required. It’s similar to error logging: if you let standard system events pollute your error logs, the administrator will soon stop paying attention to them. Although you may not be able to send monitoring information directly back to the development team, you may also want to prepare reports of what’s gone wrong to collect every so often; just make sure that these reports are human-readable so that they can be vetted to make sure they don’t leak any sensitive information.
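One way to keep routine events from polluting the alert channel is to route every event by whether it demands action. The event names and route targets below are illustrative assumptions, just to show the separation:

```python
"""Sketch of routing events so only actionable ones interrupt an admin."""

# Illustrative event classes, not a real taxonomy:
ACTIONABLE = {"disk_full", "service_down", "certificate_expiring"}
INFORMATIONAL = {"backup_complete", "user_login", "index_rebuilt"}

def route(event):
    """Decide where an event goes: alerts demand action, the log does not."""
    if event in ACTIONABLE:
        return "alert"   # notify the administrator with a suggested action
    if event in INFORMATIONAL:
        return "log"     # recorded, but never interrupts anyone
    return "review"      # unknown events get triaged by a human

print(route("service_down"))     # alert
print(route("backup_complete"))  # log
```

The “review” bucket matters: an event you haven’t classified yet shouldn’t silently land in either channel, or it will erode trust in both.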
Design For Autonomy
Where possible, design the system to handle error conditions that can be fixed automatically. The best kind of failure is one that requires no admin intervention. If you can automatically extend your tablespace when it runs out of allocated space, do it. But be sure to give sufficient warning if a long lead-time action is going to be required (like ordering and installing an additional disk). You won’t be able to figure out everything that can go wrong ahead of time, but you can iterate to drive down the number of events for which human intervention is required.
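The tablespace example can be sketched as a check that fixes what it can and escalates only what it can’t, assuming we can query free space and request an extension programmatically (the thresholds and callbacks here are hypothetical):

```python
"""Fix the automatable case; warn early about the long lead-time case."""

EXTEND_THRESHOLD = 0.10   # extend the tablespace below 10% free
WARN_THRESHOLD = 0.25     # warn the admin below 25% free on the volume

def check_storage(tablespace_free, volume_free, extend_fn, notify_fn):
    """Handle what can be automated first; escalate only what needs a human."""
    if tablespace_free < EXTEND_THRESHOLD and volume_free > EXTEND_THRESHOLD:
        extend_fn()  # no admin intervention required
    if volume_free < WARN_THRESHOLD:
        # Ordering and installing a disk takes weeks; warn long before it's urgent.
        notify_fn(f"Volume {volume_free:.0%} free: plan additional disk capacity")
```

Note that the two branches are independent: the system can quietly extend the tablespace today while still warning that the underlying volume needs attention next month.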
In future posts, we plan to drill down on each of these challenges and look at what approaches and technologies worked for us.