At Palantir, we work in Silicon Valley, read High Scalability, and think of web companies like Facebook and Google as our peers. Most of the time, this is exactly the right recipe for bringing disruptive innovation into the intelligence community. Sometimes, though, it’s misleading: when discussing a design decision, it’s received knowledge that “disk is cheap” or “CPU is cheap.” For a web company with a deployment in a commercial data center (or its own data center), this received knowledge is correct. But for a company that ships distributed systems instead of hosting them, and whose deployment environment is the kind of locked-down server room in which classified data can reside, these assumptions couldn’t be more wrong.
At Palantir, we are almost never able to host our customers’ data; because the data is so sensitive, we are typically not even allowed to see it! It has to reside in a Sensitive Compartmented Information Facility, or SCIF: a building constructed to resist attempts to access the information within, whether through active or passive measures. The network inside a SCIF is physically separated (“airgapped”) from the public Internet. Because the entire rationale for such facilities is to prevent information leakage, moving information into or out of one is a tightly regulated process, almost always requiring a human in the loop.
Bandwidth is narrow
Bandwidth in and out of a data center is cheap. Bandwidth in and out of a SCIF is not, and this manifests in surprising ways. What does it take to get data into a SCIF? First, the data has to be downloaded from wherever it’s hosted and burned to a CD. Then, someone has to carry it into the SCIF and find a security officer to approve adding it to the network. Finding the security officer can take anywhere from 10 minutes to an entire day. Once found, the security officer has to run a virus scan on the CD, which takes roughly 20 minutes per 100MB.
If you look at the entire process, you can model our connection into the SCIF as averaging about an 8-hour latency and 640 Kbps bandwidth. That’s about the bandwidth of a slow DSL line and the latency of a radio connection to Pluto. (Actually, it’s somewhat slower.) There’s also a big non-linearity at 700MB, which is the amount of data that fits on a single CD: anything larger needs a second disc. This non-linearity is the big reason why we prefer to send patches to our customers rather than full distributions, which are slightly less than a gigabyte including dependencies, and thus why it’s worth it to us to build a system for automating patch application rather than simply replacing jar files by hand.
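To make the latency-plus-bandwidth model concrete, here is a minimal sketch in Python. The 8-hour latency, 640 Kbps bandwidth, and 700MB CD capacity come from the figures above. The function names and the two payload sizes (a 50MB patch and a 950MB full distribution) are assumptions chosen for illustration, and the model ignores any extra handling time a second CD might add.

```python
import math

# Figures quoted in the text.
LATENCY_HOURS = 8        # average end-to-end delay before data is usable
BANDWIDTH_KBPS = 640     # modeled throughput of the whole process
CD_CAPACITY_MB = 700     # capacity of a single CD

def transfer_hours(size_mb):
    """Estimated hours to deliver size_mb megabytes: latency + size / bandwidth."""
    kilobits = size_mb * 8 * 1000          # decimal units: 1 MB = 8000 kilobits
    return LATENCY_HOURS + kilobits / BANDWIDTH_KBPS / 3600

def cds_needed(size_mb):
    """Number of CDs required; the step from 1 to 2 is the non-linearity at 700MB."""
    return max(1, math.ceil(size_mb / CD_CAPACITY_MB))

# Hypothetical payload sizes, chosen only for illustration.
for label, size_mb in [("patch", 50), ("full distribution", 950)]:
    print(f"{label}: {size_mb} MB, {cds_needed(size_mb)} CD(s), "
          f"about {transfer_hours(size_mb):.1f} hours")
```

Under these assumptions, the fixed latency dominates for small payloads, while the full distribution spills onto a second CD and spends several additional hours in transfer.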
Disks are expensive
Similarly, if you are running a data warehouse, disk is cheap. You can buy a 1TB, 7200 RPM disk for about $100, which is perfect for the kind of large, serial reads or writes that a data warehousing workflow requires. However, Palantir uses disk for our database and our search engine, both of which have an OLTP-style usage pattern. As opposed to a data warehouse access pattern, which emphasizes full table scans, OLTP emphasizes random access and therefore requires fast disk. To get 1TB at 15k RPM costs about $1000, and requires a disk array rather than a single disk. In order to keep the disk fast, you also want to leave it only about 20% full. Ten times the price for one-fifth of the usable capacity makes fast disk about 50 times more expensive than slow disk overall. Most importantly, however, installing a disk array requires trained personnel, a special approval process, and reconfiguring the system to use the new disks, which is a fairly complicated and error-prone process.
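The 50x figure can be checked with a few lines of arithmetic. This is a rough sketch using the prices and the 20% fill target quoted above; the helper name is hypothetical, and it assumes the slow disk can be filled completely, which the comparison implies but does not state.

```python
def cost_per_usable_tb(price_per_tb, fill_fraction):
    """Dollars per terabyte actually available when only fill_fraction is used."""
    return price_per_tb / fill_fraction

SLOW_PRICE_PER_TB = 100    # 7200 RPM disk, from the text
FAST_PRICE_PER_TB = 1000   # 15k RPM disk array, from the text
FAST_FILL = 0.20           # fast disk kept about 20% full, from the text
SLOW_FILL = 1.0            # assumption: slow disk can be used at full capacity

slow = cost_per_usable_tb(SLOW_PRICE_PER_TB, SLOW_FILL)
fast = cost_per_usable_tb(FAST_PRICE_PER_TB, FAST_FILL)

print(f"slow disk: ${slow:,.0f} per usable TB")
print(f"fast disk: ${fast:,.0f} per usable TB")
print(f"ratio: about {fast / slow:.0f}x")
```

The sketch covers hardware cost only; the personnel, approval, and reconfiguration costs described above come on top of it.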
CPUs are hot
Finally, in a commercial data center, CPU is the cheapest resource of all. In a secure server room, however, it can be quite expensive. Each CPU or additional box requires more power and cooling. If the room is nearly full, adding that extra box may require building out an entirely new server room, which can take months and cost hundreds of thousands of dollars even in an ordinary office building. Building a server room in a SCIF is much more expensive and prohibitively time-consuming.
RAM to the rescue
On the other hand, some things in a SCIF are comparatively cheap. We never use boxes with less than 32GB of memory, and, in fact, lots of sites use 128GB of memory. RAM requires negligible power and cooling, and compared to disk, it’s relatively simple to install. It’s also easy to reconfigure the setup to use the additional memory.
The design guidelines that follow from this are simple: build a system that is as autonomous as possible and scales down as well as it scales out.
All these statistics are compiled from our day-to-day experiences in the office environment of a SCIF. Deploying to soldiers in the field makes the issues involved in deploying to a SCIF seem minor. Of course, that’s what makes what we do fun.