Server-Side Quality Engineering: Exploring Software From the Inside Out

When looking for a job after graduation, many computer science or engineering majors assume that software development is their only career path. At least that’s how I felt at first. After all, my curriculum was dominated by courses on one or another aspect of software development. But when considering my career options, I realized I found coding interesting because I could use it to manipulate my computer, not because I found coding fascinating in and of itself. For example, I’d rather write code to organize the files in my music collection than write code to create a new filesystem. I routinely wrote tools to fetch and compile some cool new open-source project, but I was never especially interested in adding more features to those projects. When my network router broke at home, I enjoyed learning tools like ping, netstat, and tcpdump, but I didn’t really want to extend the core functionality of those utilities.

The things that really interest me tend to be smaller in scope than full development projects. Once I’ve explored a tool or piece of code to solve a problem, I’m pretty much ready to move on to the next challenge. I’m like an explorer, always searching for something new to learn to satisfy an innate curiosity.

Server-Side Quality Engineers (SSQEs) at Palantir are explorers for these same reasons. We’re interested in exploring things like distributed systems, Linux servers, and databases. We want to learn how they work so that we can configure, stress, and test the software that runs on them.

For example,

  • Knowing how Palantir Gotham deals with data-scale (sharding) and user-scale (mirroring) allows us to configure clusters that resemble customer deployments.
  • Knowing how Linux behaves during entropy starvation (and being able to find the cause) allows us to efficiently use our server hardware.
  • Knowing how Oracle handles various SQL statements allows us to spot slow database performance.
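
To make the entropy example concrete: the Linux kernel exposes its estimate of available entropy at /proc/sys/kernel/random/entropy_avail. The sketch below reads that value and flags a low reading. The 200-bit threshold and the function names are assumptions chosen for illustration, not Palantir tooling, and newer kernels may report a constant value here, so treat this as a starting point for exploration only.

```python
from pathlib import Path
from typing import Optional

ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")
LOW_ENTROPY_BITS = 200  # assumed threshold, chosen only for illustration


def parse_entropy(text: str) -> int:
    """Parse the integer bit count the kernel writes to entropy_avail."""
    return int(text.strip())


def is_starved(bits: int, threshold: int = LOW_ENTROPY_BITS) -> bool:
    """Return True when the reported entropy falls below the threshold."""
    return bits < threshold


def read_entropy(path: Path = ENTROPY_PATH) -> Optional[int]:
    """Return the kernel's entropy estimate, or None if it cannot be read."""
    try:
        return parse_entropy(path.read_text())
    except OSError:
        # Not a Linux host, or the file is not readable in this environment.
        return None


if __name__ == "__main__":
    bits = read_entropy()
    if bits is None:
        print("entropy_avail is not readable on this system")
    elif is_starved(bits):
        print(f"low entropy: {bits} bits available")
    else:
        print(f"entropy looks fine: {bits} bits available")
```

Finding the cause of a low reading (for example, a process that repeatedly opens blocking random sources) still takes the kind of manual investigation described above.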

If my experience resonates with you at all, read on to learn about the role of SSQEs during various stages of the Palantir software development cycle.

Feature Vetting and System Architecture Review

Before each iteration of the Palantir software development cycle, we spend a week with our Software Engineer counterparts to understand what they are building. We want to know what features they will be implementing, how they plan to implement those features, and which customers those features are targeting. Developers should know the answers to questions like “how many users does this feature support simultaneously?”, “does it interact with any existing features?”, “what if the system crashes while this operation is still running?”, “is this operation idempotent if someone runs it twice?”, and so on. We will also talk to our Forward Deployed Engineer counterparts (also known as BD or Business Development) to make sure the use cases we envisioned match real customer use cases. This is also a good time to discuss customer hardware configurations, special data characteristics, and other deployment-unique cases.
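
The idempotency question can be turned into a quick check during planning or early testing. The sketch below is a minimal illustration, assuming the operation mutates a plain dictionary of state; `is_idempotent`, `set_owner`, and `append_audit_entry` are hypothetical names and do not come from any Palantir product.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def is_idempotent(operation: Callable[[State], None], initial: State) -> bool:
    """Compare the state after one run with the state after two runs."""
    once = copy.deepcopy(initial)
    operation(once)
    twice = copy.deepcopy(once)
    operation(twice)
    return once == twice


def set_owner(state: State) -> None:
    """Hypothetical operation: assigning a fixed value is safe to repeat."""
    state["owner"] = "alice"


def append_audit_entry(state: State) -> None:
    """Hypothetical operation: appending grows the state on every run."""
    state.setdefault("audit", []).append("touched")


if __name__ == "__main__":
    print(is_idempotent(set_owner, {}))
    print(is_idempotent(append_audit_entry, {}))
```

A real feature would need the same comparison made against database rows or server state rather than an in-memory dictionary, but the shape of the question is the same.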

When necessary, we ask how a particular piece of system architecture, protocol, or algorithm works to ensure that we understand how to test and manipulate it. We also try to discover non-obvious corner cases early in the cycle. Since the best idea always wins at Palantir, planning week is the ideal time to ask as many questions as we need to about a feature or a system. Making changes at this point is MUCH cheaper than later in the cycle.

New Feature Testing

Explorers of old filled their notebooks with detailed drawings of mountains, rivers, plants, and animals. Like those explorers, we record our new-feature discoveries in our version of those notebooks (test plans) so that others can follow in our testing footsteps. Test plans initially contain information about how a feature works, for whom it was built, and other detailed notes from planning week.

As new feature development progresses, developers hand off their code in discrete milestones. We test each milestone and further refine the test plan. For example, in the test plan, Quality Engineers describe how to set up a feature, propose reasonable data sets, write testing instructions, and record expected results. Developer milestone hand-offs and testing of the milestones continue in this passing-of-the-ball fashion until the end of the new feature period.

System Debugging

During new feature testing, we expect the Palantir Gotham application to be unstable as many developers check in code simultaneously. One of our duties as Quality Engineers is to report any unexpected behaviors as bug reports so that developers can fix them. Quality Engineers at some other companies call their job done at this point. Palantir Quality Engineers take the extra steps to diagnose whether an issue comes from the system (Linux settings, Oracle configurations, network issues, CPU, RAM, IO contention, etc.) or from the product itself. This extra bit of effort improves the quality of the bug report and, as the bugs are resolved, of the product.
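
One simple way to start separating system causes from product causes is to capture a snapshot of host health alongside the bug report. The following is a rough sketch, assuming a Linux host with /proc/meminfo available; the function names and the choice of fields are illustrative assumptions, not a description of Palantir's actual tooling.

```python
import os
import shutil
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style 'Key: value [kB]' lines into integers."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def read_mem_available_kb(path: str = "/proc/meminfo") -> Optional[int]:
    """Return MemAvailable in kB, or None if it cannot be determined."""
    try:
        with open(path) as handle:
            return parse_meminfo(handle.read()).get("MemAvailable")
    except OSError:
        return None


def system_snapshot() -> Dict[str, object]:
    """Collect a few host-level numbers to attach to a bug report."""
    snapshot: Dict[str, object] = {}
    try:
        snapshot["load_avg"] = os.getloadavg()  # 1, 5, and 15 minute averages
    except (AttributeError, OSError):
        snapshot["load_avg"] = None  # not available on this platform
    snapshot["root_disk_free_bytes"] = shutil.disk_usage("/").free
    snapshot["mem_available_kb"] = read_mem_available_kb()
    return snapshot


if __name__ == "__main__":
    for name, value in system_snapshot().items():
        print(f"{name}: {value}")
```

If the load average is high or available memory is near zero when a failure occurs, the host deserves a closer look before the bug is attributed to the product.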

But how do SSQEs know what to do to debug these things? Well, most SSQEs needed to do this type of debugging in the early days of Linux in order to even use the operating system, so they’ve gained tons of experience doing it. Since hardly anyone else was using Linux back then, the early pioneers frequently found problems that no one else had ever come across. Today, using Linux is easier than ever before. The downside is that, unless you’re naturally really curious, you probably haven’t been exposed to this type of debugging, precisely because using Linux is so easy. Explorer tip: If you want to bulk up your Linux knowledge to give yourself an edge during the interview process, read ‘How Linux Works: What Every Superuser Should Know’ by Brian Ward.

Exploring the Server Debug Challenge

If the above sounds interesting or even exciting to you, let’s talk about how you can join the SSQE explorers. We begin the process by giving you an opportunity to explore a real system. This is the Server Debug Challenge. While this challenge is optional, it presents a realistic system debugging scenario that gives you a chance to show off your exploration skills. It’s also a way to set your application apart from the many that we receive. If your report details the steps you took to correctly find the problem, we definitely want to talk to you. Whatever the outcome, we’re pretty sure you’ll have fun playing with it.