
Pipes: using unix pipelines for beautiful answers to quick and dirty questions

/loony/bin

As we approach a release at Palantir, we usually cut a stable branch that QA can start testing as a release candidate. Developers may continue bug fixing and testing on trunk, but we code review changes before committing them to the stable branch. As the release becomes imminent, we start asking questions like:

What changes are on trunk that are not in the stable branch?

We’re less concerned with what the changes are and more concerned with who owns the changes. What we really want to know is:

Do the changes on trunk represent pending changes that should be moved to stable or are they further development that shouldn’t be put into the stable branch for this release?

For the most part, the person who can answer that question is the coder who made the changes on trunk. To that end, what we would really love to have is a report of all files in trunk that differ from the stable branch, along with who last touched each file. There isn’t really an svn command that will do this succinctly, so I started thinking about how to accomplish it. I had an inkling that it could all be solved with a single Unix pipeline, and so I set out to craft such a beast. Here’s what I came up with in about ten minutes:

for name in `diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `; do
    author=`svn info $name | grep -E "Last Changed Author" | awk '{print $4}'`;
    echo $author    $name;
done | sort | sed 's/pgtrunk\/src\///' > difflist.txt

Which produces output that looks like this:

gbush com/palantir/foo/Bar.java
bclinton com/palantir/baz/Fargle.java

How did I come up with such a beast? I deconstruct this inscrutable wonder after the jump.

The first question that I’ll answer is: how do I know how to do this? I spend the vast majority of my days writing backend Java code for one of our enterprise products but it wasn’t always that way. In my last job before coming to Palantir, I was working as a senior systems administrator and my work email address was root@sourceforge.net. SourceForge.net is a complex site with a lot of Linux automation going on behind the scenes, and during the three years I was responsible for the infrastructure, I wrote a lot of sh scripts (which, of course, on Linux, is technically bash).

For those not familiar with Unix pipes, a quick overview is available here and the Wikipedia entry “Pipeline (Unix)” is also not a bad place to start.

So we start with this snippet:

`diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `

First note that this three-command pipeline is enclosed in backticks (that key that’s usually below the escape key on your keyboard that also has a ~ on it). In shell programming, this means, “execute this command in a subshell and substitute the subshell’s output here.”
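As a small illustration of command substitution, the sketch below captures a pipeline’s output in a variable twice: once with backticks and once with the equivalent POSIX $(...) form, which is easier to nest. The sample words and variable names are made up for the example.

```shell
# Backtick form, as used in the script above.
count_a=`printf 'one\ntwo\nthree\n' | grep -c 't'`

# Equivalent $(...) form: same behavior, but easier to nest and to read.
count_b=$(printf 'one\ntwo\nthree\n' | grep -c 't')

echo "backticks: $count_a, dollar-paren: $count_b"
```

Both variables end up holding the number of lines that contain the letter t.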

The first command is diff -r --brief --exclude=.svn pgstable/src pgtrunk/src. This is the command that actually does the diff. (Yes, diff will compute the differences between two directory trees.) It produces output that looks like this:

Files pgstable/src/com/palantir/foo/Bar.java and pgtrunk/src/com/palantir/foo/Bar.java differ
Files pgstable/src/com/palantir/baz/Fargle.java and pgtrunk/src/com/palantir/baz/Fargle.java differ
Only in pgtrunk/src/com/palantir/foo: NewFile.java

We then pipe this through awk, asking awk to only print the fourth field on the line, where fields are defined by the default delimiters of whitespace characters.

At this point, we would have output that looks like this:

pgtrunk/src/com/palantir/foo/Bar.java
pgtrunk/src/com/palantir/baz/Fargle.java
NewFile.java

We pipe this through grep and keep only the lines that match pgtrunk to filter out the new file case. We’re left with:

pgtrunk/src/com/palantir/foo/Bar.java
pgtrunk/src/com/palantir/baz/Fargle.java
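The two filtering stages can be tried without a checkout by feeding them sample text. The sketch below assumes diff output modeled on the example above, stored in a hypothetical variable named sample.

```shell
# Sample `diff -r --brief` output, modeled on the example above.
sample='Files pgstable/src/com/palantir/foo/Bar.java and pgtrunk/src/com/palantir/foo/Bar.java differ
Files pgstable/src/com/palantir/baz/Fargle.java and pgtrunk/src/com/palantir/baz/Fargle.java differ
Only in pgtrunk/src/com/palantir/foo: NewFile.java'

# awk splits each line on whitespace and prints the fourth field;
# grep then keeps only the fields that mention pgtrunk.
changed=$(printf '%s\n' "$sample" | awk '{print $4}' | grep pgtrunk)

printf '%s\n' "$changed"
```

The bare NewFile.java field does not contain pgtrunk, so grep drops it.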

You’ll note a caveat for would-be cut-and-pasters: we’re ignoring the new file case. Any file that exists in trunk but not in stable is not going to show up here. This is one place where this quick script is not comprehensive, but it was sufficient for our needs at the time, so I didn’t jump through the hoops to deal with that case.
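For readers who do need the new file case, one possible approach (not what the original script does) is to let awk handle both line shapes and rebuild the full path from the “Only in” lines. This is a sketch under assumptions: it only looks at entries under pgtrunk, and it does not check whether an “Only in” entry is a file or a whole directory.

```shell
# Sample `diff -r --brief` output, modeled on the example above.
sample='Files pgstable/src/com/palantir/foo/Bar.java and pgtrunk/src/com/palantir/foo/Bar.java differ
Files pgstable/src/com/palantir/baz/Fargle.java and pgtrunk/src/com/palantir/baz/Fargle.java differ
Only in pgtrunk/src/com/palantir/foo: NewFile.java'

paths=$(printf '%s\n' "$sample" | awk '
  # "Files A and B differ": the trunk path is the fourth field.
  $1 == "Files" { print $4 }
  # "Only in DIR: NAME": rebuild DIR/NAME, but only for the trunk tree.
  $1 == "Only" && $3 ~ /^pgtrunk\// {
    dir = $3
    sub(/:$/, "", dir)   # drop the trailing colon after the directory
    print dir "/" $4
  }')

printf '%s\n' "$paths"
```

Lines that start with “Only in pgstable” (files removed on trunk) are deliberately skipped here, since there is no trunk path to ask svn about.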

So let’s expand our focus a bit to this snippet:

for name in `diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `; do
...
done

You can see that the output of that first pipeline is substituted into a looping construct. The for name in wordlist; do ... done construct lets you loop over a list of words delimited by whitespace. In this case, it’s the line-oriented output of the first pipeline, but it could also be a typed list of words. The shell assigns each word in wordlist, in turn, to the shell variable $name and then executes the list of commands between the keywords do and done.
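To make the word splitting concrete, here is a minimal sketch with a made-up word list. Note that spaces and newlines both separate words, which also means a path containing a space would be split into two loop iterations.

```shell
seen=''
# The substituted output is split on any whitespace, spaces and newlines alike.
for name in $(printf 'alpha beta\ngamma\n'); do
    # Each pass assigns the next word to $name.
    seen="$seen[$name]"
done

echo "$seen"
```

The loop runs three times, once each for alpha, beta, and gamma.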

The inner portion of the loop looks like this:

author=`svn info $name | grep -E "Last Changed Author" | awk '{print $4}'`;
echo $author    $name;

The first line sets the shell variable $author. The three-command pipeline parses the output of svn info down to a single value, and backtick substitution puts that value into the variable. The output of svn info for a particular path looks like this:

Path: src/com/palantir/foo/Bar.java
URL: svn://svn/Trunk/
Revision: 14860
Last Changed Author: gbush
Last Changed Rev: 14860
Last Changed Date: 2006-10-10 00:39:53 -0700 (Tue, 10 Oct 2006)

So the pipeline is pulling out the username of the last committer on trunk for the path in $name and placing the value into $author.
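The parsing step can be checked without a Subversion checkout by running it over the sample text above. The sketch below does that and also shows a single-awk variant as one possible alternative; the variable names info and author_alt are made up for the example.

```shell
# Sample `svn info` output, taken from the listing above (no svn needed).
info='Path: src/com/palantir/foo/Bar.java
URL: svn://svn/Trunk/
Revision: 14860
Last Changed Author: gbush
Last Changed Rev: 14860
Last Changed Date: 2006-10-10 00:39:53 -0700 (Tue, 10 Oct 2006)'

# Same two stages as the script: select the line, then print its fourth field.
author=$(printf '%s\n' "$info" | grep "Last Changed Author" | awk '{print $4}')

# One-process variant: split on ": " and match the key exactly.
author_alt=$(printf '%s\n' "$info" | awk -F': ' '$1 == "Last Changed Author" {print $2}')

echo "$author $author_alt"
```

Both forms yield the same username for this sample.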

Finally, we echo out that information on a single line, author first, path second, like this:

gbush pgtrunk/src/com/palantir/foo/Bar.java

And finally, the whole shebang is run through this command:

sort | sed 's/pgtrunk\/src\///'

sort will sort the output. Since we put the usernames first on each line, this clusters all changes by username, giving each developer an easy-to-consult section in the email that gets sent out. The sed command does a regular expression search-and-replace that strips the leading part of the path, leaving just the relative path (to make the report easier to read). Note that the backslashes in the pattern are there to escape the / characters in the path, since / is also the delimiter of the substitution expression. In plain English, sed 's/pgtrunk\/src\///' reads: replace the first occurrence of pgtrunk/src/ on every line with nothing.
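The final stage can be tried on a few sample lines. The sketch below uses made-up input (the Quux.java entry is hypothetical) and also shows that sed accepts a different delimiter such as |, which avoids the backslash escaping altogether.

```shell
# Made-up "author path" lines, in the unsorted order the loop might emit them.
lines='gbush pgtrunk/src/com/palantir/foo/Bar.java
bclinton pgtrunk/src/com/palantir/baz/Fargle.java
gbush pgtrunk/src/com/palantir/baz/Quux.java'

# Original form: slashes in the pattern are escaped with backslashes.
escaped=$(printf '%s\n' "$lines" | sort | sed 's/pgtrunk\/src\///')

# Same substitution with | as the delimiter, so no escaping is needed.
piped=$(printf '%s\n' "$lines" | sort | sed 's|pgtrunk/src/||')

printf '%s\n' "$piped"
```

After sorting, both gbush lines sit together below the bclinton line, and each path is relative to the source root.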

Finally, > difflist.txt directs all output from the script into a file named difflist.txt.

I then used this to compose an email to the team, and soon stable and trunk were as in sync as they ever were going to be. And thus ends another exciting game of Clusenix.

Dr. Fun Clusenix Comic