Friday, May 20. 2011

Nagios NSCA from Python

I've been working on improving the monitoring of the build slaves at Mozilla. As part of this project, I needed to be able to submit passive check results to the Nagios servers via NSCA during system startup. I'm doing this from a Python script that needs to run on a wide array of systems using whatever random Python is available. We run some oddball stuff, so the common denominator is Python 2.4.

It turns out that there's no Python NSCA library, although there is Net::Nsca in Perl. So, I wrote one, and put it on github: https://github.com/djmitche/pynsca.

At the moment, this only knows XOR, and only does service checks. That's all I need, but hopefully it can be easily expanded to cover other purposes. The one thing I want to avoid is adding mandatory requirements -- this should work, at least in plain-text and XOR modes, on a plain-vanilla Python installation.
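For the curious, the XOR step is small enough to sketch. As far as NSCA's XOR mode goes, the client XORs the packet first with the initialization vector the server sends at connect time, then with the shared password, cycling each key as needed. The function name and arguments below are made up for illustration and are not pynsca's API; the sketch is also written for a current Python rather than 2.4.

```python
from itertools import cycle


def xor_obfuscate(packet, iv, password):
    """XOR `packet` with `iv`, then with `password`, repeating each key.

    All three arguments are bytes. An empty key leaves the data unchanged
    for that pass.
    """
    out = packet
    for key in (iv, password):
        if key:
            # cycle() repeats the key so it covers the whole packet
            out = bytes(b ^ k for b, k in zip(out, cycle(key)))
    return out
```

XOR is its own inverse, so running the same function again with the same IV and password recovers the original packet.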

By the way, the startup script I'm working on is runslave.py, which includes a modified copy of pynsca and does a number of other housekeeping jobs as well. More on that in a subsequent post.

Saturday, January 22. 2011

Amanda's Transfer Mechanisms

There's been a bit of confusion on the mailing list and IRC about how Amanda assembles transfers out of transfer elements, and how transfer mechanisms influence that.

In the final form of a transfer, any two adjacent elements must have the same mechanism. For example, an upstream element speaking XFER_MECH_PUSH_BUFFER cannot talk to a downstream element using XFER_MECH_READFD (nor, more confusingly, XFER_MECH_PULL_BUFFER). So each mechanism is an isolated definition of "here's how upstream and downstream should talk". They come in pairs because generally anything upstream can do to downstream (e.g., upstream can write to downstream's fd) can occur in reverse (e.g., downstream can read from upstream's fd).

What makes this confusing is that if you specify a set of elements which can't talk directly to one another, then xfer.c will add "glue" elements between the specified elements. To make that concrete, imagine you specify a transfer as

source-holding --> filter-xor --> dest-fd
(if you like practical examples, then imagine filter-xor is a buffer-based decompression filter, and you're pulling data from holding disk, decompressing, and sending to a pipe -- something amfetchdump would do). Here are the mechanisms supported by each element:
source-holding:
 XFER_MECH_PULL_BUFFER (output)
filter-xor:
 XFER_MECH_PULL_BUFFER (input) and
 XFER_MECH_PULL_BUFFER (output)
or
 XFER_MECH_PUSH_BUFFER (input) and
 XFER_MECH_PUSH_BUFFER (output)
dest-fd:
 XFER_MECH_WRITEFD (input)

In putting these together, source-holding and filter-xor can use the same mechanism (PULL_BUFFER). This leaves filter-xor using PULL_BUFFER for output, but dest-fd does not support this. So xfer.c adds a glue element that can speak PULL_BUFFER on input and WRITEFD on output. This element basically loops in a thread, calling upstream->pull_buffer and write(downstream->input_fd, buffer). So the final xfer looks like

 source-holding --(PULL_BUFFER)--> filter-xor --(PULL_BUFFER)--> glue --(WRITEFD)--> dest-fd
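
The glue element's loop can be sketched in a few lines of Python. This only illustrates the idea and is not Amanda's C implementation: ListSource, the convention that pull_buffer returns None at end of data, and glue_pull_to_writefd are all hypothetical names.

```python
import os
import threading


class ListSource:
    """Stand-in upstream element with a pull-buffer style interface."""

    def __init__(self, buffers):
        self._buffers = list(buffers)

    def pull_buffer(self):
        # Return the next chunk, or None once the data is exhausted.
        return self._buffers.pop(0) if self._buffers else None


def glue_pull_to_writefd(upstream, fd):
    """Pull buffers from upstream and write each one to the downstream fd."""
    while True:
        buf = upstream.pull_buffer()
        if buf is None:
            break
        view = memoryview(buf)
        while view:
            # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
    os.close(fd)  # signals end-of-file to the reader


def run_demo():
    read_fd, write_fd = os.pipe()
    source = ListSource([b"hello ", b"world"])
    # The glue supplies the only thread; the source has no thread of its own.
    glue = threading.Thread(target=glue_pull_to_writefd,
                            args=(source, write_fd))
    glue.start()
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    glue.join()
    os.close(read_fd)
    return b"".join(chunks)
```

Here run_demo() plays the part of whatever sits on the other end of dest-fd's pipe, and returns the bytes that came through.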

Hopefully that helps to explain how the glue works.

Note that one of the cool things about this arrangement is that in most cases the complexity is in the glue, not the elements. In fact, in this case the glue provides the only thread that's required to run this transfer, so the other three element implementations don't need to manage threads at all.
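
To make the linking step concrete, here is a rough Python sketch that takes the mechanism table from the example above and decides where glue is needed. The search strategy (try every combination and keep the one with the fewest glue elements) is an assumption made for illustration; xfer.c may well choose differently. ELEMENTS and link are hypothetical names.

```python
from itertools import product

# (input mechanism, output mechanism) pairs each element supports.
# None means the element has no input (a source) or no output (a dest).
ELEMENTS = {
    "source-holding": [(None, "PULL_BUFFER")],
    "filter-xor": [("PULL_BUFFER", "PULL_BUFFER"),
                   ("PUSH_BUFFER", "PUSH_BUFFER")],
    "dest-fd": [("WRITEFD", None)],
}


def link(names):
    """Pick a mechanism pair per element, adding glue where neighbours differ."""
    best = None
    for combo in product(*(ELEMENTS[n] for n in names)):
        # Count boundaries where upstream output != downstream input.
        glue_count = sum(1 for up, down in zip(combo, combo[1:])
                         if up[1] != down[0])
        if best is None or glue_count < best[0]:
            best = (glue_count, combo)
    combo = best[1]
    parts = [names[0]]
    for i in range(1, len(names)):
        out_mech, in_mech = combo[i - 1][1], combo[i][0]
        if out_mech == in_mech:
            parts.append("--(%s)-->" % out_mech)
        else:
            parts.append("--(%s)--> glue --(%s)-->" % (out_mech, in_mech))
        parts.append(names[i])
    return " ".join(parts)
```

Calling link(["source-holding", "filter-xor", "dest-fd"]) reproduces the final transfer shown above, with a single glue element between filter-xor and dest-fd.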

Thursday, November 11. 2010

virtualenv for Perl

I absolutely love virtualenv for Python development. It allows me to develop Buildbot against several versions of Python and several versions of its dependencies, without modifying my system's Python installation at all!

Now, I need to do the same thing in Perl. So I thought I'd compare the two side-by-side.



Friday, July 16. 2010

IPv6 and Amanda

Amanda joined the IPv6 revolution in November 2006 - all of the BSD-style authentication mechanisms can support IPv6 endpoints. However, it's generally agreed that this was a mistake, and in this post I will talk about why that's the case.



Thursday, July 8. 2010

What's New in Amanda: The End of Fragmentation

Most of my posts in this series have been about features that are available in a released version of Amanda. This time, I want to share a project I'm working on right now - one that will be available in Amanda-3.2. I'm reworking the way Amanda writes its data to tape (or any other kind of storage) to make it more efficient, more reliable, and simpler to configure.

Historically, Amanda's conservative approach to finicky tape hardware has meant that it wasted some space at the end of each tape. With the changes I'm working on, Amanda will no longer waste this space, and can also avoid some needless copying of data in most cases, with a minimum of additional risk.



Thursday, July 1. 2010

What's New in Amanda: Hackability

It's been a while since I've posted about recent development in Amanda, but it's not for lack of interesting topics!

Today I want to talk a little bit about Amanda's development. Historically, Amanda has always had a small, core group of developers who do the lion's share of the development work. There are probably lots of reasons for this, not least of which is that a backup application isn't the sexiest project on which to spend your spare time. But I think there's a deeper reason, and it has to do with hackability.



Monday, April 12. 2010

Modern Multiprocessing

I've been thinking a lot lately about the way we accomplish multiprocessing. We've seen a significant change in the operation of Moore's law for CPU speeds: today's CPUs are about the same speed as those of a few years ago, but they have more cores, and more virtual processors on those cores. This is great for heavily-loaded servers, which have plenty of distinct tasks to place on those cores and VCPUs, but not so useful for users working with single-threaded applications.

Why are most applications still single-threaded? There are lots of good reasons. Threaded code is harder to write, and not just because it requires careful analysis and use of synchronization primitives: many common tasks are difficult to meaningfully parallelize without careful control over inter-thread communication, and in a portable application you don't have that kind of control. Threaded code generally performs badly on single-CPU systems, which are still common. Some popular languages still make threading difficult, at least in a portable fashion. And threads are still relatively heavyweight entities in most operating systems: you don't spawn ten threads to mergesort a 100-item array.

Some of these problems will go away with a little more time, but some will get worse. NUMA architectures can make sharing data between threads slow. Hyperthreading and its interaction with processor caches add yet another level of unpredictability.

We know how to build massively parallel systems that run massively parallel algorithms. What is still unknown is how to build portable, simple software that can run efficiently across a variety of architectures. This is a problem of practice, not theory, and there's lots of interesting work going on in this area.

Of course, there are languages designed explicitly to support communication, such as Limbo, Erlang, Haskell, and Clojure. For the most part, these languages are structured as communicating sequential processes, which is to say that they represent multiprocessing as a set of sequential threads that pass information to one another. Problems of thread safety are subsumed by the languages, but mapping the parallelism to available resources is generally left to the programmer or administrator.
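
The communicating-sequential-processes style can be imitated in a general-purpose language, which may make the idea more concrete. The sketch below uses Python's threading and queue modules; it is only an imitation of the style, not how any of the languages above implement it, and all of the function names are made up for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def producer(out_q, items):
    """Sequential process 1: emit items, then the end marker."""
    for item in items:
        out_q.put(item)
    out_q.put(_DONE)


def doubler(in_q, out_q):
    """Sequential process 2: double each item received from upstream."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(item * 2)


def run_pipeline(items):
    first, second = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(first, items)),
        threading.Thread(target=doubler, args=(first, second)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = second.get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Neither process touches shared state directly; everything flows through the queues. Note that the number of threads is still fixed by the programmer, which is exactly the resource-mapping problem described above.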

One interesting project is Apple's Grand Central Dispatch. It defines a simple but highly expressive closure syntax (a block) and a mechanism to dynamically schedule execution of such closures (queues). Critically, the GCD library takes care of scaling the parallelism of the queue processing appropriately to the underlying hardware. On a single-threaded CPU, this amounts to cooperative multitasking, but on parallel hardware the operating system can dynamically allocate virtual CPUs to applications needing more parallelism.
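
A loose analogue of the blocks-and-queues idea can be written with Python's concurrent.futures. This is not GCD and does not use its API: the pool here is sized once from os.cpu_count() rather than adjusted dynamically by the operating system, and run_blocks and make_block are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_blocks(blocks, max_workers=None):
    """Submit zero-argument callables to a pool sized to the hardware."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(block) for block in blocks]
        # Collect results in submission order.
        return [future.result() for future in futures]


def make_block(n):
    """Build a closure, standing in for a GCD block."""
    return lambda: n * n
```

For example, run_blocks([make_block(n) for n in range(5)]) returns the squares of 0 through 4. The caller describes the work as closures and leaves the degree of parallelism to the library, which is the part of GCD's design that matters here.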

This topic seems to come up often in my various pursuits, so I will return to it again.

Want to work on Amanda?

I've not made any secret of the fact that I want more people hacking on Amanda. This is both for selfish reasons -- many hands make light work -- and for altruistic reasons -- a broader community of developers can provide better governance for the project and long-term continuity. With a few notable exceptions, I haven't had a lot of satisfaction.

I think part of the reason is that Amanda has a steep learning curve, even within the new Perl code. The time to climb that curve is a big investment, and folks with only a small itch to scratch can't afford it.

In an effort to sweeten the pot, we (Zmanda) are offering to pay for flexible work on Amanda. Part-time or full-time, on your own schedule. Your choice of projects. Support and gratitude from the other hackers. And the option to become a full Zmanda employee if that's your bent.

Here are some possible projects, to pique your interest:

  • MySQL application (to round out the set with ampgsql)
  • Cyrus Imapd application (gnutar doesn't deal well with the application's tiny files and hard links)
  • OpenSSL for network transport, using certificates and keys for authentication
  • Database-backed backup catalog
  • Amvault upgrade
  • Handle Logical EOM (LEOM) on all devices that support it, drastically reducing the number of parts Amanda writes
  • Support for more cloud backends than just S3
  • Parallel writes to multiple devices

If you're interested, contact me (dustin@zmanda.com) and we'll work something out!

Thursday, March 25. 2010

What's New in Amanda: Postgres Backups

This is the second installment in a series of posts about recent work on Amanda.

The Application API allows Amanda to back up "structured" data -- data that cannot be handled well by 'dump' or 'tar'. Most databases fall into this category, and with the 3.1 release, Amanda ships with ampgsql, which supports backing up Postgres databases using the software's point-in-time recovery mechanism.

The how-to for this application is on the Amanda wiki.



Happy Ada Lovelace Day

Yesterday, March 24, was Ada Lovelace Day. I was at Pumping Station: One, and decided to spend an hour or so writing something to honor the first computer programmer. I was feeling singularly uninspired, and googling for "Ada Lovelace" didn't turn up anything interesting. But it did give me an idea: write a program that googles for you!



Monday, March 15. 2010

Solving an Encoding Mystery

I don't write about it here, but I've been getting into brewing beer. I downloaded an app for my iPhone, iBrewMaster, which helps me store recipes and track batches of homebrew through the brewing, fermenting, and serving stages.

I recently decided to make a clone of Dogfish Head's Raison D'être. This beer is fantastic, but that's beside the point. I added the recipe to the app, and clicked save. In the menu, however, I saw "Raison D'√™tre". Not pretty. The app has a feature where you create a "batch" from a particular recipe. I did so, and the name of the batch appeared as "Raison D'‚àö‚Ñ¢tre". Even worse!
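
The two garbled names are consistent with one particular failure: UTF-8 bytes being decoded as Mac Roman, once for the menu and twice for the batch. That diagnosis is an assumption drawn from the strings quoted above, not something stated in this excerpt, and the function name below is made up for illustration.

```python
def misdecode(text, rounds=1):
    """Encode as UTF-8, then wrongly decode as Mac Roman, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("mac_roman")
    return text


name = "Raison D'\u00eatre"        # Raison D'être
once = misdecode(name)             # Raison D'√™tre
twice = misdecode(name, rounds=2)  # Raison D'‚àö‚Ñ¢tre
```

Each round turns every non-ASCII character into two or three Mac Roman characters, which is why the second name is longer and uglier than the first.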



Friday, March 12. 2010

What's New in Amanda: Transfer Architecture

Amanda's primary mission in life is to move large quantities of data around. Historically, this has been done through a patchwork of methods, each written separately and with its own quirks. POSIX pipes, TCP sockets, shared memory, on-disk cache files -- Amanda's done it all. But these multiple implementations were error-prone, difficult to maintain, and often not the most efficient approach.

In an effort to remedy this, we introduced the transfer architecture, abbreviated XFA. This was technically included in Amanda-2.6.1, but was only used by amvault. In the upcoming Amanda-3.1 release, however, the XFA is central to all recovery operations, and is used internally by the taper (the portion of the backup system that writes to devices).

This post highlights some of the features of the transfer architecture, and some of the improvements we'd like to make.



What's New in Amanda: Automated Tests

This is the first in what will be a series of posts about recent work on Amanda. Amanda has a reputation as old and crusty -- not so! Hopefully this series will help to illustrate some of the new features we've completed, and what's coming up. I'll be cross-posting these on the Zmanda Team Blog too.

Among open-source applications, Amanda is known for being stable and highly reliable. To ensure that Amanda lives up to this reputation, we've constructed an automated testing framework (using Buildbot) that runs on every commit. I'll give some of the technical details after the jump, but I think the numbers speak for themselves. The latest release of Amanda (which will soon be 3.1.0) has 2936 tests!

These tests range from highly-focused unit tests, for example to ensure that all of Amanda's spellings of "true" are parsed correctly, all the way up to full integration: runs of amdump and the recovery applications.



Wednesday, February 17. 2010

Testing Legacy Code

I just read Roy Osherove's The Art of Unit Testing with Examples in .NET, on the advice of a Slashdot review. I was not terribly impressed with the book, but reading it did help me to solidify my thinking about testing and test-driven development, and put words to concepts I had come to on my own.

Rather late in the book, Osherove describes three properties of good tests.

  • Trustworthiness - Do developers believe that passing tests mean things are working? Do developers believe that failing tests indicate a real bug?
  • Maintainability - Do developers think that tests are easy to add and maintain, or are they likely to avoid writing tests when rushed?
  • Readability - Do developers often consult the unit tests to see how the system under test is supposed to work?

What most struck me was that these properties were related to developers' perceptions of the tests, not the tests themselves. Tests are as much a social artifact of a project as a technical tool.

Buildbot's Tests

Around the time I was reading this, one of the more prolific Buildbot contributors commented, "I try not to change the tests - they scare me." Buildbot's tests were badly isolated, slow, and failed intermittently. As maintainer, I had grown accustomed to saying "oh, that test fails sometimes, don't worry about it" - a trustworthiness failure. Because of the terrible isolation, changing just about anything in Buildbot would cause dozens of tests to fail, requiring repetitive editing to fix - not maintainable. And the tests consisted of long sequences of operations and assertions, written in the Twisted style, which is already not readable. As a result, even I don't know what most of the tests are actually testing. This was a bad situation for any application, but particularly embarrassing for a popular testing tool!

So I blew the tests away. Well, not really - I moved them to buildbot/broken_test/ in hopes they can be useful in writing new tests, and so that the braver souls among us can still run them. Now our metabuildbot is green, and I can legitimately ask for unit tests for new code.

There are costs associated with this move, too. A lot of people have worked very hard to write tests that have now been categorically labeled "broken," to whom all I can say is "I'm sorry". With far fewer tests and thus far worse coverage, it's also difficult to have confidence that Buildbot really works. The short-term workaround is to make a few beta releases and rely on real-world testing to suss out any problem.

So this is only the first step. We - I - still need to write real tests for the vast majority of the Buildbot code. That's particularly complicated because Buildbot's units are badly isolated, and interfaces are ill-defined. I will need to do a good bit of refactoring to bring it into compliance.

Saturday, February 13. 2010

Revising the allowForce option

Buildbot's WebStatus display has, for a long time, had an allowForce option which controls what kind of mayhem can be wrought via the web interface. Historically, this has been a boolean option: either web users can do everything (force builds and shut down slaves) or nothing. Bug 701 asks that we change that to give more granular access control.

Buildbot has an interesting way of separating the status display from the control functionality. It has two parallel interface hierarchies, IStatus and IControl, implementing the necessary methods. The IStatus hierarchy is illustrated with the orange bubbles here:

The IControl hierarchy is similar, although it only goes down to the Build level right now.

When allowForce is true, the WebStatus object adapts the buildmaster to the IControl interface and adds a link to the result in its control attribute. Forcing a build or shutting down a slave then uses this object to navigate to the appropriate control object and calls a method from the corresponding interface. If the control attribute is None, no access is allowed.

This scheme has the advantage that it is difficult to accidentally expose functionality, since when allowForce is false, the control methods are inaccessible. However, it has the disadvantage of not allowing any more granular level of access control.

I just reworked the web status to have a more flexible authorization mechanism, and while I wasn't able to remove the IControl hierarchy entirely, I was able to marginalize it to only those code blocks that need to perform controlled actions, instead of passing control objects all over the place.
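
To show what "more granular" might look like, here is a small sketch of a per-action policy object. It is purely hypothetical: ActionPolicy, its action names, and from_allow_force are invented for this illustration and are not Buildbot's actual classes or configuration options.

```python
class ActionPolicy:
    """Per-action switches replacing a single all-or-nothing boolean."""

    KNOWN_ACTIONS = ("force_build", "shutdown_slave")

    def __init__(self, **allowed):
        unknown = set(allowed) - set(self.KNOWN_ACTIONS)
        if unknown:
            raise ValueError("unknown actions: %s" % ", ".join(sorted(unknown)))
        # Anything not explicitly enabled is denied.
        self._allowed = dict((action, bool(allowed.get(action, False)))
                             for action in self.KNOWN_ACTIONS)

    def allows(self, action):
        return self._allowed.get(action, False)

    @classmethod
    def from_allow_force(cls, allow_force):
        """Map the old boolean onto the new policy: everything or nothing."""
        return cls(**dict((action, allow_force) for action in cls.KNOWN_ACTIONS))
```

With this shape, ActionPolicy(force_build=True) lets web users force builds but not shut down slaves, while from_allow_force keeps old configurations behaving as before. Denying by default preserves the property noted above, that it is difficult to expose functionality by accident.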

Notice

The postings on this site are my own and don't necessarily represent the opinions of Zmanda, Inc.