Managing lots of servers

Operating systems were designed around the notion of large centralised systems, a model that is increasingly out of date. There's too much per-node configuration splattered all over the place. You have to manually set up processes to deal with it all; otherwise you start to lose the ability to replace machines, since their configuration contains mysteries that you may or may not be able to recreate if you rebuild the OS install. You really need to be able to regenerate any machine from scratch in as little time as possible - and not just by restoring from a backup: you might need to recreate the function of a machine on new hardware, since you can't always get replacements for old setups.

Now, if you can get away with it, you can build your application to support distributed operation, then just set up a sea of identical OS installs using a configuration script that sets up a server and runs an instance of your distributed app. Sadly, a lot of off-the-shelf software fails to support that kind of operation. If you want to run standard apps that use filesystems and SQL databases to store their data, you need to be cunning.

How can we do better? Here are a few (rather brief) notes on an approach I'm experimenting with.

Boot your OS from a CD

Not quite a liveCD, though, since we do want to actually use the local disk for stuff.

  • modified RC chain that (union?) mounts /etc from the hard disk and then runs with that to get local network etc. configuration.
  • swap, /tmp, /var etc on the hard disk, obviously.
  • Makes rolling out new versions of the OS easy; forces you to prototype the system on a hard disk on a staging server or VM, then when it's ready, burn a CD, test the CD in a staging server, then if it works, burn a hundred copies and roll them out. USB sticks are another option, but a little more awkward in datacentre environments. The cost of having a human go and re-CD every server is real but low, and the manual step provides safety compared to automatic rollouts that could go disastrously wrong. Being able to roll back by putting the old CD back, and having a truly read-only root filesystem and OS (making it harder to hide rootkits), is great, though!

Use Xen

  • The actual loaded OS is just a Xen dom0 setup
  • Prebuilt domU root images exist on the CD-ROM, which are then spun up (based on settings in /etc on the hard disk). The root images get given a partition from the hard disk which contains their /etc and swap, and any local storage they need, in much the same way as dom0 boots directly from the CD.
  • Or your read-only domU root images could be stored on the hard disks of the servers and rolled out via the network; the advantages of distributing them on CD-ROM are a lot smaller than for the dom0 OS, as dom0 can enforce the read-only nature of the domU images, provide remote access to roll back to an earlier version and try again if an upgrade turns out to be bad, etc.

Virtualise storage

  • Use local storage on servers just for cached stuff and temporary storage. E.g., we have each server's configuration stored on local disk so it can boot, but that's just checked out from Subversion. We put swap on local disks. But the contents of any server's disks should be recreatable by checking the configuration out from SVN again and/or rsyncing any shared mainly-read-only data (domU images etc) from authoritative copies.
  • For actual data that we care about, use network protocols (iSCSI, NFS, SQL, etc) to talk to special reliable storage services.
  • For domUs that have a critical local filesystem, we use iSCSI. However, we use software RAID to mirror (or parity-protect) the filesystem over more than one physical iSCSI server, so that either can fail without losing data or causing downtime. Since the domU itself then stores nothing, should it fail (or the physical server hosting it fail), an exact duplicate can be brought up on another physical server and it will connect to the same iSCSI servers to provide access to the same data (and we hope that the filesystem used can recover from any corruption that arose during the failure, or else we're toast anyway).
  • Higher level storage protocols (NFS, SQL, etc) are served out from domUs that, as above, have stable block-level storage from software-RAIDed iSCSI backends. And, likewise, should the NFS server go down, we can resurrect an identical clone of it from the same iSCSI backend disks and it will carry on with the state the failed one left behind.
  • But where possible, use proper distributed/replicated databases!


  • The dom0 ISO contains a bootloader, Xen, and NetBSD set up as a dom0 kernel, with /usr/pkg containing a bunch of useful core packages (sudo, subversion-base, screen, xentools, etc)
  • The dom0 ISO's RC chain will:
    1. Find the first partition on the first disk in the server that contains a special marker file, and mount it as /config
    2. union-mount /config/local/etc/ over /etc
    3. now read /etc/rc.conf
    4. Run the normal /etc/rc tasks, including mounting /var and /tmp from the hard disk, mounting data partitions and setting up networking, ipnat, ipf, etc.
    5. Scan the list of Xen domUs to start from /config/local/domUs/* and start them, each with the correct disk images (from the data partitions), MAC address, and memory allocations.
  • /config/local and /config/global are svn checkouts
  • On all machines (dom0s and domUs), /etc/hosts is a symlink to /config/global/hosts, and the same goes for any other such useful files.
  • domUs run pkg_chk, but don't have /usr/pkgsrc; they fetch compiled binary packages from a repository domU running the same base OS, which builds every package in pkg_chk.conf. This domU might need to be the NIS master, since that would be the only way to keep pkgsrc-created role user UIDs in synch.
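
Step 5 of the RC chain could be sketched roughly as follows. This is only an illustration under stated assumptions: the notes above don't say what the files in /config/local/domUs/ look like, so the "key = value" format and the key names memory, disk and mac are hypothetical, and the generated text is only the skeleton of an xm-style domain config (a real one would also need a kernel line and so on).

```python
from pathlib import Path


def parse_settings(text):
    """Parse 'key = value' lines, ignoring blank lines and # comments.

    The file format is an assumption made for this sketch.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def xen_config(name, settings):
    """Render a skeleton xm-style domain config for one domU."""
    lines = [
        f'name = "{name}"',
        f'memory = {int(settings["memory"])}',
        f'disk = ["{settings["disk"]}"]',
        f'vif = ["mac={settings["mac"]}"]',
    ]
    return "\n".join(lines) + "\n"


def domu_configs(domu_dir):
    """Yield (name, config text) for each settings file, in sorted order."""
    for path in sorted(Path(domu_dir).iterdir()):
        if path.is_file():
            yield path.name, xen_config(path.name, parse_settings(path.read_text()))
```

The boot script would then call domu_configs("/config/local/domUs"), write each config somewhere under /var, and hand it to the Xen tools to start the domain.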

How to bootstrap it

  • We need documented procedures for setting up a dom0 iso image, to make sure no important steps are missed...
    • Make a working directory
    • Install NetBSD source sets
    • Set up custom /etc/rc that finds a suitable filesystem to locate /etc from and mounts it as /config - or drops to a shell if none can be found.
    • Make a Xen3 dom0 kernel with "config netbsd root on cd0a type cd9660 dumps on none" and "options INCLUDE_CONFIG_FILE"
    • Put in the Xen 3 kernel
    • Configure grub menu.lst to load NetBSD on top of Xen.
    • Install core packages (xen tools) - /var/db/pkg and /usr/pkg will union mount over what we provide to allow for node-local extensions, although we shouldn't need too many in dom0.
    • Install grub and mkisofs as per
  • We need domU read-only root filesystem images created along a similar theme
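
The logic of the custom /etc/rc (find a suitable filesystem, mount it as /config, or drop to a shell) might look something like this. It's a sketch of the decision only, written in Python for readability; the real thing would be shell in the RC chain. The partition names and the probe callback are assumptions: probe(dev) stands in for "mount this partition read-only and check whether the marker file is there".

```python
def pick_config_partition(partitions, probe):
    """Return the first partition for which probe(dev) is true, else None.

    partitions must already be in the order we want to search them
    (first disk first, first partition first).
    """
    for dev in partitions:
        if probe(dev):
            return dev
    return None


def boot_action(partitions, probe):
    """Decide what the RC chain should do next.

    Returns ("mount", dev) if a config partition was found, meaning
    "mount dev as /config and carry on", or ("shell", None) meaning
    "nothing usable found, drop to a shell".
    """
    dev = pick_config_partition(partitions, probe)
    if dev is None:
        return ("shell", None)
    return ("mount", dev)
```

For example, boot_action(["wd0a", "wd0e", "wd1a"], probe) picks wd0e if both wd0e and wd1a carry the marker, since the search stops at the first match.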

Subversion layout:

  • /pservers/ - root of /config/local for each physical server
  • /vservers/ - root of /config/local for each virtual server
  • /global - /config/global
  • /docs - configuration checklists and scripts for dom0 and domU master images
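
Given that layout, populating /config on a freshly built node comes down to two checkouts. A minimal sketch, assuming each server has a subdirectory named after it under /pservers/ or /vservers/, and with the repository URL being a made-up placeholder:

```python
def checkout_plan(repo_url, node, virtual=False):
    """Return the svn commands that populate /config for one node.

    Assumes the node's tree lives at <repo>/pservers/<node> (physical)
    or <repo>/vservers/<node> (virtual); that naming is an assumption.
    """
    tree = "vservers" if virtual else "pservers"
    base = repo_url.rstrip("/")
    return [
        ["svn", "checkout", f"{base}/{tree}/{node}", "/config/local"],
        ["svn", "checkout", f"{base}/global", "/config/global"],
    ]
```

Each command list can be passed straight to subprocess.run(cmd, check=True). Keeping the plan separate from the execution makes it easy to print what would happen before letting it loose on a real machine.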


  • By David McBride, Wed 27th Aug 2008 @ 10:02 pm

    You may find "Bootstrapping an Infrastructure" interesting reading. It dates from 1998, so some of the tech examples are out of date, but it describes what you're doing quite closely.

    The idea of having several basic layers -- a VM hosting layer, a distributed, reliable storage layer, and a higher layer of auto-installed VMs that provide all your core functionality -- is very attractive. (It's what DoC's been doing for years, for various degrees of storage distribution and virtualization.)

    The idea that individual chunks of server hardware are expendable, however, is a useful one -- when building on unreliable commodity hardware, it's vital! Key to this is the ability to either run up a clone -- or regenerate from first principles -- the machine image that's providing a particular service, and keeping its persistent datasets on a separate reliable storage infrastructure.

    Cluster filesystems -- GFS, GFS2, Lustre, etc -- are starting to look exciting and usable, though I haven't had a chance to play with them yet. They look particularly important for IO-intensive workloads.

    Control layers for your VMs start to become important. Most of the time, you want scripts on the work machines to pull changes from central repositories -- see DoC's Sys::Maint infrastructure, triggered by Cron. Their function is to take a (possibly broken, or minimally bootstrapped) machine image, and to make it good and ready for normal operations. This can include everything from setting up NTP to installing packages to making sure the correct system configuration files have been installed.

    The other important control layer is the ad-hoc "push" mechanism -- for when you need to execute an important change to a (total) subset of machines right now. Batch tools that connect over SSH using some means of automated authentication are good for this; for example, see my remote script.

    It sounds like you're definitely thinking along the right lines, though how much of this you need to implement may vary depending on how far up you need to scale (and with what resources).

  • By @ndy, Thu 28th Aug 2008 @ 9:53 am


    As David points out, the "Congruent Infrastructure" stuff at is a good read.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales