Automatic Backups with rsync

Ted Ruegsegger

Abstract

Describes a simple method to build and operate a server that maintains copies of specified client file trees, using the efficient rsync tool to capture changes, secure shell (ssh) so that network connections need not be trusted, and keychain to allow ssh keys to be used in unattended operation.

What's rsync?
Why another rsync How-To?
Objectives
Ingredients
Configuring the Server
Configuring the Clients
Copying Manually
A Word on Strategy
Automating It
Restoring
What About Windows Clients?
What About Macintosh Clients?
Enhancements
Resources

What's `rsync`?

rsync is a tool to replicate files between two locations, typically on separate hosts connected by a network. It uses a clever algorithm to detect differences in files so that only the differences need be transferred, making regular backups efficient and fast.

Why another `rsync` How-To?

The manpage and the wealth of documentation that comes up in a Google™ search can daunt the reader who simply wants backups, because most of it discusses other uses of rsync (for example, running a file server—essentially a more efficient ftp archive—and mirroring websites). Also, much of the guidance on using ssh in scripts proposes using a key with a null passphrase, a Bad Practice.

I waded through a lot of verbiage before I understood how to do what I want. In fact, it's simple and straightforward. I wrote this to save others the time and bother.

Objectives

Establish a server to maintain copies of file trees owned by various users on any number of computers.
Use rsync over ssh to replicate the file trees.
Automate the process so the backups take place without user intervention, using keychain to manage the ssh keys.

Ingredients

1 unused computer: I had a 333MHz Pentium II with 128 MB RAM. I understand rsync does take more than trivial amounts of processing power and RAM, but I don't have numbers (if you do, please share).
1 modest-sized disk for the operating system: I had a 1.2GB drive lying around, but half that would do easily. In fact, there's no particular reason you couldn't put the OS on the data disk, assuming you can boot from it.
1 big honkin' disk: I got a 250GB monster, several times what I think I need, but we all know how disk space needs grow [Update: just a few years later, space is running low; on the other hand, terabyte disks are now commonplace].
Good ol' Debian GNU/Linux: Naturally, you're welcome to use your favorite distribution of your favorite OS, as long as you can get rsync, ssh, and keychain for it. If you're using GNU/Linux, I recommend Debian (stable) for any production server. This has the advantage of being almost trivially easy to update without breaking your applications, since Debian stable installs only fixes and security patches and never upgrades to a functionally different version of a package.

Configuring the Server

Install the operating system and essential packages. For those using Debian GNU/Linux (stable), I can offer some tips:
- Since I build such servers often, I've assembled a "standard" set of configuration files and scripts, my Debian HostConfig Kit (see Resources, below) to make it easy to set up a new box. Unless you're planning to run other services on the box, you can omit most of the application packages and the desktop environment.
- Manage all configuration files, scripts, and documentation using a configuration management tool like CVS. Typically, the CVS root won't be on the same server; if it is, be sure to back it up.
If needed, build a custom kernel; in my case, when I first built it I needed to enable the ATA133 controller so I could see all of the big honkin' disk, but more recent stock kernels work just fine.
Install an ext3 filesystem on the big honkin' disk.
Create a mount point /bkp and mount the big honkin' disk. Don't forget to add an entry to /etc/fstab like:
/dev/hde1 /bkp ext3 defaults 0 2
Make a directory under /bkp for each client machine; in my case I have /bkp/grins/, /bkp/mononoke/, /bkp/pikachu/, and so forth.
Be sure to install rsync and ssh on the server.
Create users and groups as needed, preferably the way you have them on the client boxes.
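
The per-client directory step is easy to script if you have more than a handful of machines. What follows is only a sketch: the function name make_client_dirs is hypothetical, and the client names are the examples used above.

```shell
#!/bin/sh
# make_client_dirs ROOT CLIENT...
# Create one backup directory per client under ROOT (for example /bkp).
# Hypothetical helper; it is not part of rsync or the HostConfig Kit.
make_client_dirs() {
    root=$1
    shift
    for client in "$@"; do
        # -p makes the call harmless if the directory already exists
        mkdir -p "$root/$client" || return 1
    done
}

# Example (run as a user allowed to write to /bkp):
#   make_client_dirs /bkp grins mononoke pikachu
```

Ownership and group permissions on the new directories still have to be set to suit the users who will write into them.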

Configuring the Clients

Install rsync, ssh, and keychain on each client. Then, for each user that will be running rsync:

Log in as that user.
Generate ssh keys, specifying passphrase(s) when prompted:
ssh-keygen -t rsa
ssh-keygen -t dsa
Define an alias to start ssh-agent and load the ssh keys.
For Bourne-type shells (sh, bash, ksh, zsh...):
alias gokeychain="keychain --nogui $HOME/.ssh/id_rsa $HOME/.ssh/id_dsa ; \
. $HOME/.keychain/$(hostname)-sh"

For csh-compatible shells:

alias gokeychain 'keychain --nogui ~/.ssh/id_rsa ~/.ssh/id_dsa ; source ~/.keychain/${HOSTNAME}-csh'
Execute the alias. keychain will start the agent and prompt for the passphrase(s). After that, the keys will be in memory until you explicitly remove them or reboot the machine.
Install the public keys in the corresponding authorized_keys file on the server. For example, if I intend to copy files using user account ted on server nox:
ssh-copy-id ted@nox

To reload the keys, typically after rebooting the machine, for each user that will be running rsync:

Log in as that user.
Execute the alias (what I've called gokeychain above).
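
A script or a new login shell can pick up the running agent by sourcing the same file the alias sources. Here is a minimal sketch for Bourne-type shells, assuming keychain has written its environment file to the location used in the alias above. The function names keychain_env_file and load_agent_env are made up for this illustration.

```shell
#!/bin/sh
# keychain_env_file: print the path of the environment file that keychain
# writes for Bourne-type shells on this host (same path as in the alias).
keychain_env_file() {
    echo "$HOME/.keychain/$(hostname)-sh"
}

# load_agent_env: source that file if it exists, so that the variables it
# sets (such as SSH_AUTH_SOCK) point at the running agent.
# Returns 1 if the file is missing, e.g. before gokeychain was ever run.
load_agent_env() {
    envfile=$(keychain_env_file)
    [ -f "$envfile" ] || return 1
    . "$envfile"
}

# Example: after a reboot, check whether the keys still need loading.
#   load_agent_env && ssh-add -l || echo "run gokeychain first"
```

ssh-add -l lists the keys the agent currently holds, so it is a quick way to confirm that the passphrases have been entered since the last reboot.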

Copying Manually

Look over the rsync manpage, but ignore all the stuff about running rsync in daemon mode; that's for a public service, essentially a more efficient ftp server, and doesn't encrypt the traffic. In particular, examine the command-line options to rsync and identify the ones you need for your situation.
For each client, back up the file trees you need. For example, to back up my files on mononoke, I might run (as regular user ted):
cd /home
rsync -av --delete --delete-excluded \
    --exclude "tmp" \
    --exclude "[cC]ache" \
    ted ted@nox:/bkp/mononoke

where:

-a

means "archive", equivalent to -rlptgoD. It's a quick way to say you want recursion and want to preserve the file attributes as they are on the client.

-v

is "verbose"; you can drop that once you're confident it's all working. While you're testing, you might add --progress to give you more info.

--delete

means delete stuff from the target if it's gone from the source.

--delete-excluded

goes even further: it also deletes files on the target that match your --exclude patterns, so the target ends up holding only what you're asking it to back up.

--exclude

is just what it sounds like; pick the ones that work for you, using the "EXCLUDE PATTERNS" section of the manpage for guidance on specifying patterns. See the note on strategy, below.

ted

specifies the directory tree(s) under /home that I wish to replicate. See the note on strategy, below.

ted@nox:/bkp/mononoke

says log into the server (nox) as user ted and put all this stuff under /bkp/mononoke/
The first time you run the command, it will take considerable time to copy everything to the server. When you repeat the command after that it's much quicker—it takes a while to deliberate about what needs to be transferred, then transfers only the files, or parts of files, that have changed.
Notes on users and permissions:
- The user on the client machine obviously needs read access to all the files and directories to be copied.
- The user on the server needs to be able to write the files. You may need to add the server user to various groups in order to achieve this. It's a good idea to have a consistent set of user and group IDs on the server and all clients.
- You could also run the backups as the root user on the client, the server, or both, eliminating all permission issues, but raising other issues when you automate the process (I'm uncomfortable having root's ssh keys in memory).
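
Because --delete and --delete-excluded remove files on the server, it can be worth previewing a run first. rsync's --dry-run option (short form -n) reports what would be transferred or deleted without changing anything. The sketch below wraps the example command from above. The function name preview_backup and the variables RSYNC and SRC_PARENT are assumptions added for illustration, so that the command can be printed rather than executed.

```shell
#!/bin/sh
# preview_backup USER SERVER CLIENT
# Run the manual backup command with --dry-run added, so rsync only
# reports what it would transfer or delete.
# RSYNC and SRC_PARENT are hypothetical overrides, e.g. RSYNC="echo rsync"
# prints the command line instead of running it.
preview_backup() {
    user=$1
    server=$2
    client=$3
    (
        # Subshell: the caller's working directory is left unchanged
        cd "${SRC_PARENT:-/home}" || exit 1
        ${RSYNC:-rsync} -av --dry-run --delete --delete-excluded \
            --exclude "tmp" \
            --exclude "[cC]ache" \
            "$user" "$user@$server:/bkp/$client"
    )
}

# Example:
#   preview_backup ted nox mononoke
```

Once the preview lists only what you expect, run the real command without --dry-run.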

A Word on Strategy

It's tempting to exclude just obvious files (like "cache") and then explicitly include the directories I want to back up. For a single, manual backup, this is ok, but for automated backups, this is a poor strategy; if a user adds a new top-level directory on one of the clients, it won't get backed up unless I explicitly add it to the script. This violates my "no user intervention" objective.

A better approach is to specify all directories with a * (or by naming the parent directory) and then add an --exclude clause for each tree that I don't want. This way, any new directory gets backed up automatically.

Of course, there are exceptions. For example, suppose we're certain that all valuable stuff gets placed only in certain subdirectories and never in the parent and that, furthermore, the parent accumulates lots of files and directories whose names might not be predictable. In such a case, it makes sense to start in the parent directory and specify the directories we want, knowing that any newly-added stuff we care about will always be in one of them. That's easier than running a separate rsync for each subdirectory, or trying to keep up with excluding files in the parent that come and go.
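
To see why the exclude-based strategy picks up new directories on its own, here is a small model of it. It does not call rsync, and it only approximates rsync's pattern matching: it looks at the top level of one directory and uses ordinary shell globs. The function name list_backed_up is hypothetical.

```shell
#!/bin/sh
# list_backed_up PARENT PATTERN...
# Print the top-level entries of PARENT whose names match none of the
# PATTERNs (shell globs), one per line. Dotfiles are not listed because
# the * glob skips them.
list_backed_up() {
    parent=$1
    shift
    for entry in "$parent"/*; do
        [ -e "$entry" ] || continue
        name=${entry##*/}
        skip=no
        for pat in "$@"; do
            case $name in
                $pat) skip=yes ;;
            esac
        done
        [ "$skip" = no ] && echo "$name"
    done
    return 0
}

# Example:
#   list_backed_up /home/ted tmp '[cC]ache'
```

A directory created tomorrow shows up in the output without any change to the pattern list. With an explicit list of directories, it would be missed until someone edited the script.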

Automating It

To save yourself drudgery and error, put it all into a script. Let the script build the command based on the user and the client hostname. Put the script somewhere where each user can execute it, like /usr/local/bin/syncfiles.sh on each client. It should look something like this (by default, any user on any host will back up that user's home directory, but you can add case clauses for particular users and hosts):

#!/bin/sh
######################################################################
# syncfiles.sh
# Replicate file trees to server using rsync
# Usage:
#     sh syncfiles.sh
# or call from cron (make sure ssh key is loaded beforehand)
# Requirements:
#     Local user name must match remote (on server) user name
######################################################################
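Host=$(hostname)
User=$(whoami)
# Start or reuse ssh-agent via keychain, then import its environment
keychain "$HOME/.ssh/id_rsa" "$HOME/.ssh/id_dsa"
. "$HOME/.keychain/${Host}-sh"
# Extra --exclude options for particular users or hosts go here
Excludes=
# The source path below is relative to /home; stop if we can't get there
cd /home || exit 1
#rsync -e ssh -av --delete --delete-excluded \
rsync -e ssh -a --delete --delete-excluded \
 --exclude "tmp" \
 --exclude "[cC]ache" \
 $Excludes "$User" "$User@nox:/bkp/$Host"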
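
The case clauses for particular users and hosts could look something like the sketch below. The function name extra_excludes is hypothetical, and the host/user pairs and patterns are made-up examples, not recommendations.

```shell
#!/bin/sh
# extra_excludes HOST USER
# Print additional --exclude options for particular hosts and users.
# The entries below are invented for illustration.
extra_excludes() {
    case "$1:$2" in
        mononoke:ted) echo '--exclude=*.iso --exclude=Downloads' ;;
        pikachu:*)    echo '--exclude=mp3' ;;
        *)            echo '' ;;
    esac
}

# In syncfiles.sh, this would take the place of the empty assignment:
#   Excludes=$(extra_excludes "$Host" "$User")
```

Since the script expands $Excludes unquoted, the shell splits it into separate options at each space. That means patterns containing spaces won't survive this approach; for those, rsync's --exclude-from=FILE option, which reads patterns from a file, is probably the better tool.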

Once you've decided what to back up, decide when and how often.

For a laptop used mainly by a single user, connecting to the LAN intermittently, it may be sufficient to execute /usr/local/bin/syncfiles.sh manually from time to time.

For hosts that reside on the LAN, or that have multiple users, it makes sense to schedule the rsync operations with cron, with a separate crontab entry for each user. For example, ted's crontab entry might look like this:

# Back up files to Nox nightly at 03:36 AM:
36 3 * * * sh /usr/local/bin/syncfiles.sh
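
An unattended job can fail quietly, so it may help to keep a log of each run. Here is one possible wrapper, a sketch only: the function name run_logged and any log file location are assumptions, not part of the setup described here.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARG...]
# Run COMMAND, appending its output plus start/end markers to LOGFILE,
# and return COMMAND's exit status.
run_logged() {
    logfile=$1
    shift
    echo "$(date '+%Y-%m-%d %H:%M:%S') start: $*" >> "$logfile"
    "$@" >> "$logfile" 2>&1
    status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') end: exit status $status" >> "$logfile"
    return "$status"
}

# Example (the log path is illustrative):
#   run_logged "$HOME/syncfiles.log" sh /usr/local/bin/syncfiles.sh
```

A nonzero exit status in the log is the cue to check whether the ssh keys are still loaded or the server was unreachable.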

Restoring

Since catastrophes rarely happen, a painless automated backup system that quietly and reliably does its job can lull us into forgetting the whole point of doing backups in the first place: restoring our data. Make a point of running some test restores when you first start making backups, so you can note any surprises. Ideally, you should run a test restore on a regular basis.

Note: Don't use scp to restore; you'll have problems with links. As far as I can tell, scp doesn't understand links and simply treats them as regular files or directories. This will make duplicates and, in the worst case, can make endless loops (for example, if a symbolic link points to a parent directory).

To restore using rsync, just reverse the direction: swap source and target, and omit the --exclude and --delete options (unless we don't want to restore everything we backed up). For example, if we backed up the contents of the /home/ted directory with these commands:

cd /home
rsync -e ssh -av --delete --delete-excluded \
    --exclude "tmp" \
    --exclude "[cC]ache" \
    ted ted@nox:/bkp/mononoke

then to restore them we use these commands:

cd /home
rsync -e ssh -av ted@nox:/bkp/mononoke/ted .

Now ted@nox:/bkp/mononoke/ted is the source and . (the current directory) is the target. Note that on our client either the directory /home/ted must already exist or we must have permissions to create it.

We can also restore individual files, which probably happens more often than a catastrophic loss of entire directories or disks:

cd /home/ted/recipes
rsync -e ssh -av ted@nox:/bkp/mononoke/ted/recipes/fondue.html .
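
For the test restores recommended above, one low-risk approach is to restore into a scratch directory and compare the result with the live files. The sketch below uses diff -r for the comparison. The function name trees_match and the scratch path are hypothetical.

```shell
#!/bin/sh
# trees_match DIR_A DIR_B
# Return 0 if both trees contain the same files with the same contents.
# diff -r compares names and contents only; it does not check owners,
# permissions, or timestamps.
trees_match() {
    diff -r "$1" "$2" > /dev/null 2>&1
}

# Example test restore into a scratch directory (paths are illustrative):
#   mkdir /tmp/restore-test && cd /tmp/restore-test
#   rsync -e ssh -av ted@nox:/bkp/mononoke/ted/recipes .
#   trees_match /home/ted/recipes /tmp/restore-test/recipes && echo "restore OK"
```

Expect differences for anything that was excluded from the backup or changed since the last run. A mismatch is a prompt to look closer, not proof that the backup is bad.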

What About Windows Clients?

Yes, it's possible to run rsync from Windows clients, should you have users thus afflicted.

If they happen to be running Cygwin (see Resources, below), just follow the same instructions as for GNU/Unix, above. Cygwin includes all the packages you need, including keychain.
As an interim solution, run a Samba server, encourage the Windows users to keep their important data on their shares ("network drives"), and back up the Samba server with rsync.
If you don't want to install a full-blown Cygwin just for backups, note that the Cygwin installer lets you set up a minimal system and then just add the tools and libraries you need. Geoff Breach, in Sys Admin magazine (see Resources, below), describes such an approach (skip down to "Installing Win32 Client Software"). If I were to go to all that trouble, I would also add keychain rather than use a key with a null passphrase.

Since I'm lazy, I was delighted to find that ITeF!x (see Resources, below) combined rsync and elements of Cygwin to build cwRsync, distributed as a single "Installer" file for Windows. That's the approach I describe here. Its only drawback is the use of a null passphrase (the author says he plans to add support for keychain), but the easy setup makes it the best method I've found so far for Windows. I've tested it on Windows 98 and Windows XP.

Download cwRsync from the ITeF!x site.
Unzip and run the installer; you can omit the rsync server unless you want it for some other reason. Assume cwrsync is installed in C:\Program Files\cwrsync\ in the following examples. In DOS batch scripts we'll write this as C:\progra~1\cwrsync\ since scripts have trouble with embedded spaces.
Be sure the Windows box can find the backup server; if you don't have a local DNS server, then put the backup server's name and IP address in the hosts file (windows\hosts in Win9x, windows\system32\drivers\etc\hosts in WinXP), or just use the backup server's IP address in your rsync scripts.

For each user on the Windows machine:

Establish a "home directory"; ssh will use this location to maintain keys and the known_hosts list. Assume our user is "ebenezer" with home directory C:\home\ebenezer in the following examples.
As the user, in a "DOS command window", generate ssh keys with, alas, null passphrases:
c:
cd \home\ebenezer
mkdir .ssh
c:\progra~1\cwrsync\ssh-keygen -t rsa -N "" -f .ssh\id_rsa
c:\progra~1\cwrsync\ssh-keygen -t dsa -N "" -f .ssh\id_dsa
By any means available, copy the public keys (here, id_rsa.pub and id_dsa.pub) to the backup server and append them to the user's $HOME/.ssh/authorized_keys file. A simple way is to use rsync interactively; since the keys aren't yet installed, it will prompt for a password. Assuming you've copied them to the /tmp directory, on the backup server:
cat /tmp/id_rsa.pub /tmp/id_dsa.pub >> /home/ebenezer/.ssh/authorized_keys
Copy the script template C:\Program Files\cwrsync\cwrsync.cmd to the user's home directory, changing the extension to .bat for Win9x (WinXP will run either form). Use Windows Explorer or just type:
copy C:\progra~1\cwrsync\cwrsync.cmd C:\home\ebenezer\cwrsync.bat
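
The server-side step of appending the public keys can be wrapped in a small helper that also sets the permissions. This matters because sshd, with its default StrictModes setting, may ignore an authorized_keys file that other users can write to. The sketch below runs on the backup server, not on Windows. The function name install_pubkeys is hypothetical.

```shell
#!/bin/sh
# install_pubkeys HOMEDIR KEYFILE...
# Append each public key file to HOMEDIR/.ssh/authorized_keys, creating
# the directory and file if needed and restricting access to the owner.
install_pubkeys() {
    home=$1
    shift
    mkdir -p "$home/.ssh" || return 1
    chmod 700 "$home/.ssh"
    cat "$@" >> "$home/.ssh/authorized_keys" || return 1
    chmod 600 "$home/.ssh/authorized_keys"
}

# Example, run on the backup server as ebenezer:
#   install_pubkeys /home/ebenezer /tmp/id_rsa.pub /tmp/id_dsa.pub
```

Running it twice appends the keys twice. That is harmless but untidy, so remove the copies in /tmp once the keys are installed.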

Edit the script. It sets some environment variables, then executes the actual rsync commands, and finally resets the environment variables (not sure why, but I'm no DOS expert). Note that the primitive DOS command-line feature in Windows has a limited line length and doesn't support continuation lines, so a typical rsync command line would be too long. As a workaround, use environment variables; that also makes it all more readable. You should end up with something like this:

@ECHO OFF
REM **********************************************************
REM
REM CWRSYNC.CMD - Batch file to start your rsync command (s).
REM
REM By Tevfik K. (http://itefix.no/itefix-en)
REM
REM **********************************************************
SET CWRSYNCHOME="C:\progra~1\CWRSYNC"
SET CYGWIN=nontsec
SET HOME=C:\home\ebenezer
SET CWOLDPATH=%PATH%
SET PATH=%CWRSYNCHOME%;%PATH%
REM ** CUSTOMIZE ** Enter your rsync command(s) here
SET RSYNCCMD=rsync -e %CWRSYNCHOME%\ssh -av --delete --delete-excluded
SET EXCLUDES=--exclude "[Tt]emp" --exclude "RECYCLE[DR]"
SET EXCLUDES=%EXCLUDES% --exclude '*[Cc]ache' --exclude '[Cc]ache*'
SET EXCLUDES=%EXCLUDES% --exclude 'Temporary Internet Files'
SET REMOTE=ebenezer@nox.home:/bkp/winbox
c:
cd \
SET DIRS=dev docs home mp3 "My Documents" ssh
echo Backing up from C: drive: %DIRS%
%RSYNCCMD% %EXCLUDES% %DIRS% %REMOTE%/c
d:
cd \
echo _________________________________________________________
echo Backing up from D: drive: [all]
%RSYNCCMD% %EXCLUDES% * %REMOTE%/d
set HOME=
set CWRSYNCHOME=
set CYGWIN=
set PATH=%CWOLDPATH%

where the rsync options and arguments are the same as before. Environment variables are invoked with %varname%. Note that we're handling each lettered disk drive separately. Note also that, for the C: drive, we're explicitly specifying subdirectories to back up, since we put all the stuff we care about inside them and the root directory tends to accumulate garbage we don't need. This is the exception mentioned in the note on strategy, above. But beware: if you add a subdirectory to C: that needs backing up, you'll have to edit the script.

In the user's home directory, where the batch script is, make a shortcut.
In Win9x, the script may get "Out of environment space" errors. To fix, in Windows Explorer, right click on the shortcut file, select Properties, select the Memory tab, and change the Initial environment setting to something other than Auto (1024 worked for me).
Test the script by double-clicking the shortcut. As before, the first time you run the command, it will take considerable time to copy everything to the server. When you repeat the command after that it's much quicker.
If you want a shortcut on the desktop or elsewhere, copy the shortcut you just made.
Windows has a "Task Scheduler" (Start/Programs/Accessories/System Tools/Scheduled Tasks) with which you can run the backup script automatically. Be sure you point it to the shortcut (the .pif file) rather than the batch script so that you get the environment space setting.

What About Macintosh Clients?

I understand MacOSX is a FreeBSD derivative, so presumably you can follow the same instructions as for GNU/Unix, above, probably with some adjustments. Previous versions of MacOS may or may not support rsync. This exhausts my knowledge of the Macintosh world. If someone kindly points out any documentation that would help Mac users use rsync, I'll be happy to link to it.

Enhancements

Can't leave well-enough alone? Possibilities abound:

Add another big honkin' disk as a RAID mirror of the first.
Add another big honkin' disk and use rsync to mirror the first.
Use a tape drive or DVD burner to make archival backups, now that everything you want to preserve is in one place.
Use rsync to replicate the big honkin' disk on a counterpart that's geographically removed.

Resources

Contacting me with questions or suggestions: Send email to:
Debian Installation Manual: http://www.debian.org/releases/stable/installmanual
Open Source Development with CVS: http://cvsbook.red-bean.com/
My Debian HostConfig Kit: http://www.tux.org/~tbr/debianhostconfig
J.C. Pollman's NOVALUG presentation, where I first learned about rsync: http://novalug.tux.org/presents/cooltools/
Cygwin: Cygwin is "a Linux-like environment for Windows" or, more accurately, "a GNU-like environment for Windows" since it provides an emulator in place of a (Linux or other) kernel. With Cygwin, you can run all kinds of GNU/Unix applications and services without leaving your beloved Windows operating system. http://www.cygwin.com/
Breach, Geoff, "Windows Client Backups with rsync and FreeBSD", Sys Admin, Vol 13, No 10, October 2004, p.6. Online at http://www.samag.com/documents/s=9338/sam0410a/0410a.htm
cwRsync by ITeF!x: http://www.itefix.no/cwrsync

Updated 19 Mar 2008 tbr.