Notes on backups for Linux home users
-------------------------------------

by Mary Gardiner

These notes are available under the Creative Commons Attribution 2.5 Australia licence: http://creativecommons.org/licenses/by/2.5/au/

These notes are based on a talk I gave at the Sydney Linux Users Group on Nov 28 2008. The notes and slides are available at http://users.puzzling.org/users/mary/Presentations/SLUG2008/

A note about style: this is a set of recommendations based purely on the fact that I have both backed my home data up AND recovered it. Having some working backup regime is better than having none. I don't claim this is the One Best Way, merely One of the Adequate Ways That Isn't Entirely Maddening.

The talk was on backups for home users. It doesn't cover mission-critical or business-grade backups.

--- The magical 10 second version ---

If you don't have backups, you should. Here's how:

1. Go out, right now, and buy an external hard drive as big as, or bigger than, your main hard drive. Yep, there is no free lunch.

2. Install the program called rdiff-backup.

3. Plug in the drive and run:

    sudo rdiff-backup --exclude-other-filesystems ::/ ::/media/disk

   (/media/disk being the place your external drive is mounted under Ubuntu; substitute as needed)

4. Run that as often as you can. (Every so often, run "sudo rdiff-backup --force --remove-older-than 60D ::/media/disk" or similar to delete very old backups.)

See http://jwz.livejournal.com/801607.html for something similar (and the partial inspiration for the talk), although that approach uses rsync, which doesn't keep older versions of your data, and I definitely recommend keeping them.

--- More about rdiff-backup ---

Do check out the webpage and "man rdiff-backup" for full details: http://www.nongnu.org/rdiff-backup/

In summary, it's 'reverse' increments, if you will. That is, you can get the most recent backup just by looking at the filesystem under /media/disk.
Older versions are reconstructed by rdiff-backup applying older and older changes incrementally to the files. You restore one like so:

    sudo rdiff-backup -r 1D /media/disk/path-to-file [destination you'd like to restore to]

--- Why you need backups ---

You may not want to protect against all of these things: some of them are expensive or time-consuming to protect against. But consider these risks when deciding on your backup regime.

1. Accidental deletion: very common. An on-site backup is good enough to recover from this.

2. Media failure (dead hard drive). This will happen to you, sooner or later. You may or may not get any warning. An on-site backup *on a different disk* is good enough to recover from this. Not a different partition, a different *disk*. This is also the only risk that RAID helps with, provided that (a) you have a full mirror on the other disk(s) in the array and (b) you don't stuff it up somehow and set the new empty disk as the master. RAID is not a substitute for backups. *Not* a substitute for backups. It won't help with 1, 3, 4 or 5.

3. Software failure. Say some bit of code, from the drive firmware to the filesystem to the end user software (eg GIMP), has a bug in it and writes out your data incorrectly. In most cases this is rather like accidental deletion, but if the bug is very low level (kernel) it may affect the backup too. At the very least, make sure your backup drive is not the same manufacturer and model as your main drive. This makes them less likely to share the same bugs and less likely to fail at the same time.

4. Provider failure. You have uploaded your valuable data to Flickr, LiveJournal, WordPress.com etc etc. They go bust, and their creditors swoop in, turn their machines off and sell them for scrap parts. This really happens, see http://blogs.zdnet.com/digitalcameras/?p=362 for an example. Smaller examples are the occasional data losses that a lot of web services, up to and including those run by Google, suffer.

5. Massive local failure.
Flood, fire, surge: we had victims of two of these at the meeting. And someone who had had all their computer equipment stolen in one go. To recover from these you need an(other) backup, as far away from your main data store as you can get it. At least in a different suburb. Another country is entirely possible these days, if you have broadband.

--- Media recommendations ---

For home users, get another hard drive and back up to that.

Optical media: no. CDs and DVDs are too small for most people now. You will have to insert at least 5 of the things for a full backup cycle. That's boring and dull, so you'll never do it. Also, they have a short-ish lifespan and testing their backup goodness is even *more* boring and dull, so you definitely won't ever do that.

Solid state media: no. Consumer grade SSDs are really really unreliable right now. You need to back them up, not use them *for* your backups! See http://valhenson.livejournal.com/25228.html for an extended take on SSD maturity.

When either your main drive or your backup drive fails GO AND BUY ANOTHER ONE RIGHT AWAY.

--- Testing backups ---

I tend to do this by needing to use them about once every three weeks (accidental deletion). Otherwise, at least fsck them every so often for basic integrity checks.

--- Emergency recovery tools ---

You shouldn't need these with good enough backups, but just in case...

=== Accidental deletion ===

STOP WRITING TO THE DRIVE RIGHT AWAY. Either re-mount it read-only, or take an image of it if you have room:

    dd if=/dev/drive of=/mnt/otherdrive/drive.image

Try photorec and foremost to find files on it: http://www.cgsecurity.org/wiki/PhotoRec and http://foremost.sourceforge.net/ (both installable from package repositories).

photorec is good for more than photos: deleted documents and video are often found too. (It's called photorec because it was originally designed to get files back when deleted from digicam memory cards.)
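On the "if you have room" part of imaging a drive: here is a minimal sketch of checking for room before running dd. The helper name image_size_ok is made up for illustration, and /dev/drive and /mnt/otherdrive are the same placeholders as in the dd command above. It only relies on df and awk, and it is a rough guard rather than a guarantee.

```shell
#!/bin/sh
# Sketch: check that a destination directory has room for a drive image.
# image_size_ok is a hypothetical helper name, not part of dd or any package.

# Succeeds if the filesystem holding directory $2 has at least $1 bytes free.
image_size_ok() {
    needed_bytes=$1
    dest_dir=$2
    # df -P gives one POSIX-format line per filesystem; -k reports KB.
    # Field 4 of the second line is the available space.
    free_kb=$(df -Pk "$dest_dir" | awk 'NR == 2 { print $4 }')
    # Round the required size up to whole KB.
    needed_kb=$(( (needed_bytes + 1023) / 1024 ))
    [ "$free_kb" -ge "$needed_kb" ]
}

# Possible use (left as comments since it needs root and a real device):
#   size=$(sudo blockdev --getsize64 /dev/drive)
#   image_size_ok "$size" /mnt/otherdrive &&
#       sudo dd if=/dev/drive of=/mnt/otherdrive/drive.image
```

If the check fails, re-mounting the drive read-only (mount -o remount,ro on its mount point) at least stops further writes while you find a bigger disk.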
=== Media failure ===

If it's only a partial failure (as in, you're still reading/writing sort of OK, just with increasing numbers of failures) STOP WRITING TO THE DRIVE RIGHT AWAY. FSCKS INCLUDED (that's the "errors have been found on your drive, fix y/n?" thing. If these keep happening over and over, stop trying to fix them, turn your machine off, get the disk out and get your data off!)

You need to take an image of it onto another, bigger, drive. ddrescue is good for this because it is specifically designed for imaging damaged drives. Confusingly the installable *package* is called gddrescue, but the command line tool is ddrescue:

    ddrescue /dev/drive /mnt/otherdrive/drive.image

then you can fsck the image (since it's now on a good drive):

    fsck /mnt/otherdrive/drive.image

then mount the image and see what you can dig out of it:

    mount -o loop /mnt/otherdrive/drive.image /mnt/mountpoint

--- Remote backups ---

In addition to your spare hard drive, consider remote backup. There are several forms:

1. Sneaker-net. Buy *another* external hard drive, bring it home once a week or so, back up to it, take it to your work and store it there (or at someone else's house or in a safety deposit box or whatever). This is a nuisance, but for large amounts of data it is the cheapest option.

2. S3. Amazon's S3 storage is US10c per GB to put data in, US15c per GB per month to keep data, and US17c per GB to get data out. This starts to add up when you're talking hundreds of GB (for 200GB, storage alone is about US$30 a month). http://aws.amazon.com/s3/ There are lots of tools for putting data in S3, see http://jeremy.zawodny.com/blog/archives/007641.html I've heard good things about Duplicity http://www.nongnu.org/duplicity/ here, but if you use one of the tools that presents S3 as a filesystem you could even use rdiff-backup.

3. Dreamhost personal backup.
See the CAUTION below, but the webhost Dreamhost now has 50GB specifically for personal backups (10c per GB after 50): http://wiki.dreamhost.com/V10.08_August_2008 http://wiki.dreamhost.com/Personal_Backup

CAUTION: There are lots of webhosts that advertise hundreds of GBs, or terabytes, of disk space. DO NOT USE THESE FOR BACKUPS. They (almost?) all have a term of use that says that you cannot use them for non-public data such as backups: you're supposed to use them for web accessible data only. People have had their backups on these services deleted without notice.

CAUTION 2: The remote backup space is more than a little crowded right now. Keep an eye on your provider, some are undoubtedly headed for failure.

CAUTION 3: Essentially all commercial remote backup providers require that you not violate copyright by putting data on their servers that you do not have the rights to copy.

CAUTION 4: You might be tempted to encrypt your remote backups. If so, do think about where you are going to keep the key so that you still have it after your flood, fire and surge!

--- Data you store elsewhere ---

Try and get hold of your (g|e)mail, your calendars, your address book, your photos, your blog, your social network ... all your meta-data. Some of this is quite hard or impossible to get hold of: eg LiveJournal doesn't allow exports of comments on your journal, Facebook doesn't allow much out at all. Dump them to your local hard drive and they'll be picked up by the usual backups.

--- Fancy stuff ---

The surest way to back up is automatically! Consider using cron to back up. You can also try udev for backups whenever a particular drive is plugged in (see http://www.cafuego.net/2007/11/11/time-machine-kinda for ideas on this), and it's apparently possible to get Network Manager to automatically run backups when you connect to a particular network, which is useful for laptops.
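As a rough sketch of the cron idea: a small wrapper script that only runs rdiff-backup when the backup drive is really mounted, so that a missing drive doesn't turn into a "backup" onto your main disk. The script path (/usr/local/sbin/home-backup), the BACKUP_DEST variable, the --run argument and the 2:30am schedule are all assumptions for illustration. It also assumes plain local paths for rdiff-backup and that the mountpoint tool (from util-linux) is installed.

```shell
#!/bin/sh
# Sketch of a backup wrapper meant to be run from root's crontab.
# BACKUP_DEST and the --run argument are illustrative choices, not
# anything rdiff-backup itself defines.

BACKUP_DEST=${BACKUP_DEST:-/media/disk}

# Back up / to the directory given as $1, but only if a drive is
# actually mounted there.
run_backup() {
    dest=$1
    if ! mountpoint -q "$dest"; then
        echo "backup drive not mounted at $dest, skipping" >&2
        return 1
    fi
    rdiff-backup --exclude-other-filesystems / "$dest"
}

# Only do anything when called with --run, as cron would call it.
if [ "${1:-}" = "--run" ]; then
    run_backup "$BACKUP_DEST"
fi

# Example line for root's crontab (edit with: sudo crontab -e),
# running the script at 2:30am every day:
#   30 2 * * * /usr/local/sbin/home-backup --run
```

cron mails any output to the crontab's owner if local mail is set up, so the "not mounted" message doubles as a reminder to plug the drive in.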
If you're backing up databases (eg, MySQL for your blog or whatever) make sure to dump them before backing up: copies of a live database's files seldom restore well, so you need dumps.

--- My own backups ---

Linode server: backed up nightly to a machine in my house via rdiff-backup. That backup is in turn rsync-ed to a separate disk.

Home server (contains mail, photos, music and financial records): there are two disks in it, one is an rdiff-backup of the other. I want to implement a remote backup of everything except the music, but it's still a lot of data.

Laptop: same as the Linode, but I trigger the backup manually. (If you do this, try and reduce it to a single button press so you'll do it fairly often.)

--- Misc info ---

Some people asked whether there is anything going on in the space of pressuring web services to provide full backup (or general export) solutions: see http://autonomo.us/ and http://www.dataportability.org/ for some movement in this space.