Incrementally backing up Cassandra with Amanda

Warning: this has not been tested yet.

Again, TL;DR version at the end.

They say that backing up C* is really easy: you just run nodetool snapshot, which merely creates a hardlink for each data file somewhere else in the filesystem, and then you back up those hardlinks. Optionally, when you're done, you simply remove them and that's it.

But that's only half of the story. The other half is taking those snapshots and storing them somewhere else; let's say, a backup server, so you can restore the data even in case of spontaneous combustion followed by explosion due to short circuits caused by your dog peeing on the machine. Not that that happens a lot in a datacenter, but one has to plan for any contingency, right?

In our case we use Amanda, which internally uses its own implementation of tar, or GNU tar if asked for (yes, other tools too if asked). The problems begin with how you define what to back up and where C* puts those snapshots. The definitions are done by what Amanda calls disklists, which are basically lists of directories to back up entirely. On the other hand, for a column family Bar in a keyspace Foo, whose data is normally stored in <data_file_directory>/Foo/Bar/, a snapshot is located in <data_file_directory>/Foo/Bar/snapshots/<something>, where something can be a timestamp or a name defined by the user at snapshot time.
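For concreteness, a disklist is just one "host directory dumptype" entry per line; a hypothetical entry (the hostname, path and dumptype below are made-up examples, not taken from our setup) could look like this:

```
# disklist: one "host directory dumptype" entry per line
cass01.example.com  /var/lib/cassandra/backup-staging  comp-user-tar
```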

If you want to simplify your backup configuration, you'll probably want to specify <data_file_directory>/*/*/snapshots/ as the dirs to back up, but Amanda merrily can't expand wildcards in disklists. One way around this is to create a sibling directory of <data_file_directory>, move the snapshot files there, and specify that in the disklists. That kinda works...

... until your second backup pass comes and you find out that, even though you specified an incremental backup, it copies all the snapshot files over again. This is because when a hardlink is created, the ctime of the inode changes. Guess what tar uses to see if a file has changed... yes, ctime and mtime¹.
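The ctime behaviour is easy to verify with a small sketch (Python standard library only; the SSTable file name below is just an illustrative example of the old <KS>-<CF>-... naming):

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    sstable = os.path.join(tmp, "Foo-Bar-hc-1-Data.db")
    with open(sstable, "w") as f:
        f.write("immutable SSTable contents")

    before = os.stat(sstable)
    time.sleep(1.1)  # coarse enough even for filesystems with 1-second timestamps
    # This is what nodetool snapshot does per data file:
    os.link(sstable, os.path.join(tmp, "snapshot-link.db"))
    after = os.stat(sstable)

    # Hardlinking bumped the inode's ctime but left mtime alone,
    # which is exactly what fools tar's incremental logic.
    assert after.st_ctime > before.st_ctime
    assert after.st_mtime == before.st_mtime
```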

So we're back to square one, or even zero. It seems like the only solution is to use C*'s native 'support' for incrementality, but the docs are just a couple of paragraphs that barely explain how these backups are done (surprise: the same way as the snapshots) and how to activate them, which is the reason why we didn't follow this path from the beginning. So in the end, it seems that you can't use Amanda or tar to make incremental backups, even with the native support.

But then there's a difference between the snapshot and the incremental mode: with the snapshot method, you create the snapshot just before backing it up, which sets all the ctimes to now. C*'s incremental mode "hard-links each flushed SSTable to a backups directory under the keyspace data directory", so the ctimes are roughly the same as the mtimes, and, since SSTables are immutable, neither ever changes again (until we do a snapshot, of course).

One peculiarity I noticed is that only new SSTables are backed up, but not those that result from compactions. At first I thought this was wrong, but after discussing the issue with driftx in the IRC channel and getting confirmation from Tyler Hobbs on the mailing list, we came to the following conclusion: if compacted SSTables were also backed up, at restore time you would need a manual compaction to minimize data duplication, which otherwise means more SSTables for the Bloom filters to check, more disk reads/seeks per get, and more space used; but if you don't backup/restore those SSTables, the manual compaction is only advisable. Also, as a consequence, you don't need to track which files were deleted between backups.

So the remaining problem is knowing which files have already been backed up, because C* backups, just like snapshots, are not automatically cleaned. I came up with the following solution, which might seem complicated at first, but really isn't.

When we do a snapshot, which is perfect for full backups, we first remove all the files present in the backup directory; the incremental files accumulated since the last backup are not needed because we're doing a full anyway. At the end of this we have the files ready for the full; we do the backup, and then we erase the files.
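A minimal sketch of that pre-full cleanup step, assuming a flat staging directory of symlinks plus the per-CF backups/ directories (all the paths and names below are made up for illustration):

```python
import os
import tempfile

def clear_before_full(staging_dir, backups_dirs):
    """Before a full backup: drop the old symlinks from the staging
    directory, and the now-redundant incremental files from each of
    C*'s backups/ directories (the new snapshot covers their data)."""
    for name in os.listdir(staging_dir):
        os.remove(os.path.join(staging_dir, name))
    for backups in backups_dirs:
        for name in os.listdir(backups):
            os.remove(os.path.join(backups, name))

# Tiny self-check with throwaway directories:
with tempfile.TemporaryDirectory() as tmp:
    staging = os.path.join(tmp, "backup-staging")
    backups = os.path.join(tmp, "Foo", "Bar", "backups")
    os.makedirs(staging)
    os.makedirs(backups)
    open(os.path.join(backups, "Foo-Bar-hc-1-Data.db"), "w").close()
    os.symlink(os.path.join(backups, "Foo-Bar-hc-1-Data.db"),
               os.path.join(staging, "Foo-Bar-hc-1-Data.db"))
    clear_before_full(staging, [backups])
    leftovers = os.listdir(staging) + os.listdir(backups)
```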

Then on the following days we just add the incremental files generated so far, preceded by a flush, so as to have the latest data in the SSTables and not depend on CommitLogs. As they're only the diff against the files in the full, and don't include the intermediate compacted SSTables, they're as big as they should be (but also as small as they could be, if you're worried about disk usage). Furthermore, the way we put files in the backup dir is via symlinks, which doesn't change the file's mtime or ctime, and we configure Amanda to dereference symlinks.
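A sketch of the symlink staging, with a small self-check that the target's timestamps survive (again, all paths are illustrative, and this assumes Amanda's dumptype is set to dereference symlinks):

```python
import os
import tempfile

def stage_incrementals(backups_dir, staging_dir):
    """Symlink every incremental file into the staging directory that
    Amanda backs up. Creating a symlink doesn't touch the target's
    mtime or ctime, so tar's incremental detection keeps working."""
    os.makedirs(staging_dir, exist_ok=True)
    for name in os.listdir(backups_dir):
        link = os.path.join(staging_dir, name)
        if not os.path.lexists(link):  # skip files staged on a previous day
            os.symlink(os.path.join(backups_dir, name), link)

# Self-check with throwaway directories:
with tempfile.TemporaryDirectory() as tmp:
    backups = os.path.join(tmp, "Foo", "Bar", "backups")
    staging = os.path.join(tmp, "backup-staging")
    os.makedirs(backups)
    target = os.path.join(backups, "Foo-Bar-hc-2-Data.db")
    open(target, "w").close()
    before = os.stat(target)
    stage_incrementals(backups, staging)
    after = os.stat(target)
    # os.stat follows the symlink, so this is the same inode:
    staged = os.stat(os.path.join(staging, "Foo-Bar-hc-2-Data.db"))
```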

Later, at restore time, the files are put back in the backup directory, and a script that takes the KS and CF from each file's name 'deals' them out to the right directories.
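A possible sketch of that 'dealing' script, assuming the old-style <Keyspace>-<ColumnFamily>-<version>-<generation>-Data.db file names (everything here is illustrative, not our actual script):

```python
import os
import shutil
import tempfile

def deal_restored_files(restore_dir, data_dir):
    """Move each restored SSTable into <data_dir>/<KS>/<CF>/,
    taking KS and CF from the first two dash-separated fields
    of the file name, e.g. Foo-Bar-hc-1-Data.db."""
    for name in os.listdir(restore_dir):
        ks, cf = name.split("-", 2)[:2]
        dest = os.path.join(data_dir, ks, cf)
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(restore_dir, name),
                    os.path.join(dest, name))

# Self-check with throwaway directories:
with tempfile.TemporaryDirectory() as tmp:
    restore = os.path.join(tmp, "restored")
    data = os.path.join(tmp, "data")
    os.makedirs(restore)
    open(os.path.join(restore, "Foo-Bar-hc-1-Data.db"), "w").close()
    deal_restored_files(restore, data)
    dealt = os.path.exists(os.path.join(data, "Foo", "Bar",
                                        "Foo-Bar-hc-1-Data.db"))
```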

TL;DR version

Full backup

  • Remove old incremental files and symlinks.
  • nodetool snapshot.
  • Symlink all the snapshot files to a backup directory.
  • Backup that directory dereferencing symlinks.
  • nodetool clearsnapshot and remove symlinks.

Incremental backup

  • nodetool flush.
  • Symlink all incremental files into the backup directory.
  • Backup that directory dereferencing symlinks.

Restore²

  • Restore the last full backup and all the incrementals.

  1. tar's docs are not clear about what exactly it uses ("Incremental dumps depend crucially on time stamps"), but Amanda's seem to imply such a thing ("Tar has the ability to preserve the access times[;] however, doing so effectively disables incremental backups since resetting the access time alters the inode change time, which in turn causes the file to look like it needs to be archived again.")

  2. Actually, it's not that simple. The previous post in this series already shows how it could get more complicated.