Incrementally backing up Cassandra with Amanda

Marcos Dione — Wed, 12 Sep 2012 19:19:23 GMT

Warning: this has not been tested yet.

Again, TL;DR version at the end.

They say that backing up in C* really easy: you just run nodetool snapshot, which only creates a hardlink for each data file somewhere else in the filesystem, and then you just backup those hardlinks. Optionally, when you're done, you simply remove them and that's it.

But that's only the half of the story. The other half is taking those snapshots and storing them somehwere else; let's say, a backup server, so you can restore the data even in case of spontaneous combustion followed by explosion due to shortcircuits caused by your dog peeing on the machine. Not that that happens a lot in a datacenter, but one has to plan for any contingency, right?

In our case we use Amanda, which internally uses an implementation of tar or GNU tar if asked for (yes, also other tools if asked). The problems begin with how you define what to backup and where does C* put those snapshots. The definitions are done by what Amanda calls disklists, which are basically a list of directories to backup entirely. In the other hand, for a column family Bar in a keyspace Foo, whose data are normally stored in <data_file_directory>/Foo/Bar/, a snapshot is located in <data_file_directory>/Foo/Bar/snapshots/<something>, where something can be a timestamp or a name defined by the user at snapshot time.

If you want to simplify your backup configuration, you'll probably will want to say <data_file_directory>/*/*/snapshots/ as the dirs to backup, but Amanda merrily can't expand wildcards in disklists. A way to solve this is to create a directory sibling of <data_file_directory>, move the files in the snapshots there, and specify it in the disklists. That kinda works...

... until your second backup pass comes and you find out that even when you specified an incremental backup, it copies over all the snapshot files again. This is because when a hardlink is created, the ctime of the inode is changed. Guess what tar uses to see if a file has changed... yes, ctime and mtime¹.

So we're back to square one, or zero even. Seems like the only solution is to use C*'s native 'support' for incrementality, but the docs are just a couple of paragraphs that barely explain how they're done (suprise, the same way as the snapshots) and how to activate it, which is the reason why we didn't followed this path from the beginning. So in the end, it seems that you can't use Amanda or tar to make incremental backups, even with the native support.

But then there's a difference between the snapshot and the incremental mode: with the snapshot method, you create the snapshot just before backing it up, which sets all the ctimes to now. C*'s incremental mode "hard-links each flushed SSTable to a backups directory under the keyspace data directory", so they have roughly the same ctime as the mtimes, and neither never ever changes (remember, SSTables are inmutable) again (until we do a snapshot, of course).

One particularity that I noticed is that only new SSTables are backed up, but not those that are the result of compactions. At the beginning I thought this was wrong, but after discussing the issue with driftx in the IRC channel and a confirmation by Tyler Hobbs in the mailing list, we came to the following conclussion: with also compacted SSTables, at restore time you would need to do a manual compaction to minimize data duplication, which otherwise means more SStables associated by the Bloom filters and more disk reads/seeks per get and more space used; but if you don't backup/restore those SStables, the manual compaction is only advisable. Also, as a consequence, you don't need to track which files were deleted between backups.

So the remaining problem is to know which files have been backed up, because C* backups, just like snapshots, are not automatically cleaned. I came up with the following solution, which at the beginning it might seem complicated, but it really isn't.

When we do a snapshot, which is perfect for full backups, we previously remove all the files present in the backup directory; incremental files since the last incremental backup are not needed because we're doing a full anyways. At the end of this we have the files ready for the full; we do the backup, and we erase the files.

Then the following days we just add the dynamic backups so far, preceded by a flush, so as to have the last data in the SSTables and not depend on CommitLogs. As they're only the diff against the files in the full, and not the intermediate compacted SSTables, they're as big as they should (but also as small as they could, if you're worried about disk ussage). Furthermore, the way we put files in the backup dir is via symlinks, so it doesn't change the file's mtime or ctime, and we configure Amanda to dereference symlinks.

Later, at restore time, the files are put in the backup directory, and with a script that takes the KS and CF from the file's name, they're 'dealed' to the right directories.

TL;DR version

Full backup

Remove old incremental files and symlinks.
nodetool snapshot.
Symlink all the snapshot files to a backup directory
Backup that directory dereferencing symlinks.
nodetool clearsnapshot and remove symlinks.

Incremental backup

nodetool flush.
Symlink all incremental files into the bakup directory.
Backup that directory dereferencing symlinks.

Restore²

Restore the last full backup and all the incrementals.

tar's docs are not clear in what exactly it uses, ("Incremental dumps depend crucially on time stamps"), but Amanda's seems to imply such a thing ("Tar has the ability to preserve the access times[;] however, doing so effectively disables incremental backups since resetting the access time alters the inode change time, which in turn causes the file to look like it needs to be archived again.") ↩
Actually is not that simple. The previous post in this series already shows how it could get more complicated. ↩

Cassandra counters are not atomic

Marcos Dione — Fri, 24 Aug 2012 10:28:30 GMT

Today I arrived at work and I was shoved to a scrambled war room. Inside, we the two sysadmins working with C*, our inmediate boss, the DBA, the PHP developer involved in this first migration proyect (from MySQL, this is important) and the project leader replacing the one on vacations. Yesterday before I left I saw a similar but more informal gathering around the other sysadmin who's testing the migration in our preproduction environment. As I was busy in the other corner of the office with my backup tasks (I'm still strugling with a lots of constraints, but I think I finally tackled it down, as in down to the floor. I hope it will just stay still for production... but I digress), so I was unaware the reason of the meeting, except for the title in the mail: "Problem with the counter".

If you followed this history closely, you have all the clues to know where I'm going¹. We're replacing the smallest of our databases, the avatar database, which might be small (~50GiB of data), but is has a great impact, because not only a lot of our pages use it, but also our customers and/or partners. Each image has an unique ID implemented with a MySQL auto increment key. Because of the impact, we couldn't simply go for UUIDs.

Now, we knew that even when C* implements counter columns since v0.8, there is no guarantee that the counter changes are atomic from a cluster point of view. What really surprised us is that this no-guarantee also holds for only one node. In other words, simultaneous changes to a counter (incremens by 1, in our case) are not atomic even when they're done in the same node.

To make sure, I put on my hazmat suit³ and plunged head first, shit shovel in hand (at the end they were not needed, the code is very readable), into C*'s code. After some maneuvering (it's not a straight route, there was some back-and-forth) I got the piece of code that I was looking for. Basically it says that if there are no indexes to update and no deletes, the update is done concurrently without any locks. And you cannot index counters, of course, as it makes no sense.

Clearly the subject of atomic counters is not something that C* plans to fix any time soon², and given the difficulty of it, I can understand that desition. But I would expect atomicity at node level, so if one desires so, one can shoot his own foot "implementing" atomic counters just writing the updates in only one node (and then some medium/heavy infra to implement HA).

One more guy, 90 minutes, 5 possible solutions (including snowflake) later, we desided to temporarily still use MySQL for the counter (remember, we're going online next week) and look closer to more permanent solutions later (this other guy, a sysadmin in another group, has been on-and-off fighting to even make snowflake start in his own machine), including of course, the long term goal of getting rid of this kind of IDs.

Well, if the title didn't gave it away from the beggining. ↩
If you feel curious, start here. ↩
3 years in France has made me somewhat careful... Dammit! ↩