Warning: this has not been tested yet.

Again, TL;DR version at the end.

They say that backing up in C* is really easy: you just run nodetool snapshot, which merely creates a hardlink for each data file somewhere else in the filesystem, and then you just back up those hardlinks. Optionally, when you're done, you simply remove them and that's it.

But that's only half of the story. The other half is taking those snapshots and storing them somewhere else; let's say, a backup server, so you can restore the data even in case of spontaneous combustion followed by explosion due to short circuits caused by your dog peeing on the machine. Not that that happens a lot in a datacenter, but one has to plan for any contingency, right?

In our case we use Amanda, which internally uses an implementation of tar, or GNU tar if asked for (yes, also other tools if asked). The problems begin with how you define what to back up and where C* puts those snapshots. The definitions are done with what Amanda calls disklists, which are basically lists of directories to back up entirely. On the other hand, for a column family Bar in a keyspace Foo, whose data is normally stored in <data_file_directory>/Foo/Bar/, a snapshot is located in <data_file_directory>/Foo/Bar/snapshots/<something>, where something can be a timestamp or a name defined by the user at snapshot time.
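To make the layout concrete, for one SSTable of that Foo/Bar column family (the file name and generation number are made up, following the naming of the C* version we run) you'd have something like:

<data_file_directory>/Foo/Bar/Foo-Bar-hd-42-Data.db
<data_file_directory>/Foo/Bar/snapshots/<something>/Foo-Bar-hd-42-Data.db    # hardlink to the file above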

If you want to simplify your backup configuration, you'll probably want to say <data_file_directory>/*/*/snapshots/ as the dirs to back up, but Amanda merrily can't expand wildcards in disklists. A way to solve this is to create a directory alongside <data_file_directory>, move the snapshot files there, and specify that in the disklists. That kinda works...

... until your second backup pass comes and you find out that even though you asked for an incremental backup, it copies over all the snapshot files again. This is because when a hardlink is created, the ctime of the inode is changed. Guess what tar uses to see if a file has changed... yes, ctime and mtime[1].
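You can check this yourself; a minimal sketch, where the file name is made up and GNU stat is assumed:

stat -c '%z  %n' Foo-Bar-hd-1-Data.db              # %z prints the inode's ctime
ln Foo-Bar-hd-1-Data.db /tmp/Foo-Bar-hd-1-Data.db
stat -c '%z  %n' Foo-Bar-hd-1-Data.db              # ctime now reflects the moment of the ln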

So we're back to square one, or zero even. It seems the only solution is to use C*'s native 'support' for incrementality, but the docs are just a couple of paragraphs that barely explain how incremental backups are done (surprise: the same way as the snapshots) and how to activate them, which is the reason why we didn't follow this path from the beginning. So in the end, it seems that you can't use Amanda or tar to make incremental backups, even with the native support.

But then there's a difference between the snapshot and the incremental mode: with the snapshot method, you create the snapshot just before backing it up, which sets all the ctimes to now. C*'s incremental mode "hard-links each flushed SSTable to a backups directory under the keyspace data directory", so they have roughly the same ctime as their mtimes, and neither ever changes again (remember, SSTables are immutable), until we do a snapshot, of course.
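For reference, activating the incremental mode boils down to a single flag in cassandra.yaml (followed, as far as I can tell, by a node restart):

incremental_backups: true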

One particularity I noticed is that only new SSTables are backed up, not those that are the result of compactions. At the beginning I thought this was wrong, but after discussing the issue with driftx on the IRC channel and a confirmation by Tyler Hobbs on the mailing list, we came to the following conclusion: if compacted SSTables were also backed up, at restore time you would need to do a manual compaction to minimize data duplication, which otherwise means more SSTables referenced by the Bloom filters, more disk reads/seeks per get and more space used; but if you don't back up/restore those SSTables, the manual compaction is only advisable. Also, as a consequence, you don't need to track which files were deleted between backups.

So the remaining problem is knowing which files have been backed up, because C* backups, just like snapshots, are not automatically cleaned. I came up with the following solution, which at first might seem complicated, but it really isn't.

When we do a snapshot, which is perfect for full backups, we first remove all the files present in the backup directory; the incremental files accumulated since the last backup are not needed because we're doing a full anyway. At the end of this we have the files ready for the full; we do the backup, and we erase the files.
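A rough sketch of that full pass, assuming <data_file_directory> is /var/lib/cassandra/data and the staging directory Amanda knows about is /var/lib/cassandra/amanda (both paths are made up; adjust to your layout):

# drop the incremental files accumulated since the last backup, plus stale symlinks
rm -f /var/lib/cassandra/data/*/*/backups/*
rm -f /var/lib/cassandra/amanda/*
# take the snapshot and expose its files to Amanda via symlinks
nodetool snapshot
for f in /var/lib/cassandra/data/*/*/snapshots/*/*; do
    ln -s "$f" /var/lib/cassandra/amanda/
done
# ... Amanda runs the full backup here, dereferencing the symlinks ...
nodetool clearsnapshot
rm -f /var/lib/cassandra/amanda/*

File names don't collide in the staging dir because each SSTable's name already includes the keyspace and column family.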

Then on the following days we just add the incremental files generated so far, preceded by a flush, so as to have the latest data in the SSTables and not depend on CommitLogs. As they're only the diff against the files in the full, and don't include the intermediate compacted SSTables, they're as big as they should be (but also as small as they could be, if you're worried about disk usage). Furthermore, the way we put files in the backup dir is via symlinks, which doesn't change the file's mtime or ctime, and we configure Amanda to dereference symlinks.
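The incremental pass, with the same made-up paths as above, is even shorter:

# flush memtables so the latest data lands in (incremental) SSTables
nodetool flush
# expose every incremental file to Amanda; -f because older ones may already be linked
for f in /var/lib/cassandra/data/*/*/backups/*; do
    ln -sf "$f" /var/lib/cassandra/amanda/
done
# ... Amanda runs the incremental backup here, dereferencing the symlinks ...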

Later, at restore time, the files are put back in the backup directory, and a script that takes the KS and CF from each file's name 'deals' them to the right directories.
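That dealing script can be pretty dumb; a sketch, assuming SSTable names of the form <keyspace>-<columnfamily>-hd-<generation>-<Component>.db (as in the version we're running), KS/CF names without dashes, and the same made-up paths as above:

# deal the restored files back into <data_file_directory>/<KS>/<CF>/
for f in /var/lib/cassandra/amanda/*; do
    name=$(basename "$f")
    ks=$(echo "$name" | cut -d- -f1)
    cf=$(echo "$name" | cut -d- -f2)
    mv "$f" "/var/lib/cassandra/data/$ks/$cf/"
done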

TL;DR version:

  • Full backup:

    • Remove old incremental files and symlinks.
    • nodetool snapshot.
    • Symlink all the snapshot files to a backup directory.
    • Backup that directory dereferencing symlinks.
    • nodetool clearsnapshot and remove symlinks.
  • Incremental backup:

    • nodetool flush.
    • Symlink all incremental files into the backup directory.
    • Backup that directory dereferencing symlinks.
  • Restore[2]:

    • Restore the last full backup and all the incrementals.

[1] tar's docs are not clear on what exactly it uses ("Incremental dumps depend crucially on time stamps"), but Amanda's seem to imply such a thing ("Tar has the ability to preserve the access times[;] however, doing so effectively disables incremental backups since resetting the access time alters the inode change time, which in turn causes the file to look like it needs to be archived again.")

[2] Actually it's not that simple. The previous post in this series already shows how it could get more complicated.


sysadmin cassandra

Posted Wed 12 Sep 2012 08:17:33 PM CEST Tags: cassandra

Today I arrived at work and was shoved into a hastily scrambled war room. Inside were the two of us sysadmins working with C*, our immediate boss, the DBA, the PHP developer involved in this first migration project (from MySQL, this is important) and the project leader replacing the one on vacation. Yesterday before I left I saw a similar but more informal gathering around the other sysadmin, who's testing the migration in our preproduction environment. I was busy in the other corner of the office with my backup tasks (I'm still struggling with a lot of constraints, but I think I finally tackled it down, as in down to the floor. I hope it will just stay still for production... but I digress), so I was unaware of the reason for the meeting, except for the title in the mail: "Problem with the counter".

If you've followed this story closely, you have all the clues to know where I'm going[1]. We're replacing the smallest of our databases, the avatar database, which might be small (~50GiB of data) but has a great impact, because not only do a lot of our pages use it, but also our customers and/or partners. Each image has a unique ID implemented with a MySQL auto-increment key. Because of the impact, we couldn't simply go for UUIDs.

Now, we knew that even though C* has implemented counter columns since v0.8, there is no guarantee that counter changes are atomic from a cluster point of view. What really surprised us is that this no-guarantee also holds for a single node. In other words, simultaneous changes to a counter (increments by 1, in our case) are not atomic even when they're done on the same node.

To make sure, I put on my hazmat suit[3] and plunged head first, shit shovel in hand (in the end they were not needed, the code is very readable), into C*'s code. After some maneuvering (it's not a straight route, there was some back-and-forth) I found the piece of code I was looking for. Basically it says that if there are no indexes to update and no deletes, the update is done concurrently, without any locks. And you cannot index counters, of course, as it makes no sense.

Clearly atomic counters are not something that C* plans to fix any time soon[2], and given the difficulty of it, I can understand that decision. But I would expect atomicity at the node level, so that, if one so desires, one can shoot one's own foot by "implementing" atomic counters, writing the updates to only one node (and then some medium/heavy infra to implement HA).

One more guy, 90 minutes and 5 possible solutions (including snowflake) later, we decided to temporarily keep using MySQL for the counter (remember, we're going online next week) and look more closely at permanent solutions later (that other guy, a sysadmin in another group, has been on-and-off fighting just to make snowflake start on his own machine), including, of course, the long-term goal of getting rid of this kind of IDs.


[1] Well, if the title didn't give it away from the beginning.

[2] If you feel curious, start here.

[3] Three years in France have made me somewhat careful... Dammit!


cassandra

Posted Fri 24 Aug 2012 12:27:23 AM CEST Tags: cassandra

Still playing with Cassandra, we set up a cluster of 5 nodes to test backing up and restoring. Datastax's docs only take into account a simple case: the one where you only have to replace a node that's failing or whose files were corrupted. In this case the restore is quite straightforward: take the node out of the cluster, delete the commit logs, restore the data you have and re-add the node to the cluster, with an optional repair afterwards.
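In commands, that simple case looks more or less like this (the paths and init script are made up, and I'm hand-waving over how the backed-up data actually gets back onto the machine):

# on the node being restored, with the rest of the cluster still running
/etc/init.d/cassandra stop
rm -f /var/lib/cassandra/commitlog/*
# put the restored SSTables back under /var/lib/cassandra/data/<KS>/<CF>/ here
/etc/init.d/cassandra start
nodetool repair        # optional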

In our particular case we not only contemplate that kind of scenario, but we also might need to roll back to a point in the past, which implies restoring the data on all the nodes. It's true that this is possible by repeating the above algorithm node by node[1], without the final repair. But it means that while you're restoring the CF or KS, your cluster is almost constantly one node short. This might mean nothing in big deployments, but our production cluster is a humble 4-node one, even smaller than the testing one! So keeping downtime as low as possible is highly needed.

So we set off to find a way to do it without stopping the nodes. Some people advised using nodetool refresh or sstableloader, but that seems to work only when restoring one node from scratch; that is, the same case as above. In our case, sstableloader was making no difference. I assume that's because it inserts the data with its original timestamps, so the data with newer timestamps still in the Mem/SSTables on the nodes takes precedence. That is, sstableloader doesn't seem to replace the data.

With nodetool refresh the same thing happens, but you still have the option of deleting the current SSTables after a nodetool flush. But that leads to a state where the node(s) where you've done this emit this error on any operation on the CF or KS:

java.io.IOError: java.io.FileNotFoundException: /var/opt/hosting/db/cassandra/data/one_cf/cf_1/one_cf-cf_1-hd-13-Data.db (No such file or directory)

It's not obvious from the example I show, but that's exactly one of the SSTables I just removed. That is, C* still tries to read SSTables that are no longer there, even after a nodetool refresh. Maybe this is a bug, but then that command's semantics are not clearly stated anywhere.

I found a simple workaround: as we're no longer interested in the data in its current state, I can simply drop the KS or CF and rebuild it afterwards with the data I get from the restore.

In the end, the procedure is like this:

  • Drop and recreate the CF or KS.
  • For all nodes, in parallel if possible:
    • Remove the snapshot created at drop time[2].
    • Restore the snapshot and move the data files to the right place.
    • nodetool refresh.
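Per node, that boils down to something like this (made-up paths, assuming <data_file_directory> is /var/lib/cassandra/data, for a keyspace Foo and a column family Bar):

# once the CF or KS has been dropped and re-created cluster-wide:
rm -rf /var/lib/cassandra/data/Foo/Bar/snapshots/*    # the auto-snapshot taken at DROP time
# copy the restored SSTables into /var/lib/cassandra/data/Foo/Bar/ here
nodetool refresh Foo Bar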

[1] Of course, the minimum number of nodes you need to restore depends on how easy it is to restore the data on the nodes and on your data's replication options and factors.

[2] I'm not sure this is documented anywhere.


sysadmin cassandra

Posted Wed 15 Aug 2012 10:21:57 AM CEST Tags: cassandra

One of my job's new developments is that we'll start using Cassandra as the database for some of our webservices. The move was decided mainly because of the lack of SPoF and the easy addition of columns, which happens rather often in our environment.

One of the tasks we're in charge of is monitoring the system. Most of the interesting values to monitor in a Cassandra setup can be obtained with various nodetool commands, but not the values from the JVM running the Cassandra instance. So I turned to my closest Java guru, who recommended writing a script in Groovy. After playing a little with the Java-like language, I got this:

import javax.management.ObjectName
import javax.management.remote.JMXConnector
import javax.management.remote.JMXConnectorFactory
import javax.management.remote.JMXServiceURL

// credentials for the JMX connection (replace user/pass with the node's actual ones)
jmxEnv = [(JMXConnector.CREDENTIALS): (String[])["user", "pass"]]

// connect to the node's JMX endpoint; 7199 is C*'s default JMX port
def serverUrl = 'service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi'
def server = JMXConnectorFactory.connect(new JMXServiceURL (serverUrl), jmxEnv).MBeanServerConnection

mBeanNames= [ 
    "java.lang:type=ClassLoading", 
    "java.lang:type=Compilation", 
    "java.lang:type=Memory", 
    "java.lang:type=Threading",

    "org.apache.cassandra.db:type=Caches",
    "org.apache.cassandra.db:type=Commitlog",
    "org.apache.cassandra.db:type=CompactionManager",
    "org.apache.cassandra.db:type=StorageProxy",
    "org.apache.cassandra.db:type=StorageService",

    "org.apache.cassandra.internal:type=AntiEntropyStage",
    "org.apache.cassandra.internal:type=FlushWriter",
    "org.apache.cassandra.internal:type=GossipStage",
    "org.apache.cassandra.internal:type=HintedHandoff",
    "org.apache.cassandra.internal:type=InternalResponseStage",
    "org.apache.cassandra.internal:type=MemtablePostFlusher",
    "org.apache.cassandra.internal:type=MigrationStage",
    "org.apache.cassandra.internal:type=MiscStage",
    "org.apache.cassandra.internal:type=StreamStage",

    "org.apache.cassandra.metrics:type=ClientRequestMetrics,name=ReadTimeouts",
    "org.apache.cassandra.metrics:type=ClientRequestMetrics,name=ReadUnavailables",
    "org.apache.cassandra.metrics:type=ClientRequestMetrics,name=WriteTimeouts",
    "org.apache.cassandra.metrics:type=ClientRequestMetrics,name=WriteUnavailables",

    "org.apache.cassandra.net:type=FailureDetector",
    "org.apache.cassandra.net:type=MessagingService",
    "org.apache.cassandra.net:type=StreamingService",


    "org.apache.cassandra.request:type=MutationStage",
    "org.apache.cassandra.request:type=ReadRepairStage",
    "org.apache.cassandra.request:type=ReadStage",
    "org.apache.cassandra.request:type=ReplicateOnWriteStage",
    "org.apache.cassandra.request:type=RequestResponseStage",
    ]

def dumpMBean= { name ->
    println "$name:"

    // get a proxy MBean for the class
    bean= new GroovyMBean (server, name)
    // get the attributes
    attrs= bean.listAttributeNames ()
    // get an AttributeList, after casting (!) the List<String> to String[]
    attrMap= server.getAttributes (bean.name(), (String [])attrs)

    attrMap.each { kv ->
        // kv is an Attribute
        key= kv.name
        // skip RangeKeySample, it can be 15MiB big or more...
        if (key!="RangeKeySample") {
            value= kv.value
            println "\t$key: $value"
        }
    }

    println ""
}

// dump singletons
mBeanNames.each { name ->
    dumpMBean (name)
}

// dump keyspaces and their column families
args.each { ks_cfs ->
    split= ks_cfs.tokenize ('=')
    ks= split[0]
    cfs= split[1].tokenize (',')

    cfs.each { cf ->
        dumpMBean ("org.apache.cassandra.db:type=ColumnFamilies,keyspace=$ks,columnfamily=$cf")
    }
}
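The script takes one argument per keyspace, each listing the column families whose MBeans to dump; the invocation looks something like this (the script's file name is made up):

groovy dump-cassandra-mbeans.groovy Foo=Bar,Baz Other=SomeCF > mbeans.txt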

In particular, we dump its output to a text file and process it afterwards to pick the values we want to monitor and graph. As we're not yet in production, we haven't settled on which values we're going to monitor.


sysadmin cassandra

Posted Tue 14 Aug 2012 10:23:41 AM CEST Tags: cassandra