Still playing with Cassandra, we set up a cluster of 5 nodes to test backing up and restoring. DataStax's documentation only takes into account a simple case: where you only have to replace a node that's failing or whose files were corrupted. In this case the restore is quite straightforward: take the node out of the cluster, delete the commit logs, restore the data you have and re-add the node to the cluster, with an optional repair afterwards.
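A minimal sketch of that single-node procedure, assuming a service-style init script, a default /var/lib/cassandra layout and hypothetical keyspace/CF names; DRY_RUN (on by default here) just prints each command instead of running it:

```shell
#!/bin/sh
# Restore a single failed node (sketch). Paths, service name and KS/CF
# names are assumptions, not from the DataStax docs verbatim.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$@" || "$@"; }

DATA_DIR=/var/lib/cassandra   # assumed data file directory
KS=my_ks; CF=my_cf            # hypothetical keyspace and column family

run service cassandra stop                               # take the node out of the cluster
run rm -rf "$DATA_DIR/commitlog"                         # delete the commit logs
run cp -a "/backup/$KS/$CF/." "$DATA_DIR/data/$KS/$CF/"  # restore the backed-up SSTables
run service cassandra start                              # re-add the node
run nodetool repair "$KS" "$CF"                          # optional repair afterwards
```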

In our case in particular, we not only contemplate this kind of failure, but we might also need to roll back to a point in the past, which implies restoring the data on all the nodes. It's true that this is possible by repeating the above procedure node by node[1], without the optional final repair. But it means that while you're restoring the CF or KS, your cluster is almost constantly one node short. This might mean nothing on big deployments, but our production cluster is a humble 4-node one, even smaller than the testing one! So having as little downtime as possible is highly needed.

So we set off to find a way to do it without stopping the nodes. Some people advised using nodetool refresh or sstableloader, but those seem to work only when restoring one node from scratch; that is, the same case as at the beginning. In our case, sstableloader was making no difference. I assume that's because it inserts the data with its original timestamps, so the data with newer timestamps still in the Mem/SSTables on the nodes takes precedence. That is, sstableloader does not seem to replace the data.
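For reference, the sstableloader attempt looked roughly like this (directory layout and contact hosts are assumptions; sstableloader expects a directory whose path ends in <keyspace>/<cf>, and -d names the initial contact points):

```shell
#!/bin/sh
# Stream restored SSTables into the live cluster (sketch; hypothetical
# paths and host names). DRY_RUN (on by default) only prints the command.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$@" || "$@"; }

run sstableloader -d node1,node2 /restore/my_ks/my_cf
```

Because the streamed rows keep their original write timestamps, any newer version of a row already in the cluster wins the read-time reconciliation, which is why this cannot act as a rollback.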

With nodetool refresh the same happens, but there you still have the option of deleting the current SSTables after a nodetool flush. Unfortunately, that leads to a state where the node(s) where you have done this emit this error on any operation on the CF or KS: /var/opt/hosting/db/cassandra/data/one_cf/cf_1/one_cf-cf_1-hd-13-Data.db (No such file or directory)
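This is the per-node sequence that produced that error, again as a hedged sketch with hypothetical names (kept in dry-run mode on purpose, since it leaves the node broken):

```shell
#!/bin/sh
# The sequence that led to the missing-SSTable error (sketch; don't do this).
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$@" || "$@"; }

DATA_DIR=/var/lib/cassandra   # assumed data file directory
KS=my_ks; CF=my_cf            # hypothetical keyspace and column family

run nodetool flush "$KS" "$CF"                           # persist memtables to SSTables
run rm -f "$DATA_DIR/data/$KS/$CF/"*.db                  # delete the live SSTables...
run cp -a "/backup/$KS/$CF/." "$DATA_DIR/data/$KS/$CF/"  # ...restore the old ones...
run nodetool refresh "$KS" "$CF"                         # ...but C* still references the deleted files
```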

It's not obvious from the example I show, but that's exactly one of the SSTables I had just removed. That is, C* still tries to read SSTables that are no longer there, even after a nodetool refresh. Maybe this is a bug, but then that command's semantics are not clearly stated anywhere.

I found a simple workaround: as we're no longer interested in the data in its current state, I can simply drop the KS or CF and rebuild it afterwards with the data I get from the restore.

In the end, the procedure is like this:

  • Drop and recreate the CF or KS.
  • For all nodes, in parallel if possible:
    • Remove the snapshot created at drop time[2].
    • Restore the snapshot and move the data files to the right place.
    • nodetool refresh.
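Put together, the whole procedure can be sketched like this. Everything here is an assumption on top of the steps above: passwordless ssh to the nodes, hypothetical names and paths, a cqlsh that supports -e/-f for the schema steps (on older versions you'd drop and recreate through cassandra-cli instead), and a guess at where the drop-time snapshot lands, which varies by version:

```shell
#!/bin/sh
# Cluster-wide rollback without stopping nodes (sketch; names, paths and
# the snapshot location are assumptions). DRY_RUN (default) prints commands.
DRY_RUN=${DRY_RUN:-1}
run() { [ -n "$DRY_RUN" ] && echo "$@" || "$@"; }

DATA_DIR=/var/lib/cassandra   # assumed data file directory
KS=my_ks; CF=my_cf            # hypothetical keyspace and column family
NODES="node1 node2 node3 node4 node5"

# 1. Drop and recreate the CF; schema changes propagate to the whole cluster.
run cqlsh -e "DROP TABLE $KS.$CF"
run cqlsh -f "recreate_$CF.cql"   # hypothetical file holding the CF definition

# 2. On every node (in parallel if possible; shown sequentially for clarity):
for node in $NODES; do
    run ssh "$node" rm -rf "$DATA_DIR/data/$KS/$CF/snapshots"           # drop-time snapshot
    run ssh "$node" cp -a "/backup/$KS/$CF/." "$DATA_DIR/data/$KS/$CF/" # restore backup
    run ssh "$node" nodetool refresh "$KS" "$CF"                        # load the new SSTables
done
```

Since the CF was dropped and recreated, there are no leftover SSTables with newer timestamps, so this time nodetool refresh picks up exactly the restored data.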

[1] Of course, how many nodes you minimally need to restore depends on how easy it is to restore the data on each node and on your data's replication options and factors.

[2] I'm not sure this is documented anywhere.

sysadmin cassandra