<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>.:: Marcos Dione/StyXman's glob ::. (Posts about cassandra)</title><link>https://www.grulic.org.ar/~mdione/glob/</link><description></description><atom:link href="https://www.grulic.org.ar/~mdione/glob/categories/cassandra.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2025 &lt;a href="mailto:mdione@grulic.org.ar"&gt;Marcos Dione&lt;/a&gt; </copyright><lastBuildDate>Sat, 15 Nov 2025 20:52:04 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Incrementally backing up Cassandra with Amanda</title><link>https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/</link><dc:creator>Marcos Dione</dc:creator><description>&lt;p&gt;&lt;em&gt;Warning&lt;/em&gt;: this has not been tested yet.&lt;/p&gt;
&lt;p&gt;Again, TL;DR version at the end.&lt;/p&gt;
&lt;p&gt;They say that backing up in C* is really easy: you just run &lt;code&gt;nodetool
snapshot&lt;/code&gt;, which only creates a hardlink for each data file somewhere
else in the filesystem, and then you just back up those hardlinks.
Optionally, when you're done, you simply remove them and that's it.&lt;/p&gt;
&lt;p&gt;But that's only half of the story. The other half is taking those
snapshots and storing them somewhere else; let's say, a backup server,
so you can restore the data even in case of spontaneous combustion
followed by explosion due to short circuits caused by your dog peeing on
the machine. Not that that happens a lot in a datacenter, but one has to
plan for any contingency, right?&lt;/p&gt;
&lt;p&gt;In our case we use &lt;a href="http://wiki.zmanda.com/"&gt;Amanda&lt;/a&gt;, which internally
uses an implementation of &lt;code&gt;tar&lt;/code&gt; or GNU tar if asked for (yes, also other
tools if asked). The problems begin with how you define what to back up
and where C* puts those snapshots. The definitions are done by what
Amanda calls disklists, which are basically a list of directories to
back up entirely. On the other hand, for a column family Bar in a
keyspace Foo, whose data are normally stored in
&lt;code&gt;&amp;lt;data_file_directory&amp;gt;/Foo/Bar/&lt;/code&gt;, a snapshot is located in
&lt;code&gt;&amp;lt;data_file_directory&amp;gt;/Foo/Bar/snapshots/&amp;lt;something&amp;gt;&lt;/code&gt;, where something
can be a timestamp or a name defined by the user at snapshot time.&lt;/p&gt;
&lt;p&gt;If you want to simplify your backup configuration, you'll probably
want to give &lt;code&gt;&amp;lt;data_file_directory&amp;gt;/*/*/snapshots/&lt;/code&gt; as the dirs to
back up, but Amanda merrily can't expand wildcards in disklists. A way to
solve this is to create a directory sibling of &lt;code&gt;&amp;lt;data_file_directory&amp;gt;&lt;/code&gt;,
move the files in the snapshots there, and specify it in the disklists.
That kinda works...&lt;/p&gt;
&lt;p&gt;... until your second backup pass comes and you find out that even though
you specified an incremental backup, it copies over all the snapshot
files again. This is because when a hardlink is created, the ctime of
the inode is changed. Guess what &lt;code&gt;tar&lt;/code&gt; uses to see if a file has
changed... yes, ctime and mtime&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
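&lt;p&gt;The ctime effect is easy to check outside of C*. What follows is a minimal sketch, assuming a POSIX filesystem; the file names are made up for the test and the short pause is only there so that the timestamp granularity can't hide the change.&lt;/p&gt;

```python
import os
import tempfile
import time


def ctime_after_hardlink(pause=0.05):
    """Return (ctime_before, ctime_after, mtime_before, mtime_after) in ns."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, it only needs to be a regular file.
        data = os.path.join(tmp, "Foo-Bar-1-Data.db")
        with open(data, "w") as f:
            f.write("pretend these are SSTable contents")
        before = os.stat(data)
        # Wait a little so a coarse timestamp granularity cannot hide the change.
        time.sleep(pause)
        # Same operation a snapshot relies on: a new name for the same inode.
        os.link(data, os.path.join(tmp, "snapshot-Data.db"))
        after = os.stat(data)
        return (before.st_ctime_ns, after.st_ctime_ns,
                before.st_mtime_ns, after.st_mtime_ns)


c0, c1, m0, m1 = ctime_after_hardlink()
print("ctime changed:", c0 != c1)
print("mtime changed:", m0 != m1)
```

&lt;p&gt;The contents and the mtime stay the same, but the inode's link count changed, so its ctime moves, and that is enough for a ctime-based incremental to pick the file up again.&lt;/p&gt;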
&lt;p&gt;So we're back to square one, or zero even. Seems like the only solution
is to use C*'s native 'support' for incrementality, but the docs are
just
&lt;a href="http://www.datastax.com/docs/1.0/operations/backup_restore"&gt;a couple of paragraphs&lt;/a&gt;
that barely explain how they're done (surprise, the same way as the
snapshots) and how to activate it, which is the reason why we didn't
follow this path from the beginning. So in the end, it seems that you
can't use Amanda or &lt;code&gt;tar&lt;/code&gt; to make incremental backups, even with the
native support.&lt;/p&gt;
&lt;p&gt;But then there's a difference between the snapshot and the incremental
mode: with the snapshot method, you create the snapshot just before
backing it up, which sets all the ctimes to now. C*'s incremental mode
"hard-links each flushed SSTable to a backups directory under the
keyspace data directory", so they have roughly the same ctime as the
mtimes, and neither ever changes again (remember, SSTables are
immutable), at least until we do a snapshot, of course.&lt;/p&gt;
&lt;p&gt;One particularity that I noticed is that only new SSTables are backed
up, but not those that are the result of compactions. At first I
thought this was wrong, but after discussing the issue with &lt;code&gt;driftx&lt;/code&gt; on
the IRC channel and a confirmation by Tyler Hobbs on the mailing list,
we came to the following conclusion: if compacted SSTables were backed up too, at
restore time you would need to do a manual compaction to minimize data
duplication, which otherwise means more SSTables matched by the Bloom
filters, more disk reads/seeks per get and more space used; but if
you don't backup/restore those SSTables, the manual compaction is only
advisable. Also, as a consequence, you don't need to track which files
were deleted between backups.&lt;/p&gt;
&lt;p&gt;So the remaining problem is to know which files have been backed up,
because C* backups, just like snapshots, are not automatically cleaned.
I came up with the following solution, which might seem complicated at
first, but it really isn't.&lt;/p&gt;
&lt;p&gt;When we do a snapshot, which is perfect for full backups, we first
remove all the files present in the backup directory; incremental files
since the last incremental backup are not needed because we're doing a
full anyway. At the end of this we have the files ready for the full;
we do the backup, and we erase the files.&lt;/p&gt;
&lt;p&gt;Then on the following days we just add the incremental files produced so far, preceded
by a flush, so as to have the latest data in the SSTables and not depend
on CommitLogs. As they're only the diff against the files in the full,
and not the intermediate compacted SSTables, they're as big as they
should be (but also as small as they could be, if you're worried about disk
usage). Furthermore, the way we put files in the backup dir is via
symlinks, so it doesn't change the file's mtime or ctime, and we
configure Amanda to dereference symlinks.&lt;/p&gt;
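&lt;p&gt;A minimal sketch of the symlink staging, not the script we actually run: &lt;code&gt;stage_symlinks&lt;/code&gt;, the directory layout and the SSTable file name are all assumptions made up for the example. It also checks that creating the symlink leaves the target's ctime and mtime alone.&lt;/p&gt;

```python
import os
import tempfile
import time


def stage_symlinks(source_dir, staging_dir):
    """Symlink every regular file of source_dir into staging_dir."""
    os.makedirs(staging_dir, exist_ok=True)
    staged = []
    for name in sorted(os.listdir(source_dir)):
        target = os.path.join(source_dir, name)
        link = os.path.join(staging_dir, name)
        # Skip subdirectories and anything already staged by a previous run.
        if os.path.isfile(target) and not os.path.lexists(link):
            os.symlink(target, link)
            staged.append(name)
    return staged


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout and file name, only for the demonstration.
    backups = os.path.join(tmp, "Foo", "Bar", "backups")
    os.makedirs(backups)
    sstable = os.path.join(backups, "Foo-Bar-hc-1-Data.db")
    with open(sstable, "w") as f:
        f.write("data")
    before = os.stat(sstable)
    time.sleep(0.05)
    staged = stage_symlinks(backups, os.path.join(tmp, "staging"))
    after = os.stat(sstable)
    unchanged = (before.st_ctime_ns == after.st_ctime_ns
                 and before.st_mtime_ns == after.st_mtime_ns)
    print(staged, unchanged)
```

&lt;p&gt;A symlink is a new inode of its own, so the SSTable's inode is never touched; the backup tool only sees the original timestamps once it is told to follow the links.&lt;/p&gt;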
&lt;p&gt;Later, at restore time, the files are put in the backup directory, and
a script that takes the KS and CF from each file's name
'deals' them to the right directories.&lt;/p&gt;
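&lt;p&gt;The restore step could look roughly like this. It is a sketch under one big assumption: that file names start with &lt;code&gt;KEYSPACE-COLUMNFAMILY-&lt;/code&gt;, as in the made-up names below; &lt;code&gt;deal_restored_files&lt;/code&gt; and both directories are hypothetical too.&lt;/p&gt;

```python
import os
import shutil
import tempfile


def deal_restored_files(restore_dir, data_dir):
    """Move each restored file into data_dir/KEYSPACE/COLUMN_FAMILY/."""
    dealt = {}
    for name in sorted(os.listdir(restore_dir)):
        try:
            # Assumed naming scheme: KEYSPACE-COLUMNFAMILY-anything-else
            keyspace, column_family, _rest = name.split("-", 2)
        except ValueError:
            continue  # not named like an SSTable file, leave it where it is
        dest_dir = os.path.join(data_dir, keyspace, column_family)
        os.makedirs(dest_dir, exist_ok=True)
        shutil.move(os.path.join(restore_dir, name),
                    os.path.join(dest_dir, name))
        dealt[name] = dest_dir
    return dealt


with tempfile.TemporaryDirectory() as tmp:
    restore = os.path.join(tmp, "restore")
    data = os.path.join(tmp, "data")
    os.makedirs(restore)
    for name in ("Foo-Bar-hc-1-Data.db", "Foo-Baz-hc-7-Index.db", "README"):
        with open(os.path.join(restore, name), "w") as f:
            f.write("x")
    dealt = deal_restored_files(restore, data)
    bar_ok = os.path.isfile(
        os.path.join(data, "Foo", "Bar", "Foo-Bar-hc-1-Data.db"))
    baz_ok = os.path.isfile(
        os.path.join(data, "Foo", "Baz", "Foo-Baz-hc-7-Index.db"))
    left_behind = os.listdir(restore)
    print(sorted(dealt), left_behind)
```

&lt;p&gt;Files that don't match the naming scheme are left in place instead of being guessed at, which seems the safer default during a restore.&lt;/p&gt;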
&lt;h2&gt;TL;DR version&lt;/h2&gt;
&lt;h3&gt;Full backup&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Remove old incremental files and symlinks.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nodetool snapshot&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Symlink all the snapshot files to a backup directory.&lt;/li&gt;
&lt;li&gt;Back up that directory, dereferencing symlinks.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nodetool clearsnapshot&lt;/code&gt; and remove symlinks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Incremental backup&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nodetool flush&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Symlink all incremental files into the backup directory.&lt;/li&gt;
&lt;li&gt;Back up that directory, dereferencing symlinks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Restore&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Restore the last full backup and all the incrementals.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="http://sepp.oetiker.ch/tar-1.16.x-mo/tar_37.html#SEC88"&gt;&lt;code&gt;tar&lt;/code&gt;'s docs&lt;/a&gt;
  are not clear about what exactly it uses ("Incremental dumps depend
  crucially on time stamps"), but
  &lt;a href="http://wiki.zmanda.com/index.php/Exclude_and_include_lists"&gt;Amanda's&lt;/a&gt;
  seem to imply such a thing ("Tar has the ability to preserve the
  access times[;] however, doing so effectively disables incremental
  backups since resetting the access time alters the inode change
  time, which in turn causes the file to look like it needs to be
  archived again.") &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/#fnref:1" title="Jump back to footnote 1 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Actually, it's not that simple. &lt;a href="https://www.grulic.org.ar/~mdione/glob/posts/restoring-cassandra-online/"&gt;The previous post&lt;/a&gt; in this series already shows how it
could get more complicated. &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/#fnref:2" title="Jump back to footnote 2 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description><category>backup</category><category>cassandra</category><category>sysadmin</category><guid>https://www.grulic.org.ar/~mdione/glob/posts/incrementally-backing-up-cassandra-with-amanda/</guid><pubDate>Wed, 12 Sep 2012 19:19:23 GMT</pubDate></item><item><title>Cassandra counters are not atomic</title><link>https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/</link><dc:creator>Marcos Dione</dc:creator><description>&lt;p&gt;Today I arrived at work and I was shoved to a scrambled war room. Inside, we the two
sysadmins working with C*, our immediate boss, the DBA, the PHP developer involved
in this first migration project (from MySQL, this is important) and the project
leader replacing the one on vacation. Yesterday before I left I saw a similar
but more informal gathering around the other sysadmin who's testing the migration
in our preproduction environment. I was busy in the other corner of the office
with my backup tasks (I'm still struggling with a lot of constraints, but I
think I finally tackled it down, as in down to the floor. I hope it will just stay
still for production... but I digress), so I was unaware of the reason for the meeting,
except for the title in the mail: "Problem with the counter".&lt;/p&gt;
&lt;p&gt;If you followed this story closely, you have all the clues to know where I'm
going&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;. We're replacing the smallest of our databases, the avatar database,
which might be small (~50GiB of data), but it has a great impact, because not only
do a lot of our pages use it, but so do our customers and/or partners. Each image has
a unique ID implemented with a MySQL auto-increment key. Because of the impact,
we couldn't simply go for UUIDs.&lt;/p&gt;
&lt;p&gt;Now, we knew that even though C* has implemented counter columns since v0.8, there is no
guarantee that the counter changes are atomic from a cluster point of view.
What really surprised us is that this lack of guarantee also holds for a single node.
In other words, simultaneous changes to a counter (increments by 1, in our case)
are not atomic even when they're done on the same node.&lt;/p&gt;
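&lt;p&gt;To see why this hurts when the counter is used to hand out IDs, here is a toy model of the problem. It is plain Python and has nothing to do with C*'s actual code: the &lt;code&gt;Counter&lt;/code&gt; class and both functions are made up, and the interleaving is written out by hand so that the result is deterministic.&lt;/p&gt;

```python
class Counter:
    """A counter that only offers separate, unsynchronized read and write."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def write(self, value):
        self.value = value


def interleaved_increments(counter):
    """Two clients increment 'at the same time'; return the IDs they get."""
    seen_by_a = counter.read()      # client A reads 0
    seen_by_b = counter.read()      # client B reads 0 before A has written
    counter.write(seen_by_a + 1)    # A stores 1
    counter.write(seen_by_b + 1)    # B stores 1 again, A's update is lost
    return seen_by_a + 1, seen_by_b + 1


def serialized_increments(counter):
    """Same two increments, but each read-modify-write runs to completion."""
    ids = []
    for _client in ("A", "B"):
        new_value = counter.read() + 1
        counter.write(new_value)
        ids.append(new_value)
    return tuple(ids)


racy = Counter()
racy_ids = interleaved_increments(racy)
safe = Counter()
safe_ids = serialized_increments(safe)
print("interleaved:", racy_ids, "final value", racy.value)
print("serialized: ", safe_ids, "final value", safe.value)
```

&lt;p&gt;In the interleaved case both clients walk away with the same ID and the counter has advanced only once, which is exactly what must not happen to avatar IDs.&lt;/p&gt;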
&lt;p&gt;To make sure, I put on my hazmat suit&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; and plunged
head first, shit shovel in hand (in the end they were not needed, the code is
very readable), into C*'s code. After some maneuvering (it's not
a straight route, there was some back-and-forth) I got to
&lt;a href="https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Table.java#L384"&gt;the piece of code&lt;/a&gt;
that I was looking for. Basically it says that if there are no indexes to update
and no deletes, the update is done concurrently without any locks. And you cannot
index counters, of course, as it makes no sense.&lt;/p&gt;
&lt;p&gt;Clearly the subject of atomic counters is not something that C* plans to fix any
time soon&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;, and given the difficulty of it, I can understand that decision.
But I would expect atomicity at node level, so if one so desires, one can shoot
oneself in the foot "implementing" atomic counters by writing the updates
to only one node (and then some medium/heavy infra to implement HA).&lt;/p&gt;
&lt;p&gt;One more guy, 90 minutes and 5 possible solutions (including
&lt;a href="https://github.com/twitter/snowflake"&gt;snowflake&lt;/a&gt;) later, we decided to temporarily
keep using MySQL for the counter (remember, we're going online next week) and look
more closely at more permanent solutions later (this other guy, a sysadmin in another
group, has been on-and-off fighting to even make snowflake start on his own
machine), including of course, the long term goal of getting
rid of this kind of ID.&lt;/p&gt;
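&lt;p&gt;For the curious, the idea behind snowflake-style IDs fits in a few lines. This is only a sketch of the general scheme (timestamp in milliseconds, then 10 bits of worker ID, then a 12-bit per-millisecond sequence), not snowflake's code; &lt;code&gt;IdGenerator&lt;/code&gt;, its parameters and the injectable clock are made up for the example, and it does not protect against the clock going backwards.&lt;/p&gt;

```python
import threading
import time

WORKER_BITS = 10     # up to 1024 workers
SEQUENCE_BITS = 12   # up to 4096 IDs per worker per millisecond


class IdGenerator:
    """Roughly time-ordered IDs, unique without a shared counter."""

    def __init__(self, worker_id, epoch_ms=0, clock=None):
        if worker_id not in range(2 ** WORKER_BITS):
            raise ValueError("worker_id does not fit in WORKER_BITS")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        # The clock is injectable so that the example can be tested.
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:  # one ID at a time inside this process
            now = self.clock()
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) % (2 ** SEQUENCE_BITS)
                if self.sequence == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now == self.last_ms:
                        now = self.clock()
            else:
                # New millisecond (a clock going backwards is NOT handled).
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) * 2 ** (WORKER_BITS + SEQUENCE_BITS)
                    + self.worker_id * 2 ** SEQUENCE_BITS
                    + self.sequence)


gen = IdGenerator(worker_id=1, clock=lambda: 1000)  # frozen clock for the demo
ids = [gen.next_id() for _ in range(3)]
print(ids)
```

&lt;p&gt;Each worker only needs its own ID and a clock, so there is no counter to share; the price is that the IDs are no longer consecutive, and they are much larger than what an auto-increment key produces.&lt;/p&gt;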
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Well, if the title didn't give it away from the beginning. &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fnref:1" title="Jump back to footnote 1 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;If you feel curious, start &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-721"&gt;here&lt;/a&gt;. &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fnref:2" title="Jump back to footnote 2 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;3 years in France have made me somewhat careful... Dammit! &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/#fnref:3" title="Jump back to footnote 3 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description><category>cassandra</category><guid>https://www.grulic.org.ar/~mdione/glob/posts/cassandra-counters-are-not-atomic/</guid><pubDate>Fri, 24 Aug 2012 10:28:30 GMT</pubDate></item></channel></rss>