GlusterFS Algorithms: Replication (present)
Replication is the most necessarily complex part of GlusterFS – even more than distribution, which would probably be the most common guess. It’s also one of the things that sets GlusterFS apart from most of its obvious competitors. Many of them simply require that you implement RAID (to protect against disk failures) and heartbeat/failover (to protect against node failures) externally. Thanks, guys. The worst thing about that approach is that it depends on shared storage at least between failover sets, and the whole idea of a scale-out filesystem is supposed to be that it can run on cheap commodity hardware. A few alternatives do implement their own replication, and kudos to them, but I’m not here to describe how other projects do things. In GlusterFS, the approach is based on using extended attributes to mark files as potentially “dirty” while they’re being modified, so that they can be recovered if the modification fails on one replica in the middle. I posted a description of these extended attributes a while back, so here I’ll focus more on the algorithms that use them. The conceptual process of writing to a file consists of the following steps.
- Take out a lock on the file. This prevents simultaneous updates from occurring at the replicas in different orders, which is hard to reconcile. Perhaps even more importantly, it helps to avoid conflicts between a write and “self-heal” (repair) on the same file. It’s easy to focus on conflicts between clients, but this kind of client/internal conflict is important too.
- Mark the “changelog” (extended attributes) on each copy of the file, to represent the pending write. This is not really a log in any meaningful sense of the word, but that’s one among many unique uses of standard terms in the code so I’ll stick with it. It’s really a set of “dirty” counters (see the above link for more detail) which are incremented at this point. The important thing to note here is that each replica contains a changelog for every other replica, so that failure of one replica does not wipe out both the write and the record of its existence.
- Do the actual write on all replicas.
- As each replica write completes, update (decrement) the changelog on the others. Note that this process might be repeated if the replica count is greater than two.
- When all writes and changelog decrements are complete, release the lock.
Note that this is just the conceptual view. In reality, many optimizations can be applied. For example:
- Don’t clear the changelog on X for Y immediately if we’re still waiting to hear from Z, but wait until we can clear both together. Note that this means X’s changelog for Y might be out of date temporarily, leading to a different self-heal result than would be obtained if it were current.
- Skip the changelog increment if another write is already in progress, and likewise skip the decrement when that prior write completes. This is called “changelog piggybacking” because we’re basically letting the second write “ride along” on the first write’s changelog increment.
- Skip the unlock and subsequent lock operations when writes overlap. This is called “eager locking” and is essentially the same as changelog piggybacking except that it’s for lock operations instead of changelog operations. This is also not complete yet.
With these optimizations, the number of network round trips per write can go from an execrable five (or more) down to one – just the write. Even better, with write-behind enabled, these kinds of access patterns become much more likely. Unfortunately, many workloads either can’t allow write-behind or don’t provide an access pattern that it can optimize, so these optimizations won’t be optimized either and IMO treating them as the default for measuring performance is tantamount to cheating on benchmarks.
All of this might seem complex, but the real complexity is in how we use these changelog values to do self-heal. Here, we encounter some more unique terminology based on which replicas have non-zero changelog entries for which others. If X has a non-zero entry for Y, we say that X “accuses” Y (of having incomplete operations), and this leads to the following “characters” for each replica.
- IGNORANT means that the replica doesn’t even have a changelog – e.g. for a file that’s missing on one replica.
- INNOCENT means that the replica doesn’t accuse anyone.
- FOOL means that the replica accuses itself. In other words, it got as far as the changelog increment but not as far as the decrement, so we don’t actually know whether the write in between made it to disk.
- WISE means that the replica doesn’t accuse itself, but does accuse someone else.
The algorithm for determining which way to self-heal is really complicated, so I’ll just hit some of the highlights. You might recall that a replica does not have a changelog entry for itself, so how does it accuse itself (i.e. become a fool)? The secret is that an implicit self-accusation is made in some conditions – I confess that even I don’t fully understand how the code is making this distinction. The key is that this separates the aggregation of state from decisions about state, allowing the decision logic to work the same way in a whole bunch of different conditions. Some of the most common or important cases are:
- All nodes are innocent. I’m not even sure how we’d be self-healing in this case, but what we do is just pick one.
- Only one wise node exists. This is what will happen if that node finished the write and others didn’t (or might not have). The single wise node is clearly the source, and its data is propagated to any others that are less wise.
- Multiple wise nodes exist, but they don’t accuse each other. Just pick one.
- Multiple wise nodes exist, and there are accusations between them. This is the infamous “split brain” which we cannot resolve automatically. I recommend using the quorum enforcement feature to avoid this.
- None of the above apply. Pick the fool who accuses the others most strongly. The theory (I think) is that the node with the highest aggregate pending-operation count for the others is the one who has been up the most while others are down, so it’s most likely to have correct data. I could actually see an argument that this should be treated as split-brain instead, so I’d be interested in hearing others’ thoughts.
There’s a lot more – feel free to explore the vast forest of calls into and out of afr_build_sources if you really want to appreciate how complicated this is – but this is getting long already and we still need to discuss how self-heal is actually triggered. Historically, this has evolved quite a bit over time. The first idea was that it would be done “behind the scenes” when a file is looked up, and users wouldn’t have to worry about. Not too surprisingly, people who actually deployed GlusterFS were uncomfortable with having a potentially large and – more importantly – unknown number of “vulnerable” files after a failure. Thus, it became common for people to run a full scan across the entire volume (using “find | stat” or “ls -alR”) to force a quicker self-heal after a failure. Recently GlusterFS has started to do this automatically through its own self-heal daemon, and even more recently code was added to log which files need self-heal instead of requiring a full scan which can take days to weeks. (This is the same basic idea I had demonstrated back in November of 2010, but is implemented quite differently.) In GlusterFS 3.3 or 3.4, the result will be an automatic and (reasonably) efficient self-heal process, which might be one of the most significant improvements since the new cluster- and volume-management framework was added in 3.1.
I was going to write about some emergent properties of this approach, and some directions for the future, but this post has gotten quite long enough. I’ll save that for next time.
 Update March 13: Avati points out that bricks do have a changelog entry for themselves now, and I’ve verified this to be the case. Mystery solved.