There was this team which was running a pretty complicated data
storage, leader election and "discovery" service. They had something
like 3200 machines with something like 300 different
clusters/cells/ensembles/...(*) running across them. This service ran
something kind of like etcd, only not that.
The way it worked was that a bunch of "participant" machines would start
an election process, and then they'd decide who was going to lead them
for a while. That leader got to handle all of the write traffic and it
did all of the usual raft/paxos-ish spooky coordination stuff amongst
the participants, including updating the others, and dealing with hosts
that go away and come back later, and so on. It's all table stakes for
this kind of service.
This group of clusters had started out relatively simple but had grown
into a monster over the years. Nobody probably expected them to have
hundreds of clusters and thousands of machines, but they now did, and
they were having trouble keeping track of everything. There were
constant outages, and since they were so low in the stack, when they
broke, lots of other stuff broke.
I wanted to know just what the ground truth looked like and so started
something really stupid from my development machine. It would take a
list of their servers and would crawl them, interrogating the TCP ports
on which the service ran. This was only about 10 ports per machine, so
while the total (around 32000 ports across the fleet) sounded
obnoxiously high, it was still workable for prototyping purposes.
On these ports, there were simple text-based commands which could be
sent, and the instance would answer with config information about what
it was running. It was possible to derive the identity of the
cluster from that. Given all of this and a scrape of the entire fleet,
it was possible to see which host+port combinations were actually
supporting any given cluster, and thus see how well they were doing.
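
For what it's worth, the crawl itself was about as dumb as it sounds.
Here's a rough sketch of the shape of it in Python - the "dumpcfg"
command, the key=value reply format, and the "cluster" field are all
stand-ins for the real protocol, and it assumes the far end closes the
connection after answering:

    import socket

    def probe(host, port, timeout=2.0):
        # Connect, send a text command, read the reply, pull out the
        # cluster identity.  (Made-up command and reply format.)
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                s.sendall(b"dumpcfg\n")
                chunks = []
                while True:
                    data = s.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
        except OSError:
            return None                      # port closed, host down, etc.
        text = b"".join(chunks).decode("utf-8", "replace")
        fields = dict(line.split("=", 1)
                      for line in text.splitlines() if "=" in line)
        return fields.get("cluster")

    def crawl(hosts, ports):
        """Scrape every host+port combination; return what answered."""
        found = {}                           # (host, port) -> cluster name
        for host in hosts:
            for port in ports:
                cluster = probe(host, port)
                if cluster:
                    found[(host, port)] = cluster
        return found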
Early results from this terrible manual scraping started showing promise.
Misconfigurations were showing up all over the place - clusters that
were supposed to have 5 hosts but only had 3 in practice, with the
other two missing in action somewhere, clusters with non-standard host
counts, clusters in the wrong spots, and so on.
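
Catching that kind of thing doesn't take anything clever, either. Once
the scrape exists, it's basically just counting - something like this,
where "expected" is a hypothetical mapping of cluster name to how many
members it's supposed to have:

    from collections import defaultdict

    def find_misconfigured(found, expected):
        # Group the (host, port) -> cluster results from the crawl by
        # cluster, then compare against the expected member counts.
        members = defaultdict(set)
        for (host, port), cluster in found.items():
            members[cluster].add((host, port))
        problems = []
        for cluster, want in expected.items():
            have = len(members.get(cluster, ()))
            if have != want:
                problems.append((cluster, want, have))
        return problems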
To get away from the "printf | nc the world in cron" thing, we wound up
writing this dumb little agent thing that would run on all of the ~3200
hosts. It would do the same crawling, but it would happen over loopback
so it was a good bit faster by removing long hauls over the production
network from the equation. It also took the load of polling ~32000
ports off my one development machine, and it was inherently parallel.
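
The agent itself didn't need to be anything fancy. Here's roughly the
shape of it, reusing the crawl sketch from above - the HTTP serving
part, the port numbers, and the other specifics are invented for
illustration, not how the real one worked:

    import json, threading, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SNAPSHOT = {"when": 0, "instances": {}}  # latest scrape of this box

    def refresher(ports):
        # Rescan the local ports once a minute, over loopback only.
        global SNAPSHOT
        while True:
            found = crawl(["127.0.0.1"], ports)
            SNAPSHOT = {"when": time.time(),
                        "instances": {f"{h}:{p}": c
                                      for (h, p), c in found.items()}}
            time.sleep(60)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hand back the latest snapshot to whoever asks.
            body = json.dumps(SNAPSHOT).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        ports = range(9000, 9010)            # made-up port range
        threading.Thread(target=refresher, args=(ports,), daemon=True).start()
        HTTPServer(("", 8080), Handler).serve_forever()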
It was now possible to just query an agent and get a list of everything
running on that box. It would refresh things every minute, so it was
far more current than my terrible script which might run every couple of
hours (since it was so slow). This made things even better, and so we
needed an aggregator.
We did some magic to make each of these agents create a little "beacon"
somewhere any time they were run. Our "aggregator" process would start
up and would subscribe to the spot where beacons were being created. It
would then schedule the associated host for checks, where it would
speak to the agent on that host and ask for a copy of its results.
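
The "somewhere" is doing some work in that sentence, so in this sketch
the beacons are just files in a shared directory - a stand-in for the
real mechanism, which was something else entirely. The rest of the
shape is the same: notice a beacon, schedule the host, go ask its agent
for its latest results:

    import json, os, time, urllib.request

    BEACON_DIR = "/some/shared/spot/beacons"   # hypothetical stand-in
    AGENT_PORT = 8080                          # matches the agent sketch
    results = {}                               # host -> latest snapshot

    def poll(host):
        try:
            with urllib.request.urlopen(f"http://{host}:{AGENT_PORT}/",
                                        timeout=5) as resp:
                return json.load(resp)
        except OSError:
            return None

    while True:
        # Every beacon is a file named after the host that dropped it.
        for host in os.listdir(BEACON_DIR):
            snapshot = poll(host)
            if snapshot is not None:
                results[host] = snapshot
        time.sleep(60)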
So now we had an agent on every one of the ~3200 hosts, each polling 10
local ports, plus an aggregator that talked to the ~3200 agents and
refreshed the data from them.
Finally, all of the data was available in one place with a single query
that was really fast. The next step was to write a bunch of simple
"dashboard" web pages which allowed anyone to look at the entire fleet,
or to narrow it down by certain parameters - a given cluster (of these
servers), a given region, data center, whatever.
With all of this visible with just a few clicks, it was pretty clear
that we needed something more to actually find the badness for us. It
was all well and good to go clicking around while knowing what things
are supposed to look like, but there were supposed to be rules about
this sort of thing: this many hosts in a cluster, no more than N hosts
per failure domain, and more.
...
Failure domains are a funny thing. Let's say you have five hosts which
form a quorum and which are supposed to be high-availability. You'd
probably want to spread them around, right? If they were serving
clients from around the world, maybe you'd put them in different
locations and never put two in the same spot? If something violated
that, how would you know?
Here's an example of bad placement. We had this one cluster which was
supposed to be spread out throughout an entire region which was composed
of multiple datacenter buildings, each with multiple (compute) clusters
in it, with different racks and so on down the line. But, because it
had been turned up early in the life of that region when only a handful
of hosts had existed, all of them were in the same two or three racks.
Worse still, those racks were physically adjacent. Put another way, if
the servers had arms and hands, they could have high-fived each other
across the hot and cold aisles in the datacenter suite. That's how
close together they were. One bad event in a certain spot would have
wiped out all of their data.
We had to write a schema which would let us express limits for a given
cluster - how many regions it should be in, the maximum number of
members per host, rack, (compute) cluster, building, region, etc. Then
we wrote a tool to let us create rules, and then started using that to
churn out rulesets. Next we came up with some tools which would fetch
the current state of affairs (from the agent/aggr combo) and compare it
to the rulesets. Anything out of "compliance" would show up right away.
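
The schema doesn't have to be complicated to be useful. Here's a sketch
of one way the per-cluster rules and the compliance check could look.
The field names and the shape of the "location" record are invented
here, and the real thing had more to it:

    from collections import Counter

    # Hypothetical shapes, e.g.:
    #   rules    = {"members": 5, "max_per": {"rack": 1, "building": 2}}
    #   location = {"region": "xyz", "building": "b1", "rack": "r42"}
    def check_cluster(name, locations, rules):
        violations = []
        if len(locations) != rules["members"]:
            violations.append("%s: %d members, wants %d"
                              % (name, len(locations), rules["members"]))
        for domain, limit in rules["max_per"].items():
            counts = Counter(loc[domain] for loc in locations)
            for value, n in counts.items():
                if n > limit:
                    violations.append("%s: %d members in %s %s (limit %d)"
                                      % (name, n, domain, value, limit))
        return violations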
...
Then there was the problem of managing the actual ~3200 hosts. With a
footprint that big, there's always something happening. A location gets
turned up and new hosts appear. Another location is taken down after
the machines get too old and those hosts go away. We kept having
outages where a decom would be scheduled, and then someone far away
would run a script with a bunch of --force type commands, and it would
just yank the machines and wipe them. The script had no regard for what
those machines were actually doing, and people managed to clobber a
bunch of stuff this
way. It just kept happening.
This is when I had to do something that does not scale. I said to the
decom crew that they should treat any host owned by this team as off
limits because we do not have things under control. That means
never *ever* running a decom script against these hosts while they
are still owned by the team.
I further added that while we're working to get things under control,
if for some reason a decom is blocked due to this decree of mine, they
are to contact me, any time of day or night, and I will get them
unblocked... somehow. I figured it was my way of showing that I had
"skin in the game" for making such a stupid and unreasonable demand.
I've often said that the way to get something fixed is to make sure
someone is in the path of the badness so they will feel it when
something screws up. This was my way of doing exactly that.
We stopped having decom-related outages. We instead started having
these "fire drill" type events where one or two people on the team
(and I) would have to drop what they were doing and spend a few hours
manually replacing machines in various clusters to free them up.
Obviously, this couldn't stand, and so we started in on another project.
This one was more of a "fleet manager", where a dumb little service
would keep track of which machines the team owned, and it would store a
series of bits for each one that I called "intents".
There were only three bits per host: drain, release, freeze. Not all
combinations were valid.
If no bits were set on a host, that meant it was intended for production
use. If it had a server on it, that was fine. If someone needed a
replacement, it was potentially available (assuming it met the other
requirements, like being far enough away from the other participants).
If the "drain" bit was set, that meant it was not supposed to be
serving. Any server on it should be taken off by replacing it with
an available host which itself isn't marked for "drain" (or worse).
The "release" bit meant that if a host no longer had anything running
on it, then it should be released back to the machine provisioning
system. In doing this, the name of the machine changed, and thus the
ownership (and responsibility) for it left the team, and it was no
longer our problem. The people doing decoms would take it from there.
"Freeze" was a special bit which was intended as a safety mechanism to
stop a runaway automation system. If that bit was set on a host, none
of the tools would change anything on it. It's one of those things
where you should never need to use it, but you'll be sorry if you don't
write it and then need it some day.
"Drain" + "release" meant "keep trying to kick instances off this host
and don't add any new ones", and then "once it becomes empty, give it
back".
Other combinations of the bits (like "release" without "drain") were
invalid and were rejected by the automation.
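
In code, the whole thing is tiny. Something like this, where the bit
names match what's above and everything else is just one way to write
it (only the one combination called out above is encoded as invalid):

    from enum import Flag, auto

    class Intent(Flag):
        NONE = 0
        DRAIN = auto()        # don't serve here; move instances off
        RELEASE = auto()      # once empty, give the machine back
        FREEZE = auto()       # hands off: no tool touches this host

    def valid(intent):
        # "release" without "drain" doesn't make sense and gets rejected.
        if (intent & Intent.RELEASE) and not (intent & Intent.DRAIN):
            return False
        return True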
I should note that this was meant to be level-triggered, meaning on
every single pass, if a host had a bit set and yet wasn't matching up
with that intent or those intents, something should try to drain it, or
give it away, or whatever. Even if it failed, it should try again on
the next pass, and failures should be unusual and thus reported to the
humans.
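
One way to write that per-pass logic is as a pure function over the
intent bits (from the sketch above) plus what's actually running, so
every pass recomputes everything from scratch - again, just a sketch:

    def plan_actions(hosts):
        """hosts: dict of hostname -> (Intent, count of instances running)."""
        actions = []
        for host, (intent, running) in hosts.items():
            if intent & Intent.FREEZE:
                continue                          # never touch a frozen host
            if (intent & Intent.DRAIN) and running > 0:
                actions.append(("drain", host))   # move an instance elsewhere
            if (intent & Intent.RELEASE) and running == 0:
                actions.append(("release", host)) # give it back to provisioning
        return actions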
...
Then there was also the pre-existing system which took config files and
used them to install instances on machines. This system worked just fine,
but it only did that part of the process. It didn't close the loop and
so many parts of the service lifecycle wound up unmanaged by it.
Looking back at this, you can now see that we could establish a bunch of
"sets" with the data available.
Configs: "where we told it to run"
Agent + aggregator: "where it's actually managing to run"
Checker: "what rules these things should be obeying"
Fleet manager: "which machines should be serving (or not), which
machines we should hang onto (or give back)".
Doing different operations on those sets yielded different things.
[configs] x [agent/aggr] = hosts which are doing what they are supposed
to be doing, hosts which are supposed to be serving but aren't for some
reason, and hosts which are NOT supposed to be running but are running
it anyway. It would find sick machines, failures in the config system,
weird hand-installed hack jobs in dark corners, and worse.
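
In code terms, that first comparison is just set arithmetic. A sketch,
assuming both sides boil down to sets of (host, port, cluster) tuples:

    # "configs" from the config system, "actual" from the agent/aggr data.
    def compare(configs, actual):
        ok      = configs & actual    # doing exactly what they were told
        missing = configs - actual    # supposed to be serving, but aren't
        rogue   = actual - configs    # running something nobody configured
        return ok, missing, rogue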
[agent/aggr] x [checker] = clusters which are actually spread out
correctly, and clusters which are actually spread out incorrectly,
(possibly because of bad configs, but could be any reason).
[agent/aggr] x [fleet manager] = hosts which are serving where that's
okay, hosts which need to be drained until empty, and hosts which are
now empty and can be given back.
[configs] x [checker] = are out-of-spec clusters due to the configs
telling them to be in the wrong spot, or is something else going on?
You don't really need to do this one, since if the first one checks out,
then you know that everything is running exactly what it was told to
run.
[configs] x [fleet manager] = if you ever get to a point where you
completely trust that the configs are being implemented by the machines
(because some other set operations are clear), then you could find
mismatches this way. You wouldn't necessarily have to resort to the
empirical data, and indeed, could stop scanning for it.
For that matter, the whole port-scanning agent/aggr combination
shouldn't have needed to exist in theory, but in practice, independent
verification was needed.
I should point out that my engagement with this team was not viewed
kindly by management, and my reports about what had been going on
ultimately got me in trouble more than anything else. It's kind of
amazing, considering I was working with them as a result of a direct
request for reliability help, but shooting the messenger is nothing new.
This engagement taught me that a lot of so-called technical problems are
in fact rooted in human issues, and those usually come from management.
There's more that happened as part of this whole process, but this post
has gotten long enough.
...
(*) I'm using "clusters" here to primarily refer to the groups of 5, 7,
or 9 hosts which participated in a quorum and kept the state of the
world in sync. Note that there's also the notion of a "compute
cluster", which is just a much larger group of perhaps tens of thousands
of machines (all with various owners), and that does show up in this
post in a couple of places, and is called out explicitly when it does.