Crowdstrike recently released an "External Technical Root Cause
Analysis"
[PDF] (via)
for their recent extreme failure. The writeup is rather unclear
about what exactly happened, but I think I understand it and if I
do, it's an uncomfortably easy programming mistake to make. So
here is my version of the core programming issue.
Part of Crowdstrike's agent is a signature matching system, where
they match signature patterns (templates) against what's going
on. There are different types of patterns (template types)
depending on what sort of activity the signature pattern is matching.
Rather than have one big regular expression or equivalent in each
pattern that matches against all data for the activity, all smashed
together and encoded somehow, Crowdstrike's code breaks this out
so pattern types have some number of fields (each of them a
separate regular expression or equivalent) and are supplied with
matching input values (or input data) extracted from the activity
in order to do the field by field matching.
(You could invert this so that during the matching of each pattern
(template), the matching code calls back to obtain the input
values from the activity, but since you can have a lot of different
templates for a given template type, it's more efficient to extract
the data once and then reuse it across all of the templates of that
type. You can still do this with the call back pattern, but it's
more complicated.)
For whatever reason, the input values were passed not as separate
arguments but as a single variable size array, the input data
array (although this might not have been as a literal array but
instead as varargs function arguments). It's possible that the core
matching code was in fact generic; it was given a template with
some number of fields and an input data array with some number of
fields and it walked through matching them up. If all of the field
matching genuinely is text regular expression matches, this wouldn't
be a crazy way to structure the code. You'd have one general matching
system plus a collection of template type specific code to that was
passed information about the activity and had the job of pulling
data out of it and populating the input data array for its patterns.
The mismatch occurred when the IPC templates specified and matched
against 21 fields, but the input data array only had 20 pieces of
data. The issue had previously been masked because all previous IPC
templates had a wildcard match for the 21st field and this skipped
the actual comparison. In hand-waved Go code, one version of this
might look like:
func (t *Template) DoesMatch(eventdata SomeType[]) bool {
for i := range t.Fields {
if t.Fields[i] != MatchesAll &&
!FieldMatches(t.Fields[i], eventdata[i]) {
return false
}
}
return true
}
(To be clear, I'm not claiming that this is what Crowdstrike's code
was doing. All we know about the actual code from Crowdstrike's
report is that it involved an arity
mismatch that wasn't detected at compile time.)
Now, you would certainly like this code to start with a check that
'len(t.Fields) == len(eventdata)' so that it doesn't panic if there's
a length mismatch. But you might not put that check in, depending
on the surrounding context. And in general this code and design is,
in my opinion, not particularly bad or unnatural; you might write
something like it in all sorts of situations where you have a
variable amount of data that needs to be provided to something
(or various different things).
Compilers and language environments are broadly not all that good
at statically detecting dynamic arity mismatches; it's a hard problem
involving things like dataflow analysis. You can sometimes change
the code to avoid being dynamic this way, but it might involve
awkward restructuring that leads to other issues. For example, you
might have a per template type 'myEventData' structure with named
fields, insuring that you can never access an undefined field, but
then you probably have to have per template type matching code so
that you can create and fill in the right myEventData structure and
call everything in a type-safe way (although remember to write tests
that verify that every field in the structure is actually filled
in). This would involve duplicating the logic of 'check all of these
templates of the give type' across these functions, and that code
might have complexities of its own.
(Modern language environments have done a lot of work to detect
arity problems in things like printf() or its equivalents, but this
is generally done by specific knowledge of how to determine the
arity (and types) given a format string or the like.)
If you have a master source of truth for what the arity should
be (how many fields or arguments are required), such as a data file
definition, you can build additional tooling to check that the arity
is correct or to automatically generate test cases that will verify
that. But this requires having that source of truth and building
the generally custom tooling to use it. You also have to insure
that the test cases are doing what you think they're doing, which
might have been another issue in play here (the people writing the
IPC template type code and its tests may not have known that wildcard
field matches were short circuited so that they didn't even look
at the matching element of the input data array).
(IDLs
also don't have a great reputation among programmers, partly because
there have been a lot of bad IDLs over the years that people have
been forced to deal with.)
There are also intermediate positions between the fully dynamic
arity issue that Crowdstrike apparently had and the fully type safe
version you could write with enough work. But they all involve some
runtime dynamic behavior, which means at least a possibility for
runtime mistakes, which would have to be carefully caught lest they
crash your code. For instance, in Go you could have a per template
type structure, pass it through the common code as an interface{}
,
and then type-assert it back to
the real structure type in your per template type matcher function.
But you'd have to handle the possibility of that type assertion
failing, however unlikely it should be.
PS: None of this should be taken as excusing Crowdstrike. They were
writing software that ran in an extremely privileged and important
situation, and they should have done that (much) better.
PPS: Neither Go nor Rust by themselves will save you from this
specific sort of dynamic arity mistake; they merely change how your
code will fail, from an invalid memory dereference crash to a panic
in the language runtime or your (macro-expanded) code.