
You can cheat a test suite with a big enough polynomial

24 June 2025 at 16:27

Hi nerds, I'm back from Systems Distributed! I'd heartily recommend it, wildest conference I've been to in years. I have a lot of work to catch up on, so this will be a short newsletter.

In an earlier version of my talk, I had a gag about unit tests. First I showed the test f([1,2,3]) == 3, then said that this was satisfied by f(l) = 3, f(l) = l[-1], f(l) = len(l), and f(l) = (129*l[0]-34*l[1]-617)*l[2] - 443*l[0] + 1148*l[1] - 182. Then I progressively ruled them out one by one with more unit tests, except for the last polynomial, which stubbornly passed every single test.

If you're given some function f(x: int, y: int, …): int and a set of unit tests asserting that specific inputs give specific outputs, then you can find a polynomial that passes every single unit test.

To find the gag polynomial, and as SMT practice, I wrote a Python program that finds a polynomial passing a test suite meant for max. It's hardcoded for three parameters and only finds 2nd-order polynomials, but I think it could be generalized with enough effort.

The code

Full code here, breakdown below.

from z3 import *  # type: ignore
s1, s2 = Solver(), Solver()

Z3 is just the particular SMT solver we use, as it has good language bindings and a lot of affordances.

As part of learning SMT I wanted to do this two ways: first by putting the polynomial "outside" of the SMT solver in a Python function, second by doing it "natively" in Z3. I created two solvers so I could test both versions in one run.

a0, a, b, c, d, e, f = Consts('a0 a b c d e f', IntSort())
x, y, z = Ints('x y z')
t = "a*x+b*y+c*z+d*x*y+e*x*z+f*y*z+a0"

Both Const('x', IntSort()) and Int('x') do the exact same thing, the latter being syntactic sugar for the former. I did not know this when I wrote the program.

To keep the two versions in sync I represented the equation as a string, which I later eval. This is one of the rare cases where eval is a good idea: it helps us experiment more quickly while learning. The polynomial is a "2nd-order polynomial" even though it doesn't have x^2 terms, because it has the xy, xz, and yz cross terms.

lambdamax = lambda x, y, z: eval(t)

z3max = Function('z3max', IntSort(), IntSort(), IntSort(),  IntSort())
s1.add(ForAll([x, y, z], z3max(x, y, z) == eval(t)))

lambdamax is pretty straightforward: create a lambda with three parameters and eval the string. The string "a*x" then becomes the Python expression a*x, where a is an SMT symbol while the x SMT symbol is shadowed by the lambda parameter. To reiterate, a terrible idea in practice, but a good way to learn faster.

z3max function is a little more complex. Function takes an identifier string and N "sorts" (roughly the same as programming types). The first N-1 sorts define the parameters of the function, while the last becomes the output. So here I assign the string identifier "z3max" to be a function with signature (int, int, int) -> int.

I can load the function into the model by specifying constraints on what z3max could be. This could either be a strict input/output, as will be done later, or a ForAll over all possible inputs. Here I just use that directly to say "for all inputs, the function should match this polynomial." But I could do more complicated constraints, like commutativity (f(x, y) == f(y, x)) or monotonicity (Implies(x < y, f(x) <= f(y))).
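
For instance, here's a minimal standalone sketch (mine, not part of the original program) of pinning down uninterpreted functions with exactly those kinds of ForAll constraints:

from z3 import Function, IntSort, Ints, ForAll, Implies, Solver

g = Function('g', IntSort(), IntSort(), IntSort())  # g : (int, int) -> int
h = Function('h', IntSort(), IntSort())             # h : int -> int
x, y = Ints('x y')

s = Solver()
s.add(ForAll([x, y], g(x, y) == g(y, x)))            # commutativity
s.add(ForAll([x, y], Implies(x < y, h(x) <= h(y))))  # monotonicity
print(s.check())  # expect sat: constant functions, for example, satisfy both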

Note ForAll takes a list of z3 symbols to quantify over. That's the only reason we need to define x, y, z in the first place. The lambda version doesn't need them.

inputs = [(1,2,3), (4, 2, 2), (1, 1, 1), (3, 5, 4)]

for g in inputs:
    s1.add(z3max(*g) == max(*g))
    s2.add(lambdamax(*g) == max(*g))

This sets up the joke: adding constraints to each solver that the polynomial it finds must, for a fixed list of triplets, return the max of each triplet.

for s, func in [(s1, z3max), (s2, lambdamax)]:
    if s.check() == sat:
        m = s.model()
        for x, y, z in inputs:
            print(f"max([{x}, {y}, {z}]) =", m.evaluate(func(x, y, z)))
        print(f"max([x, y, z]) = {m[a]}x + {m[b]}y",
            f"+ {m[c]}z +", # linebreaks added for newsletter rendering
            f"{m[d]}xy + {m[e]}xz + {m[f]}yz + {m[a0]}\n")

Output:

max([1, 2, 3]) = 3
# etc
max([x, y, z]) = -133x + 130y + -10z + -2xy + 62xz + -46yz + 0

max([1, 2, 3]) = 3
# etc
max([x, y, z]) = -17x + 16y + 0z + 0xy + 8xz + -6yz + 0

I find that z3max (top) consistently finds larger coefficients than lambdamax does. I don't know why.

Practical Applications

Test-Driven Development recommends a strict "red-green refactor" cycle. Write a new failing test, make the new test pass, then go back and refactor. Well, the easiest way to make the new test pass would be to paste in a new polynomial, so that's what you should be doing. You can even do this all automatically: have a script read the set of test cases, pass them to the solver, and write the new polynomial to your code file. All you need to do is write the tests!
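
If you want to commit to the bit, a rough sketch of that script (mine, untested in anger) might look like this. It assumes a hypothetical tests.json holding [[x, y, z], expected] pairs and reuses the 2nd-order polynomial from above:

import json
from z3 import Ints, Solver, sat

a0, a, b, c, d, e, f = Ints('a0 a b c d e f')

def fit_polynomial(test_file="tests.json"):
    s = Solver()
    for (x, y, z), expected in json.load(open(test_file)):
        # constrain the polynomial to return the expected value for this case
        s.add(a*x + b*y + c*z + d*x*y + e*x*z + f*y*z + a0 == expected)
    if s.check() != sat:
        raise ValueError("no 2nd-order polynomial passes this test suite")
    m = s.model()
    return (f"def f(x, y, z):\n"
            f"    return ({m[a]})*x + ({m[b]})*y + ({m[c]})*z"
            f" + ({m[d]})*x*y + ({m[e]})*x*z + ({m[f]})*y*z + ({m[a0]})")

# print(fit_polynomial())  # paste the output over your old implementation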

Pedagogical Notes

Writing the script took me a couple of hours. I'm sure an LLM could have whipped it all up in five minutes, but I really want to learn SMT, and LLMs may decrease learning retention.1 Z3 documentation is not... great for non-academics, though, and most other SMT solvers have even worse docs. One trick I use regularly is GitHub code search: find code using the same APIs and study how it works. Turns out reading API-heavy code is a lot easier than writing it!

Anyway, I'm very, very slowly feeling like I'm getting the basics of how to use SMT. I don't have any practical use cases yet, but I've wanted to learn this skill for a while and I'm glad I finally did.


  1. Caveat: I have not actually read the study; for all I know it could have a sample size of three people. I'll get around to it eventually. 

Solving LinkedIn Queens with SMT

12 June 2025 at 15:43

No newsletter next week

I’ll be speaking at Systems Distributed. My talk isn't close to done yet, which is why this newsletter is both late and short.

Solving LinkedIn Queens in SMT

The article Modern SAT solvers: fast, neat and underused claims that SAT solvers1 are "criminally underused by the industry". A while back on the newsletter I asked "why": how come they're so powerful and yet nobody uses them? Many experts responded saying the reason is that encoding problems directly as SAT kind of sucks, and they'd rather use tools that compile to SAT.

I was reminded of this when I read Ryan Berger's post on solving “LinkedIn Queens” as a SAT problem.

A quick overview of Queens. You’re presented with an NxN grid divided into N regions, and have to place N queens so that there is exactly one queen in each row, column, and region. While queens can be on the same diagonal, they cannot be diagonally adjacent.

(Important note: Linkedin “Queens” is a variation on the puzzle game Star Battle, which is the same except the number of stars you place in each row/column/region varies per puzzle, and is usually two. This is also why 'queens' don’t capture like chess queens.)

An image of a solved queens board. Copied from https://ryanberger.me/posts/queens

Ryan solved this by writing Queens as a SAT problem, expressing properties like "there is exactly one queen in row 3" as a large number of boolean clauses. Go read his post, it's pretty cool. What leapt out to me was that he used CVC5, an SMT solver.2 SMT solvers are "higher-level" than SAT, capable of handling more data types than just boolean variables. It's a lot easier to solve the problem at the SMT level than at the SAT level. To show this, I whipped up a short demo of solving the same problem in Z3 (via the Python API).

Full code here, which you can compare to Ryan's SAT solution here. I didn't do a whole lot of cleanup on it (again, time crunch!), but short explanation below.

The code

from z3 import * # type: ignore
from itertools import combinations, chain, product
solver = Solver()
size = 9 # N

Initial setup and modules. size is the number of rows/columns/regions in the board, which I'll call N below.

# queens[n] = col of queen on row n
# by construction, not on same row
queens = IntVector('q', size) 

SAT represents the queen positions via N² booleans: q_00 means that a Queen is on row 0 and column 0, !q_05 means a queen isn't on row 0 col 5, etc. In SMT we can instead encode it as N integers: q_0 = 5 means that the queen on row 0 is positioned at column 5. This immediately enforces one class of constraints for us: we don't need any constraints saying "exactly one queen per row", because that's embedded in the definition of queens!

(Incidentally, using 0-based indexing for the board was a mistake on my part, it makes correctly encoding the regions later really painful.)

To actually make the variables [q_0, q_1, …], we use the Z3 affordance IntVector(str, n) for making n variables at once.

solver.add([And(0 <= i, i < size) for i in queens])
# not on same column
solver.add(Distinct(queens))

First we constrain all the integers to [0, N), then use the incredibly handy Distinct constraint to force all the integers to have different values. This guarantees at most one queen per column, which by the pigeonhole principle means there is exactly one queen per column.

# not diagonally adjacent
for i in range(size-1):
    q1, q2 = queens[i], queens[i+1]
    solver.add(Abs(q1 - q2) != 1)

One of the rules is that queens can't be adjacent. We already know that they can't be horizontally or vertically adjacent via other constraints, which leaves the diagonals. We only need to add constraints that, for each queen, there is no queen in the lower-left or lower-right corner, aka q_3 != q_2 ± 1. We don't need to check the top corners because if q_1 is in the upper-left corner of q_2, then q_2 is in the lower-right corner of q_1!

That covers everything except the "one queen per region" constraint. But the regions are the tricky part, which we should expect because we vary the difficulty of queens games by varying the regions.

regions = {
        "purple": [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8),
                   (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0),
                   (1, 1), (8, 1)],
        "red": [(1, 2), (2, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2), (7, 1), (7, 2), (8, 2), (8, 3),],
        # you get the picture
        }

# Some checking code left out, see below

The regions have to be manually coded in, which is a huge pain.

(In the link, some validation code follows. Since it breaks up explaining the model I put it in the next section.)

for r in regions.values():
    solver.add(Or(
        *[queens[row] == col for (row, col) in r]
        ))

Finally we have the region constraint. The easiest way I found to say "there is exactly one queen in each region" is to say "there is a queen in region 1 and a queen in region 2 and a queen in region 3", etc. Then to say "there is a queen in region purple" I wrote "q_0 = 0 OR q_0 = 1 OR … OR q_1 = 0", etc.

Why iterate over every position in the region instead of doing something like (0, q[0]) in r? I tried that but it's not an expression that Z3 supports.

if solver.check() == sat:
    m = solver.model()
    print([(l, m[l]) for l in queens])

Finally, we solve and print the positions. Running this gives me:

[(q__0, 0), (q__1, 5), (q__2, 8), 
 (q__3, 2), (q__4, 7), (q__5, 4), 
 (q__6, 1), (q__7, 3), (q__8, 6)]

Which is the correct solution to the queens puzzle. I didn't benchmark the solution times, but I imagine it's considerably slower than a raw SAT solver. Glucose is really, really fast.

But even so, solving the problem with SMT was a lot easier than solving it with SAT. That satisfies me as an explanation for why people prefer it to SAT.

Sanity checks

One bit I glossed over earlier was the sanity checking code. I knew for sure that I was going to make a mistake encoding the regions, and the solver wasn't going to provide useful information about what I did wrong. In cases like these, I like adding small tests and checks to catch mistakes early, because the solver certainly isn't going to catch them!

all_squares = set(product(range(size), repeat=2))
def test_i_set_up_problem_right():
    assert all_squares == set(chain.from_iterable(regions.values()))

    for r1, r2 in combinations(regions.values(), 2):
        assert not set(r1) & set(r2), set(r1) & set(r2)

The first check is a quick test that I didn't leave any squares out or accidentally put the same square in two regions. Converting the values into sets makes both checks a lot easier. Honestly I don't know why I didn't just use sets from the start, sets are great.

def render_regions():
    colormap = ["purple",  "red", "brown", "white", "green", "yellow", "orange", "blue", "pink"]
    board = [[0 for _ in range(size)] for _ in range(size)] 
    for (row, col) in all_squares:
        for color, region in regions.items():
            if (row, col) in region:
                board[row][col] = colormap.index(color)+1

    for row in board:
        print("".join(map(str, row)))

render_regions()

The second check is something that prints out the regions. It produces something like this:

111111111
112333999
122439999
124437799
124666779
124467799
122467899
122555889
112258899

I can compare this to the picture of the board to make sure I got it right. I guess a more advanced solution would be to print emoji squares like 🟥 instead.
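
Something like this would probably do it, assuming render_regions is tweaked to return its filled-in board. (The emoji mapping is arbitrary, and there's no pink square emoji, so pink gets a flower.)

EMOJI = ["🟪", "🟥", "🟫", "⬜", "🟩", "🟨", "🟧", "🟦", "🌸"]

def render_regions_emoji(board):
    for row in board:
        # board stores colormap.index(color) + 1, so shift down by one
        print("".join(EMOJI[cell - 1] for cell in row))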

Neither check is quality code but it's throwaway and it gets the job done so eh.

Update for the Internet

This was sent as a weekly newsletter, which is usually on topics like software history, formal methods, unusual technologies, and the theory of software engineering. You can subscribe here.


  1. "Boolean SATisfiability Solver", aka a solver that can find assignments that make complex boolean expressions true. I write a bit more about them here

  2. "Satisfiability Modulo Theories" 

AI is a gamechanger for TLA+ users

5 June 2025 at 14:59

New Logic for Programmers Release

v0.10 is now available! This is a minor release, mostly focused on logic-based refactoring, with new material on set types and on testing that refactors are correct. See the full release notes at the changelog page. Due to conference pressure, v0.11 will also likely be a minor release.

The book cover

AI is a gamechanger for TLA+ users

TLA+ is a specification language to model and debug distributed systems. While very powerful, it's also hard for programmers to learn, and there's always questions of connecting specifications with actual code.

That's why The Coming AI Revolution in Distributed Systems caught my interest. In the post, Cheng Huang claims that Azure successfully used LLMs to examine an existing codebase, derive a TLA+ spec, and find a production bug in that spec. "After a decade of manually crafting TLA+ specifications", he wrote, "I must acknowledge that this AI-generated specification rivals human work".

This inspired me to experiment with LLMs in TLA+ myself. My goals are a little less ambitious than Cheng's: I wanted to see how LLMs could help junior specifiers write TLA+, rather than handling the entire spec automatically. Details on what did and didn't work below, but my takeaway is that LLMs are an immense specification force multiplier.

All tests were done with a standard VSCode Copilot subscription, using Claude 3.7 in Agent mode. Other LLMs or IDEs may be more or less effective, etc.

Things Claude was good at

Fixing syntax errors

TLA+ uses a very different syntax than mainstream programming languages, meaning beginners make a lot of mistakes where they use "programming syntax" instead of TLA+ syntax:

NotThree(x) = \* should be ==, not =
    x != 3 \* should be #, not !=

The problem is that the TLA+ syntax checker, SANY, is 30 years old and doesn't provide good information. Here's what it says for that snippet:

Was expecting "==== or more Module body"
Encountered "NotThree" at line 6, column 1

That only isolates one error and doesn't tell us what the problem is, only where it is. Experienced TLA+ users get "error eyes" and can quickly see what the problem is, but beginners really struggle with this.

The TLA+ foundation has made LLM integration a priority, so the VSCode extension naturally supports several agent actions. One of these is running SANY, meaning an agent can get an error, fix it, get another error, fix it, etc. Provided the above sample and asked to make it work, Claude successfully fixed both errors. It also fixed many errors in a larger spec, as well as figured out why PlusCal specs weren't compiling to TLA+.

This by itself is already enough to make LLMs a worthwhile tool, as it fixes one of the biggest barriers to entry.

Understanding error traces

When TLA+ finds a violated property, it outputs the sequence of steps that leads to the error. This starts in plaintext, and VSCode parses it into an interactive table:

An example error trace

Learning to read these error traces is a skill in itself. You have to understand what's happening in each step and how it relates back to the actually broken property. It takes a long time for people to learn how to do this well.

Claude was successful here, too, accurately reading 20+ step error traces and giving a high-level explanation of what went wrong. It also could condense error traces: if ten steps of the error trace could be condensed into a one-sentence summary (which can happen if you're modeling a lot of process internals) Claude would do it.

I did have issues here with doing this in agent mode: while the extension does provide a "run model checker" command, the agent would regularly ignore this and prefer to run a terminal command instead. This would be fine except that the LLM consistently hallucinated invalid commands. I had to amend every prompt with "run the model checker via vscode, do not use a terminal command". You can skip this if you're willing to copy and paste the error trace into the prompt.

As with syntax checking, if this was the only thing LLMs could effectively do, that would already be enough1 to earn a strong recommend. Even as a TLA+ expert I expect I'll be using this trick regularly.

Boilerplate tasks

TLA+ has a lot of boilerplate. One of the most notorious examples is UNCHANGED rules. Specifications are extremely precise — so precise that you have to specify what variables don't change in every step. This takes the form of an UNCHANGED clause at the end of relevant actions:

RemoveObjectFromStore(srv, o, s) ==
  /\ o \in stored[s]
  /\ stored' = [stored EXCEPT ![s] = @ \ {o}]
  /\ UNCHANGED <<capacity, log, objectsize, pc>>

Writing this is really annoying. Updating these whenever you change an action, or add a new variable to the spec, is doubly so. Syntax checking and error analysis are important for beginners, but this is what I wanted for myself. I took a spec and prompted Claude

Add UNCHANGED <> for each variable not changed in an action.

And it worked! It successfully updated the UNCHANGED in every action.

(Note, though, that it was a "well-behaved" spec in this regard: only one "action" happened at a time. In TLA+ you can have two actions happen simultaneously, that each update half of the variables, meaning neither of them should have an UNCHANGED clause. I haven't tested how Claude handles that!)

That's the most obvious win, but Claude was good at handling other tedious work, too. Some examples include updating vars (the conventional collection of all state variables), lifting a hard-coded value into a model parameter, and changing data formats. Most impressive to me, though, was rewriting a spec designed for one process to instead handle multiple processes. This means taking all of the process variables, which originally have types like Int, converting them to types like [Process -> Int], and then updating the uses of all of those variables in the spec. It didn't account for race conditions in the new concurrent behavior, but it was an excellent scaffold to do more work.

Writing properties from an informal description

You have to be pretty precise with your intended property description, but Claude handles converting that precise description into TLA+'s formalized syntax, which is something beginners often struggle with.

Things it is less good at

Generating model config files

To model check TLA+, you need both a specification (.tla) and a model config file (.cfg), which have separate syntaxes. Asking the agent to generate the latter often led to it using TLA+ syntax. It automatically fixed this after getting parsing errors, though.

Fixing specs

Whenever it ran the model checker and discovered a bug, it would naturally propose a change to either the invalid property or the spec. Sometimes the changes were good, other times the changes were not physically realizable. For example, if it found that a bug was due to a race condition between processes, it would often suggest fixing it by saying race conditions were okay. I mean yes, if you say bugs are okay, then the spec finds that bugs are okay! Or it would alternatively suggest adding a constraint to the spec saying that race conditions don't happen. But that's a huge mistake in specification, because race conditions happen if we don't have coordination. We need to specify the mechanism that is supposed to prevent them.

Finding properties of the spec

After seeing how capable it was at translating my properties to TLA+, I started prompting Claude to come up with properties on its own. Unfortunately, almost everything I got back was either trivial, uninteresting, or too coupled to implementation details. I haven't tested if it would work better to ask it for "properties that may be violated".

Generating code from specs

I have to be specific here: Claude could sometimes convert Python into a passable spec, and vice versa. It wasn't good at recognizing abstraction. For example, TLA+ specifications often represent sequential operations with a state variable, commonly called pc. If modeling code that nonatomically retrieves a counter value and increments it, we'd have one action that requires pc = "Get" and sets the new value to "Inc", then another that requires it be "Inc" and sets it to "Done".
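
For concreteness, here's roughly the kind of Python a pc = "Get" / "Inc" sequence abstracts (my illustration, not code from these experiments):

counter = 0

def increment():
    global counter
    tmp = counter       # pc = "Get": read the current value
    counter = tmp + 1   # pc = "Inc": write the incremented value back; then pc = "Done"

The point of pc is that the read and the write are separate steps, so other processes can interleave between them.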

I found that Claude would try to somehow convert pc into part of the Python program's state, rather than recognize it as a TLA+ abstraction. In the other direction, when converting Python code to TLA+ it would often try to translate things like sleep into some part of the spec, not recognizing that it is abstractable into a distinct action. I didn't test other possible misconceptions, like converting randomness to nondeterminism.

For the record, when converting TLA+ to Python Claude tended to make simulators of the spec, rather than possible production code implementing the spec. I really wasn't expecting otherwise though.

Unexplored Applications

Things I haven't explored thoroughly but could possibly be effective, based on what I know about TLA+ and AI:

Writing Java Overrides

Most TLA+ operators are resolved via TLA+ interpreters, but you can also implement them in "native" Java. This lets you escape the standard language semantics and add capabilities like executing programs during model-checking or dynamically constraining the depth of the searched state space. There are a lot of cool things I think would be possible with overrides. The problem is there's only a handful of people in the world who know how to write them. But that handful have written quite a few overrides, and I think there's enough there for Claude to work with.

Writing specs, given a reference mechanism

In all my experiments, the LLM only had my prompts and the occasional Python script as information. That makes me suspect that some of its problems with writing and fixing specs come down to not having a system model. Maybe it wouldn't suggest fixes like "these processes never race" if it had a design doc saying that the processes can't coordinate.

(Could a Sufficiently Powerful LLM derive some TLA+ specification from a design document?)

Connecting specs and code

This is the holy grail of TLA+: taking a codebase and showing it correctly implements a spec. Currently the best ways to do this are by either using TLA+ to generate a test suite, or by taking logged production traces and matching them to TLA+ behaviors. This blog post discusses both. While I've seen a lot of academic research into these approaches there are no industry-ready tools. So if you want trace validation you have to do a lot of manual labour tailored to your specific product.

If LLMs could do some of this work for us then that'd really amplify the usefulness of TLA+ to many companies.

Thoughts

Right now, agents seem good at the tedious and routine parts of TLA+ and worse at the strategic and abstraction parts. But, since the routine parts are often a huge barrier to beginners, this means that LLMs have the potential to make TLA+ far, far more accessible than it previously was.

I have mixed thoughts on this. As an advocate, this is incredible. I want more people using formal specifications because I believe it leads to cheaper, safer, more reliable software. Anything that gets people comfortable with specs is great for our industry. As a professional TLA+ consultant, I'm worried that this obsoletes me. Most of my income comes from training and coaching, which companies will have far less demand for now. Then again, maybe this is an opportunity to pitch "agentic TLA+ training" to companies!

Anyway, if you're interested in TLA+, there has never been a better time to try it. I mean it, these tools handle so much of the hard part now. I've got a free book available online, as does the inventor of TLA+. I like this guide too. Happy modeling!


  1. Dayenu. 

What does "Undecidable" mean, anyway

28 May 2025 at 19:34

Systems Distributed

I'll be speaking at Systems Distributed next month! The talk is brand new and will aim to showcase some of the formal methods mental models that would be useful in mainstream software development. It has added some extra stress on my schedule, though, so expect the next two monthly releases of Logic for Programmers to be mostly minor changes.

What does "Undecidable" mean, anyway

Last week I read Against Curry-Howard Mysticism, which is a solid article I recommend reading. But this newsletter is actually about one comment:

I like to see posts like this because I often feel like I can’t tell the difference between BS and a point I’m missing. Can we get one for questions like “Isn’t XYZ (Undecidable|NP-Complete|PSPACE-Complete)?”

I've already written one of these for NP-complete, so let's do one for "undecidable". Step one is to pull a technical definition from the book Automata and Computability:

A property P of strings is said to be decidable if ... there is a total Turing machine that accepts input strings that have property P and rejects those that do not. (pg 220)

Step two is to translate the technical computer science definition into more conventional programmer terms. Warning, because this is a newsletter and not a blog post, I might be a little sloppy with terms.

Machines and Decision Problems

In automata theory, all inputs to a "program" are strings of characters, and all outputs are "true" or "false". A program "accepts" a string if it outputs "true", and "rejects" if it outputs "false". You can think of this as automata studying all pure functions of type f :: string -> boolean. Problems solvable by finding such an f are called "decision problems".

This covers more than you'd think, because we can bootstrap more powerful functions from these. First, as anyone who's programmed in bash knows, strings can represent any other data. Second, we can fake non-boolean outputs by instead checking if a certain computation gives a certain result. For example, I can reframe the function add(x, y) = x + y as a decision problem like this:

IS_SUM(str) {
    x, y, z = split(str, "#")
    return x + y == z
}

Then because IS_SUM("2#3#5") returns true, we know 2 + 3 == 5, while IS_SUM("2#3#6") is false. Since we can bootstrap parameters out of strings, I'll just say it's IS_SUM(x, y, z) going forward.
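
As a sanity check, here's a direct Python rendering of that pseudocode (my translation, nothing official about it):

def is_sum(s: str) -> bool:
    x, y, z = (int(n) for n in s.split("#"))
    return x + y == z

assert is_sum("2#3#5")
assert not is_sum("2#3#6")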

A big part of automata theory is studying different models of computation with different strengths. One of the weakest is called "DFA". I won't go into any details about what DFA actually can do, but the important thing is that it can't solve IS_SUM. That is, if you give me a DFA that takes inputs of form x#y#z, I can always find an input where the DFA returns true when x + y != z, or an input which returns false when x + y == z.

It's really important to keep this model of "solve" in mind: a program solves a problem if it correctly returns true on all true inputs and correctly returns false on all false inputs.

(total) Turing Machines

A Turing Machine (TM) is a particular type of computation model. It's important for two reasons:

  1. By the Church-Turing thesis, a Turing Machine is the "upper bound" of how powerful (physically realizable) computational models can get. This means that if an actual real-world programming language can solve a particular decision problem, so can a TM. Conversely, if the TM can't solve it, neither can the programming language.1

  2. It's possible to write a Turing machine that takes a textual representation of another Turing machine as input, and then simulates that Turing machine as part of its computations.

Property (1) means that we can move between different computational models of equal strength, proving things about one to learn things about another. That's why I'm able to write IS_SUM in pseudocode instead of writing it in terms of the TM computational model (and why I was able to use split for convenience).

Property (2) does several interesting things. First of all, it makes it possible to compose Turing machines. Here's how I can roughly ask if a given number is the sum of two primes, with "just" addition and boolean functions:

IS_SUM_TWO_PRIMES(z):
    x := 1
    y := 1
    loop {
        if x > z {return false}
        if IS_PRIME(x) {
            if IS_PRIME(y) {
                if IS_SUM(x, y, z) {
                    return true;
                }
            }
        }
        y := y + 1
        if y > x {
            x := x + 1
            y := 0
        }
    }

Notice that without the if x > z {return false}, the program would loop forever on z=2. A TM that always halts for all inputs is called total.

Property (2) also makes "Turing machines" a possible input to functions, meaning that we can now make decision problems about the behavior of Turing machines. For example, "does the TM M either accept or reject x within ten steps?"2

IS_DONE_IN_TEN_STEPS(M, x) {
    for (i = 0; i < 10; i++) {
        `simulate M(x) for one step`
        if(`M accepted or rejected`) {
            return true
        }
    }
    return false
}

Decidability and Undecidability

Now we have all of the pieces to understand our original definition:

A property P of strings is said to be decidable if ... there is a total Turing machine that accepts input strings that have property P and rejects those that do not. (220)

Let IS_P be the decision problem "Does the input satisfy P?" Then IS_P is decidable if it can be solved by a Turing machine, i.e., I can provide some IS_P(x) machine that always accepts if x has property P, and always rejects if x doesn't have property P. If I can't do that, then IS_P is undecidable.

IS_SUM(x, y, z) and IS_DONE_IN_TEN_STEPS(M, x) are decidable properties. Is IS_SUM_TWO_PRIMES(z) decidable? Some analysis shows that our corresponding program will either find a solution, or have x>z and return false. So yes, it is decidable.

Notice there's an asymmetry here. To prove some property is decidable, I just need to find one program that correctly solves it. To prove some property is undecidable, I need to show that any possible program, no matter what it is, doesn't solve it.

So with that asymmetry in mind, are there any undecidable problems? Yes, quite a lot. Recall that Turing machines can accept encodings of other TMs as input, meaning we can write a TM that checks properties of Turing machines. And, by Rice's Theorem, almost every nontrivial semantic3 property of Turing machines is undecidable. The conventional way to prove this is to first find a single undecidable property H, and then use that to bootstrap undecidability of other properties.

The canonical and most famous example of an undecidable problem is the Halting problem: "does machine M halt on input i?" It's pretty easy to prove undecidable, and easy to use it to bootstrap other undecidability properties. But again, any nontrivial semantic property is undecidable. Checking whether a TM is total is undecidable. Checking whether a TM accepts any inputs is undecidable. Checking whether a TM solves IS_SUM is undecidable. Etc etc etc.
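
That proof is short enough to sketch in Python-flavored pseudocode (my sketch, not from the newsletter; halts is a purely hypothetical total checker):

def halts(M, x):
    # hypothetical: returns True iff program M halts on input x
    raise NotImplementedError("the argument below shows this can't exist")

def paradox(M):
    if halts(M, M):
        while True:   # if M halts on itself, loop forever
            pass
    else:
        return        # if M loops on itself, halt immediately

# Does paradox halt on input paradox? If halts(paradox, paradox) returns True,
# paradox loops forever; if it returns False, paradox halts. Either answer is
# wrong, so no total halts machine can exist.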

What this doesn't mean in practice

I often see the halting problem misconstrued as "it's impossible to tell if a program will halt before running it." This is wrong. The halting problem says that we cannot create an algorithm that, when applied to an arbitrary program, tells us whether the program will halt or not. It is absolutely possible to tell if many programs will halt or not. It's possible to find entire subcategories of programs that are guaranteed to halt. It's possible to say "a program constructed following constraints XYZ is guaranteed to halt."

The actual consequence of undecidability is more subtle. If we want to know if a program has property P, undecidability tells us

  1. We will have to spend time and mental effort to determine if it has P
  2. We may not be successful.

This is subtle because we're so used to living in a world where everything's undecidable that we don't really consider what the counterfactual would be like. In such a world there might be no need for Rust, because "does this C program guarantee memory-safety" is a decidable property. The entire field of formal verification could be unnecessary, as we could just check properties of arbitrary programs directly. We could automatically check if a change in a program preserves all existing behavior. Lots of famous math problems could be solved overnight.

(This to me is a strong "intuitive" argument for why the halting problem is undecidable: a halt detector can be trivially repurposed as a program optimizer / theorem-prover / bcrypt cracker / chess engine. It's too powerful, so we should expect it to be impossible.)

But because we don't live in that world, all of those things are hard problems that take effort and ingenuity to solve, and even then we often fail.

Update for the Internet

This was sent as a weekly newsletter, which is usually on topics like software history, formal methods, unusual technologies, and the theory of software engineering. You can subscribe here.


  1. To be pedantic, a TM can't do things like "scrape a webpage" or "render a bitmap", but we're only talking about computational decision problems here. 

  2. One notation I've adopted in Logic for Programmers is marking abstract sections of pseudocode with backticks. It's really handy! 

  3. Nontrivial meaning "at least one TM has this property and at least one TM doesn't have this property". Semantic meaning "related to whether the TM accepts, rejects, or runs forever on a class of inputs". IS_DONE_IN_TEN_STEPS is not a semantic property, as it doesn't tell us anything about inputs that take longer than ten steps. 

Finding hard 24 puzzles with planner programming

20 May 2025 at 18:21

Planner programming is a programming technique where you solve problems by providing a goal and actions, and letting the planner find actions that reach the goal. In a previous edition of Logic for Programmers, I demonstrated how this worked by solving the 24 puzzle with planning. For reasons discussed here I replaced that example with something more practical (orchestrating deployments), but left the code online for posterity.

Recently I saw a family member try and fail to vibe code a tool that would find all valid 24 puzzles, and I realized I could adapt the puzzle solver into a puzzle generator. First I'll explain the puzzle rules, then the original solver, then the generator.1 For a much longer intro to planning, see here.

The rules of 24

You're given four numbers and have to find some elementary equation (+, -, *, /, and groupings) that uses all four numbers and results in 24. Each number must be used exactly once, but the numbers don't need to be used in the order given. Some examples:

  • [6, 6, 6, 6] -> 6+6+6+6=24
  • [1, 1, 6, 6] -> (6+6)*(1+1)=24
  • [4, 4, 4, 5] -> 4*(5+4/4)=24

Some setups are impossible, like [1, 1, 1, 1]. Others are possible only with non-elementary operations, like [1, 5, 5, 324] (which requires exponentiation).

The solver

We will use Picat, the only language I know of with a built-in planner module. The current state of our plan will be represented by a single list containing all of the numbers.

import planner, math.
import cp.

action(S0, S1, Action, Cost) ?=>
  member(X, S0)
  , S0 := delete(S0, X) % , is `and`
  , member(Y, S0)
  , S0 := delete(S0, Y)
  , (
      A = $(X + Y) 
    ; A = $(X - Y)
    ; A = $(X * Y)
    ; A = $(X / Y), Y > 0
    )
    , S1 = S0 ++ [apply(A)]
  , Action = A
  , Cost = 1
  .

This is our "action", and it works in three steps:

  1. Nondeterministically pull two different values out of the input, deleting them
  2. Nondeterministically pick one of the basic operations
  3. The new state is the remaining elements, appended with that operation applied to our two picks.

Let's walk through this with [1, 6, 1, 7]. There are four choices for X and three for Y. If the planner chooses X=6 and Y=7, A = $(6 + 7). This is an uncomputed term, in the same way lisps might use quotation. We can resolve the computation with apply, as in the line S1 = S0 ++ [apply(A)].

final([N]) =>
  N =:= 24. % handle floating point

Our final goal is just a list whose only element is 24. The comparison has to tolerate floating point values (from division), which is what =:= does.

main =>
  Start = [1, 5, 5, 6]
  , best_plan(Start, 4, Plan)
  , printf("%w %w%n", Start, Plan)
  .

For main, we just find the best plan with the maximum cost of 4 and print it. When run from the command line, picat automatically executes whatever is in main.

$ picat 24.pi
[1,5,5,6] [1 + 5,5 * 6,30 - 6]

I don't want to spoil any more 24 puzzles, so let's stop showing the plan:

main =>
- , printf("%w %w%n", Start, Plan)
+ , printf("%w%n", Start)

Generating puzzles

Picat provides a find_all(X, p(X)) function, which returns all X for which p(X) is true. In theory, we could write find_all(S, best_plan(S, 4, _)). In practice, there are an infinite number of valid puzzles, so we need to bound S somewhat. We also don't want to find any redundant puzzles, such as [6, 6, 6, 4] and [4, 6, 6, 6].

We can solve both issues by writing a helper valid24(S), which checks that S is a sorted list of integers within some bounds, like 1..8, and also has a valid solution.

valid24(Start) =>
  Start = new_list(4)
  , Start :: 1..8 % every value in 1..8
  , increasing(Start) % sorted ascending
  , solve(Start) % turn into values
  , best_plan(Start, 4, Plan)
  .

This leans on Picat's constraint solving features to automatically find bounded sorted lists, which is why we need the solve step.2 Now we can just loop through all of the values in find_all to get all solutions:

main =>
  foreach([S] in find_all(
    [Start],
    valid24(Start)))
    printf("%w%n", S)
  end.
$ picat 24.pi

[1,1,1,8]
[1,1,2,6]
[1,1,2,7]
[1,1,2,8]
# etc

Finding hard puzzles

Last Friday I realized I could do something more interesting with this. Once I have found a plan, I can apply further constraints to the plan, for example to find problems that can be solved with division:

valid24(Start, Plan) =>
  Start = new_list(4)
  , Start :: 1..8
  , increasing(Start)
  , solve(Start)
  , best_plan(Start, 4, Plan)
+ , member($(_ / _), Plan)
  .

In playing with this, though, I noticed something weird: there are some solutions that appear if I sort up but not down. For example, [3,3,4,5] appears in the solution set, but [5, 4, 3, 3] doesn't appear if I replace increasing with decreasing.

As far as I can tell, this is because Picat only finds one best plan, and [5, 4, 3, 3] has multiple solutions, including 4*(5+3/3) and 3*(5+4)-3. best_plan is a deterministic operator, so Picat commits to the first best plan it finds. So if it finds 3*(5+4)-3 first, it sees that the solution doesn't contain a division, throws [5, 4, 3, 3] away as a candidate, and moves on to the next puzzle.

There's a couple ways we can fix this. We could replace best_plan with best_plan_nondet, which can backtrack to find new plans (at the cost of an enormous number of duplicates). Or we could modify our final to only accept plans with a division:

% Hypothetical change
final([N]) =>
+ member($(_ / _), current_plan()),
  N =:= 24.

My favorite "fix" is to ask another question entirely. While I was looking for puzzles that can be solved with division, what I actually want is puzzles that must be solved with division. What if I rejected any puzzle that has a solution without division?

+ plan_with_no_div(S, P) => best_plan_nondet(S, 4, P), not member($(_ / _), P).

valid24(Start, Plan) =>
  Start = new_list(4)
  , Start :: 1..8
  , increasing(Start)
  , solve(Start)
  , best_plan(Start, 4, Plan)
- , member($(_ / _), Plan)
+ , not plan_with_no_div(Start, _)
  .

The new line's a bit tricky. plan_with_no_div nondeterministically finds a plan, and then fails if the plan contains a division.3 Since I used best_plan_nondet, it can backtrack from there and find a new plan. This means plan_with_no_div only fails if no such plan exists. And in valid24, we only succeed if plan_with_no_div fails, guaranteeing that the only existing plans use division. Since this doesn't depend on the plan found via best_plan, it doesn't matter how the values in Start are arranged; this will not miss any valid puzzles.

Aside for my logic book readers

The new clause is equivalent to !(some p: Plan(p) && !(div in p)). Applying the simplifications we learned:

  1. !(some p: Plan(p) && !(div in p)) (init)
  2. all p: !(plan(p) && !(div in p)) (all/some duality)
  3. all p: !plan(p) || div in p) (De Morgan's law)
  4. all p: plan(p) => div in p (implication definition)

Which more obviously means "if P is a valid plan, then it contains a division".

Back to finding hard puzzles

Anyway, with not plan_with_no_div, we are filtering puzzles on the set of possible solutions, not just specific solutions. And this gives me an idea: what if we find puzzles that have only one solution?

different_plan(S, P) => best_plan_nondet(S, 4, P2), P2 != P.

valid24(Start, Plan) =>
+ , not different_plan(Start, Plan)

I tried this from 1..8 and got:

[1,2,7,7]
[1,3,4,6]
[1,6,6,8]
[3,3,8,8]

These happen to be some of the hardest 24 puzzles known, though not all of them. Note this is assuming that (X + Y) and (Y + X) are different solutions. If we say they're the same (by writing A = $(X + Y), X <= Y in our action) then we get a lot more puzzles, many of which are considered "easy". Other "hard" things we can look for include plans that require fractions:

plan_with_no_fractions(S, P) => 
  best_plan_nondet(S, 4, P)
  , not(
    member(X, P),
    round(apply(X)) =\= X
  ).

% insert `not plan...` in valid24 as usual

Finally, we could try seeing if a negative number is required:

plan_with_no_negatives(S, P) => 
  best_plan_nondet(S, 4, P)
  , not(
    member(X, P),
    apply(X) < 0
  ).

Interestingly this one returns no solutions, so you are never required to construct a negative number as part of a standard 24 puzzle.


  1. The code below is different from the old book version, as it uses fancier logic programming features that aren't good in learning material. 

  2. increasing is a constraint predicate. We could alternatively write sorted, which is a Picat logical predicate and must be placed after solve. There doesn't seem to be any efficiency gains either way. 

  3. I don't know what the standard is in Picat, but in Prolog, the convention is to use \+ instead of not. They mean the same thing, so I'm using not because it's clearer to non-LPers. 

Modeling Awkward Social Situations with TLA+

14 May 2025 at 16:02

You're walking down the street and need to pass someone going the opposite way. You take a step left, but they're thinking the same thing and take a step to their right, aka your left. You're still blocking each other. Then you take a step to the right, and they take a step to their left, and you're back to where you started. I've heard this called "walkwarding".

Let's model this in TLA+. TLA+ is a formal methods tool for finding bugs in complex software designs, most often involving concurrency. Two people trying to get past each other just also happens to be a concurrent system. A gentler introduction to TLA+'s capabilities is here, an in-depth guide teaching the language is here.

The spec

---- MODULE walkward ----
EXTENDS Integers

VARIABLES pos
vars == <<pos>>

Double equals defines a new operator, single equals is an equality check. <<pos>> is a sequence, aka array.

you == "you"
me == "me"
People == {you, me}

MaxPlace == 4

left == 0
right == 1

I've gotten into the habit of assigning string "symbols" to operators so that the compiler complains if I misspelled something. left and right are numbers so we can shift position with right - pos.

direction == [you |-> 1, me |-> -1]
goal == [you |-> MaxPlace, me |-> 1]

Init ==
  \* left-right, forward-backward
  pos = [you |-> [lr |-> left, fb |-> 1], me |-> [lr |-> left, fb |-> MaxPlace]]

direction, goal, and pos are "records", or hash tables with string keys. I can get my left-right position with pos.me.lr or pos["me"]["lr"] (or pos[me].lr, as me == "me").

Juke(person) ==
  pos' = [pos EXCEPT ![person].lr = right - @]

TLA+ breaks the world into a sequence of steps. In each step, pos is the value of pos in the current step and pos' is the value in the next step. The main outcome of this semantics is that we "assign" a new value to pos by declaring pos' equal to something. But the semantics also open up lots of cool tricks, like swapping two values with x' = y /\ y' = x.

TLA+ is a little weird about updating functions. To set f[x] = 3, you gotta write f' = [f EXCEPT ![x] = 3]. To make things a little easier, the rhs of a function update can contain @ for the old value. ![person].lr = right - @ is the same as ![person].lr = right - pos[person].lr, so it swaps left and right.

("Juke" comes from here)

Move(person) ==
  LET new_pos == [pos[person] EXCEPT !.fb = @ + direction[person]]
  IN
    /\ pos[person].fb # goal[person]
    /\ \A p \in People: pos[p] # new_pos
    /\ pos' = [pos EXCEPT ![person] = new_pos]

The EXCEPT syntax can be used in regular definitions, too. This lets someone move one step in their goal direction unless they are at the goal or someone is already in that space. /\ means "and".

Next ==
  \E p \in People:
    \/ Move(p)
    \/ Juke(p)

I really like how TLA+ represents concurrency: "In each step, there is a person who either moves or jukes." It can take a few uses to really wrap your head around but it can express extraordinarily complicated distributed systems.

Spec == Init /\ [][Next]_vars

Liveness == <>(pos[me].fb = goal[me])
====

Spec is our specification: we start at Init and take a Next step every step.

Liveness is the generic term for "something good is guaranteed to happen", see here for more. <> means "eventually", so Liveness means "eventually my forward-backward position will be my goal". I could extend it to "both of us eventually reach our goals" but I think this is good enough for a demo.

Checking the spec

Four years ago, everybody in TLA+ used the toolbox. Now the community has collectively shifted over to using the VSCode extension.1 VSCode requires we write a configuration file, which I will call walkward.cfg.

SPECIFICATION Spec
PROPERTY Liveness

I then check the model with the VSCode command TLA+: Check model with TLC. Unsurprisingly, it finds an error:

(Screenshot of the TLC error output.)

The reason it fails is "stuttering": I can get one step away from my goal and then just stop moving forever. We say the spec is unfair: it does not guarantee that if progress is always possible, progress will be made. If I want the spec to always make progress, I have to make some of the steps weakly fair.

+ Fairness == WF_vars(Next)

- Spec == Init /\ [][Next]_vars
+ Spec == Init /\ [][Next]_vars /\ Fairness

Now the spec is weakly fair, so someone will always do something. New error:

\* First six steps cut
7: <Move("me")>
pos = [you |-> [lr |-> 0, fb |-> 4], me |-> [lr |-> 1, fb |-> 2]]
8: <Juke("me")>
pos = [you |-> [lr |-> 0, fb |-> 4], me |-> [lr |-> 0, fb |-> 2]]
9: <Juke("me")> (back to state 7)

In this failure, I've successfully gotten past you, and then spend the rest of my life endlessly juking back and forth. The Next step keeps happening, so weak fairness is satisfied. What I actually want is for my Move and my Juke to each be weakly fair, independently of each other.

- Fairness == WF_vars(Next)
+ Fairness == WF_vars(Move(me)) /\ WF_vars(Juke(me))

If my liveness property also specified that you reached your goal, I could instead write \A p \in People: WF_vars(Move(p)) etc. I could also swap the \A with a \E to mean at least one of us is guaranteed to have fair actions, but not necessarily both of us.

New error:

3: <Move("me")>
pos = [you |-> [lr |-> 0, fb |-> 2], me |-> [lr |-> 0, fb |-> 3]]
4: <Juke("you")>
pos = [you |-> [lr |-> 1, fb |-> 2], me |-> [lr |-> 0, fb |-> 3]]
5: <Juke("me")>
pos = [you |-> [lr |-> 1, fb |-> 2], me |-> [lr |-> 1, fb |-> 3]]
6: <Juke("me")>
pos = [you |-> [lr |-> 1, fb |-> 2], me |-> [lr |-> 0, fb |-> 3]]
7: <Juke("you")> (back to state 3)

Now we're getting somewhere! This is the original walkwarding situation we wanted to capture. We're in each others way, then you juke, but before either of us can move you juke, then we both juke back. We can repeat this forever, trapped in a social hell.

Wait, but doesn't WF(Move(me)) guarantee I will eventually move? Yes, but only if a move is permanently available. In this case, it's not permanently available, because every couple of steps it's made temporarily unavailable.

How do I fix this? I can't add a rule saying that we only juke if we're blocked, because the whole point of walkwarding is that we're not coordinated. In the real world, walkwarding can go on for agonizing seconds. What I can do instead is say that Liveness holds as long as Move is strongly fair. Unlike weak fairness, strong fairness guarantees something happens if it keeps becoming possible, even with interruptions.

Liveness == 
+  SF_vars(Move(me)) => 
    <>(pos[me].fb = goal[me])

This makes the spec pass. Even if we weave back and forth for five minutes, as long as we eventually pass each other, I will reach my goal. Note we could also do this by making Move in Fairness strongly fair, which is preferable if we have a lot of different liveness properties to check.

A small exercise for the reader

There is a presumed invariant that is violated. Identify what it is, write it as a property in TLA+, and show the spec violates it. Then fix it.

Answer (in rot13): Gur vainevnag vf "ab gjb crbcyr ner va gur rknpg fnzr ybpngvba". Zbir thnenagrrf guvf ohg Whxr qbrf abg.

More TLA+ Exercises

I've started work on an exercises repo. There's only a handful of specific problems now but I'm planning on adding more over the summer.


  1. learntla is still on the toolbox, but I'm hoping to get it all moved over this summer. 

Write the most clever code you possibly can

8 May 2025 at 15:04

I started writing this early last week but Real Life Stuff happened, and now you're getting the first draft late this week. Warning, unedited thoughts ahead!

New Logic for Programmers release!

v0.9 is out! This is a big release, with a new cover design, several rewritten chapters, online code samples and much more. See the full release notes at the changelog page, and get the book here!

The new cover! It's a lot nicer

Write the cleverest code you possibly can

There are millions of articles online about how programmers should not write "clever" code, and instead write simple, maintainable code that everybody understands. Sometimes the example of "clever" code looks like this (src):

# Python

p=n=1
exec("p*=n*n;n+=1;"*~-int(input()))
print(p%n)

This is code-golfing, the sport of writing the most concise code possible. Obviously you shouldn't run this in production for the same reason you shouldn't eat dinner off a Rembrandt.

Other times the example looks like this:

def is_prime(x):
    if x == 1:
        return False
    return all([x%n != 0 for n in range(2, x)])

This is "clever" because it uses a single list comprehension, as opposed to a "simple" for loop. Yes, "list comprehensions are too clever" is something I've read in one of these articles.

I've also talked to people who think that datatypes besides lists and hashmaps are too clever to use, that most optimizations are too clever to bother with, and even that functions and classes are too clever and code should be a linear script.1 Clever code is anything using features or domain concepts we don't understand. Something that seems unbearably clever to me might be utterly mundane for you, and vice versa.

How do we make something utterly mundane? By using it and working at the boundaries of our skills. Almost everything I'm "good at" comes from banging my head against it more than is healthy. That suggests a really good reason to write clever code: it's an excellent form of purposeful practice. Writing clever code forces us to code outside of our comfort zone, developing our skills as software engineers.

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you [will get excellent debugging practice at exactly the right level required to push your skills as a software engineer] — Brian Kernighan, probably

There are other benefits, too, but first let's kill the elephant in the room:2

Don't commit clever code

I am proposing writing clever code as a means of practice. Work is a job with coworkers who will not appreciate it if your code is too clever. Similarly, don't use too many innovative technologies. Don't put anything in production you are uncomfortable with.

We can still responsibly write clever code at work, though:

  1. Solve a problem in both a simple and a clever way, and then only commit the simple way. This works well for small scale problems where trying the "clever way" only takes a few minutes.
  2. Write our personal tools cleverly. I'm a big believer of the idea that most programmers would benefit from writing more scripts and support code customized to their particular work environment. This is a great place to practice new techniques, languages, etc.
  3. If clever code is absolutely the best way to solve a problem, then commit it with extensive documentation explaining how it works and why it's preferable to simpler solutions. Bonus: this potentially helps the whole team upskill.

Writing clever code...

...teaches simple solutions

Usually, code that's called too clever composes several powerful features together — the "not a single list comprehension or function" people are the exception. Josh Comeau's "don't write clever code" article gives this example of "too clever":

const extractDataFromResponse = (response) => {
  const [Component, props] = response;

  const resultsEntries = Object.entries({ Component, props });
  const assignIfValueTruthy = (o, [k, v]) => (v
    ? { ...o, [k]: v }
    : o
  );

  return resultsEntries.reduce(assignIfValueTruthy, {});
}

What makes this "clever"? I count eight language features composed together: entries, argument unpacking, implicit objects, splats, ternaries, higher-order functions, and reductions. Would code that used only one or two of these features still be "clever"? I don't think so. These features exist for a reason, and oftentimes they make code simpler than not using them.

We can, of course, learn these features one at a time. Writing the clever version (but not committing it) gives us practice with all eight at once and also with how they compose together. That knowledge comes in handy when we want to apply a single one of the ideas.

I've recently had to do a bit of pandas for a project. Whenever I have to do a new analysis, I try to write it as a single chain of transformations, and then as a more balanced set of updates.
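
As a toy illustration (mine, not the actual project's code), here's the same little aggregation written both ways:

import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 4, 2]})

# As a single chain of transformations
chained = (
    df.assign(double=lambda d: d["score"] * 2)
      .groupby("team", as_index=False)["double"]
      .mean()
      .rename(columns={"double": "avg_double"})
)

# As stepwise updates
step = df.copy()
step["double"] = step["score"] * 2
step = step.groupby("team", as_index=False)["double"].mean()
step = step.rename(columns={"double": "avg_double"})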

...helps us master concepts

Even if the composite parts of a "clever" solution aren't by themselves useful, it still makes us better at the overall language, and that's inherently valuable. A few years ago I wrote Crimes with Python's Pattern Matching. It involves writing horrible code like this:

from abc import ABC

class NotIterable(ABC):

    @classmethod
    def __subclasshook__(cls, C):
        # Any class *without* an __iter__ method counts as a virtual subclass
        return not hasattr(C, "__iter__")

def f(x):
    match x:
        case NotIterable():  # an isinstance check, which consults __subclasshook__
            print(f"{x} is not iterable")
        case _:
            print(f"{x} is iterable")

if __name__ == "__main__":
    f(10)          # 10 is not iterable
    f("string")    # string is iterable
    f([1, 2, 3])   # [1, 2, 3] is iterable

This composes Python match statements, which are broadly useful, and abstract base classes, which are incredibly niche. But even if I never use ABCs in real production code, it helped me understand Python's match semantics and Method Resolution Order better.

...prepares us for necessity

Sometimes the clever way is the only way. Maybe we need something faster than the simplest solution. Maybe we are working with constrained tools or frameworks that demand cleverness. Peter Norvig argued that design patterns compensate for missing language features. I'd argue that cleverness is another means of compensating: if our tools don't have an easy way to do something, we need to find a clever way.

You see this a lot in formal methods like TLA+. Need to check a hyperproperty? Cast your state space to a directed graph. Need to compose ten specifications together? Combine refinements with state machines. Most difficult problems have a "clever" solution. The real problem is that clever solutions have a skill floor. If normal use of the tool is at difficulty 3 out of 10, then basic clever solutions are at 5 out of 10, and it's hard to jump those two steps in the moment you need the cleverness.

But if you've practiced with writing overly clever code, you're used to working at a 7 out of 10 level in short bursts, and then you can "drop down" to 5/10. I don't know if that makes too much sense, but I see it happen a lot in practice.

...builds comradery

On a few occasions, after getting a pull request merged, I pulled the reviewer over and said "check out this horrible way of doing the same thing". I find that as long as people know they're not going to be subjected to a clever solution in production, they enjoy seeing it!

Next week's newsletter will probably also be late, after that we should be back to a regular schedule for the rest of the summer.


  1. Mostly grad students outside of CS who have to write scripts to do research, and more than one data scientist. I think it's correlated with using Jupyter.

  2. If I don't put this at the beginning, I'll get a bajillion responses like "your team will hate you" 


Requirements change until they don't

24 April 2025 at 11:00

Recently I got a question on formal methods1: how does it help to mathematically model systems when the system requirements are constantly changing? It doesn't make sense to spend a lot of time proving a design works, and then deliver the product and find out it's not at all what the client needs. As the saying goes, the hard part is "building the right thing", not "building the thing right".

One possible response: "why write tests"? You shouldn't write tests, especially lots of unit tests ahead of time, if you might just throw them all away when the requirements change.

This is a bad response because we all know the difference between writing tests and formal methods: testing is easy and FM is hard. Testing has a low cost and buys moderate correctness; FM has a high(ish) cost and buys high correctness. And when requirements are constantly changing, "high(ish) cost" isn't affordable and "high correctness" isn't worthwhile, because a kinda-okay solution that solves a customer's problem is infinitely better than a solid solution that doesn't.

But eventually you get something that solves the problem, and what then?

Most of us don't work for Google, we can't axe features and products on a whim. If the client is happy with your solution, you are expected to support it. It should work when your customers run into new edge cases, or migrate all their computers to the next OS version, or expand into a market with shoddy internet. It should work when 10x as many customers are using 10x as many features. It should work when you add new features that come into conflict.

And just as importantly, it should never stop solving their problem. Canonical example: your feature involves processing requested tasks synchronously. At scale, this doesn't work, so to improve latency you make it asynchronous. Now it's eventually consistent, but your customers were depending on it being always consistent. Now it no longer does what they need, and has stopped solving their problems.

Every successful requirement met spawns a new requirement: "keep this working". That requirement is permanent, or close enough to decide our long-term strategy. It takes active investment to keep a feature behaving the same as the world around it changes.

(Is this all a pretentious way of saying "software maintenance is hard"? Maybe!)

Phase changes

In physics there's a concept of a phase transition. To raise the temperature of a gram of liquid water by 1°C, you have to add 4.184 joules of energy.2 This continues until you raise it to 100°C, then it stops. After you've added roughly another two thousand joules to that gram, it suddenly turns into steam. The energy of the system changes continuously but the form, or phase, changes discretely.

Simplified phase diagram of water (from the above link)

Software isn't physics but the idea works as a metaphor. A certain architecture handles a certain level of load, and past that you need a new architecture. Or a bunch of similar features are independently hardcoded until the system becomes too messy to understand, so you remodel the internals into something unified and extendable. Etc, etc, etc. It doesn't have to be a totally discrete phase transition, but there's definitely a "before" and "after" in the system form.

Phase changes tend to lead to more intricacy/complexity in the system, meaning it's likely that a phase change will introduce new bugs into existing behaviors. Take the synchronous vs asynchronous case. A very simple toy model of synchronous updates would be Set(key, val), which updates db[key] to val.3 A model of asynchronous updates would be AsyncSet(key, val, priority), which adds a (key, val, priority, server_time()) tuple to a tasks set; another process then asynchronously pulls a tuple (ordered by highest priority, then earliest time) and calls Set(key, val). (A rough sketch of this model follows the property list below.) Here are some properties the client may need preserved as a requirement:

  • If AsyncSet(key, val, _) is called, then eventually db[key] = val (possibly violated if higher-priority tasks keep coming in)
  • If someone calls AsyncSet(key1, val1, low) and then AsyncSet(key2, val2, low), they should see the first update and then the second (linearizability, possibly violated if the requests go to different servers with different clock times)
  • If someone calls AsyncSet(key, val, _) and immediately reads db[key] they should get val (obviously violated, though the client may accept a slightly weaker property)
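Here's a minimal Python sketch of that toy model, just to make the moving parts concrete. The names mirror the description above, I'm assuming numeric priorities, and a real system would be distributed rather than a single in-process heap:

import heapq, itertools, time

db = {}
tasks = []                     # pending (-priority, arrival_time, tiebreak, key, val)
_tiebreak = itertools.count()  # so heapq never has to compare the values themselves

def Set(key, val):
    db[key] = val

def AsyncSet(key, val, priority):
    # Higher priority number runs first; earlier arrival time breaks ties.
    heapq.heappush(tasks, (-priority, time.monotonic(), next(_tiebreak), key, val))

def worker_step():
    # A separate worker process would loop on this; here it's one step at a time.
    if tasks:
        _, _, _, key, val = heapq.heappop(tasks)
        Set(key, val)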

If the new system doesn't satisfy an existing customer requirement, it's prudent to fix the bug before releasing the new system. The customer doesn't notice or care that your system underwent a phase change. They'll just see that one day your product solves their problems, and the next day it suddenly doesn't.

This is one of the most common applications of formal methods. Both of those systems, and every one of those properties, are formally specifiable in a specification language. We can then automatically check that the new system satisfies the existing properties, and from there do things like automatically generate test suites. This does take a lot of work, so if your requirements are constantly changing, FM may not be worth the investment. But eventually requirements stop changing, and then you're stuck with them forever. That's where models shine.


  1. As always, I'm using formal methods to mean the subdiscipline of formal specification of designs, leaving out the formal verification of code. Mostly because "formal specification" is really awkward to say. 

  2. Also called a "calorie". The US "dietary Calorie" is actually a kilocalorie. 

  3. This is all directly translatable to a TLA+ specification, I'm just describing it in English to avoid paying the syntax tax 

The Halting Problem is a terrible example of NP-Harder

16 April 2025 at 17:39

Short one this time because I have a lot going on this week.

In computation complexity, NP is the class of all decision problems (yes/no) where a potential proof (or "witness") for "yes" can be verified in polynomial time. For example, "does this set of numbers have a subset that sums to zero" is in NP. If the answer is "yes", you can prove it by presenting the subset. We would then verify the witness by 1) checking that all of its numbers are present in the original set (~linear time) and 2) adding them up (also linear).

NP-complete is the class of "hardest possible" NP problems. Subset sum is NP-complete. NP-hard is the set of all problems at least as hard as NP-complete. Notably, NP-hard is not a subset of NP, as it contains problems that are harder than NP-complete. A natural question to ask is "like what?" And the canonical example of "NP-harder" is the halting problem (HALT): does program P halt on input C? As the argument goes, it's undecidable, so obviously not in NP.

I think this is a bad example for two reasons:

  1. All NP requires is that witnesses for "yes" can be verified in polynomial time. It does not require anything for the "no" case! And even though HALT is undecidable, there is a decidable way to verify a "yes": let the witness be "it halts in N steps", then run the program for that many steps and see if it halted by then (see the sketch after this list). To prove HALT is not in NP, you have to show that this verification process grows faster than polynomially. It does (as busy beaver is uncomputable), but this all makes the example needlessly confusing.1

  2. "What's bigger than a dog? THE MOON"

Really (2) bothers me a lot more than (1) because it's just so inelegant. It suggests that NP-complete is the upper bound of "solvable" problems, and after that you're in full-on undecidability. I'd rather show intuitive problems that are harder than NP but not that much harder.

But in looking for a "slightly harder" problem, I ran into an, ah, problem. It seems like the next-hardest class would be EXPTIME, except we don't know for sure that NP != EXPTIME. We know for sure that NP != NEXPTIME, but NEXPTIME doesn't have any intuitive, easily explainable problems. Most "definitely harder than NP" problems require a nontrivial background in theoretical computer science or mathematics to understand.

There is one problem, though, that I find easily explainable. Place a token at the bottom left corner of a grid that extends infinitely up and right, call that point (0, 0). You're given a list of valid displacement moves for the token, like (+1, +0), (-20, +13), (-5, -6), etc, and a target point like (700, 1). You may make any sequence of moves in any order, as long as no move ever puts the token off the grid. Does any sequence of moves bring you to the target?
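To make the rules concrete, here's a naive breadth-first search in Python. It only explores a bounded box, so it's an illustration of the puzzle, not a decision procedure for it:

from collections import deque

def reachable(moves, target, cap=1000):
    # BFS over grid positions, restricted to a cap x cap box. A result of
    # False is therefore not a proof of unreachability; the real decision
    # problem needs much heavier machinery than this.
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        if (x, y) == target:
            return True
        for dx, dy in moves:
            nx, ny = x + dx, y + dy
            # No move may take the token off the grid (or out of our box).
            if 0 <= nx < cap and 0 <= ny < cap and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False

print(reachable([(1, 0), (-20, 13), (-5, -6)], (700, 1)))  # True for this example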

This is PSPACE-complete, I think, which still isn't proven to be harder than NP-complete (though it's widely believed). But what if you increase the number of dimensions of the grid? Past a certain number of dimensions the problem jumps to being EXPSPACE-complete, and then TOWER-complete (grows tetrationally), and then it keeps going. Some readers might recognize this as looking a lot like the Ackermann function, and in fact this problem is ACKERMANN-complete on the number of available dimensions.

A friend wrote a Quanta article about the whole mess, you should read it.

This problem is ludicrously bigger than NP ("Chicago" instead of "The Moon"), but at least it's clearly decidable, easily explainable, and definitely not in NP.


  1. It's less confusing if you're taught the alternate (and original!) definition of NP, "the class of problems solvable in polynomial time by a nondeterministic Turing machine". Then HALT can't be in NP because otherwise runtime would be bounded by an exponential function.

Solving a "Layton Puzzle" with Prolog

8 April 2025 at 18:34

I have a lot in the works for this month's Logic for Programmers release. Among other things, I'm completely rewriting the chapter on Logic Programming Languages.

I originally showcased the paradigm with puzzle solvers, like eight queens or four-coloring. Lots of other demos do this too! It takes creativity and insight for humans to solve them, so a program doing it feels magical. But I'm trying to write a book about practical techniques and I want everything I talk about to be useful. So in v0.9 I'll be replacing these examples with a couple of new programs that might get people thinking that Prolog could help them in their day-to-day work.

On the other hand, for a newsletter, showcasing a puzzle solver is pretty cool. And recently I stumbled into this post by my friend Pablo Meier, where he solves a videogame puzzle with Prolog:1

See description below

Summary for the text-only readers: We have a test with 10 true/false questions (denoted a/b) and four student attempts. Given the scores of the first three students, we have to figure out the fourth student's score.

bbababbabb = 7
baaababaaa = 5
baaabbbaba = 3
bbaaabbaaa = ???

You can see Pablo's solution here, and try it in SWI-prolog here. Pretty cool! But after way too long studying Prolog just to write this dang book chapter, I wanted to see if I could do it more elegantly than him. Code and puzzle spoilers to follow.

(Normally here's where I'd link to a gentler introduction I wrote but I think this is my first time writing about Prolog online? Uh here's a Picat intro instead)

The Program

You can try this all online at SWISH or just jump to my final version here.

:- use_module(library(dif)).    % Sound inequality
:- use_module(library(clpfd)).  % Finite domain constraints

First some imports. dif lets us write dif(A, B), which is true if A and B are not equal. clpfd lets us write A #= B + 1 to say "A is 1 more than B".2

We'll say both the student submission and the key will be lists, where each value is a or b. In Prolog, lowercase identifiers are atoms (like symbols in other languages) and identifiers that start with a capital are variables. Prolog finds values for variables that match equations (unification). The pattern matching is real real good.

% ?- means query
?- L = [a,B,c], [Y|X] = [1,2|L], B + 1 #= 7.

B = 6,
L = [a, 6, c],
X = [2, a, 6, c],
Y = 1

Next, we define score/3 recursively.3

% The student's test score
% score(student answers, answer key, score)
score([], [], 0).
score([A|As], [A|Ks], N) :-
   N #= M + 1, score(As, Ks, M).
score([A|As], [K|Ks], N) :- 
    dif(A, K), score(As, Ks, N).

The first argument is the student's answers, the second is the answer key, and the third is the final score. The base case is the empty test, which has score 0. Otherwise, we take the head values of each list and compare them. If they're the same, we add one to the score, otherwise we keep the same score.

Notice we couldn't write if x then y else z, we instead used pattern matching to effectively express (x && y) || (!x && z). Prolog does have a conditional operator, but it prevents backtracking so what's the point???

A quick break about bidirectionality

One of the coolest things about Prolog: all purely logical predicates are bidirectional. We can use score to check if our expected score is correct:

?- score([a, b, b], [b, b, b], 2).
true

But we can also give it answers and a key and ask it for the score:

?- score([a, b, b], [b, b, b], X).
X = 2

Or we could give it a key and a score and ask "what test answers would have this score?"

?- score(X, [b, b, b], 2).
X = [b, b, _A],
dif(_A,b)
X = [b, _A, b],
dif(_A,b)
X = [_A, b, b],
dif(_A,b)

The different value is written _A because we never told Prolog that the array can only contain a and b. We'll fix this later.

Okay back to the program

Now that we have a way of computing scores, we want to find a possible answer key that matches all of our observations, ie gives everybody the correct scores.

key(Key) :-
    % Figure it out
    score([b, b, a, b, a, b, b, a, b, b], Key, 7),
    score([b, a, a, a, b, a, b, a, a, a], Key, 5),
    score([b, a, a, a, b, b, b, a, b, a], Key, 3).

So far we haven't explicitly said that the Key length matches the student answer lengths. This is implicitly verified by score (both lists need to be empty at the same time) but it's a good idea to explicitly add length(Key, 10) as a clause of key/1. We should also explicitly say that every element of Key is either a or b.4 Now we could write a second predicate saying Key had the right 'type':

keytype([]).
keytype([K|Ks]) :- member(K, [a, b]), keytype(Ks).

But "generating lists that match a constraint" is a thing that comes up often enough that we don't want to write a separate predicate for each constraint! So after some digging, I found a more elegant solution: maplist. Let L=[l1, l2]. Then maplist(p, L) is equivalent to the clause p(l1), p(l2). It also accepts partial predicates: maplist(p(x), L) is equivalent to p(x, l1), p(x, l2). So we could write5

contains(L, X) :- member(X, L).

key(Key) :-
    length(Key, 10),
    maplist(contains([a,b]), Key),
    % the score stuff

Now, let's query for the Key:

?- key(Key).
Key = [a, b, a, b, a, a, b, a, a, b]
Key = [b, b, a, b, a, a, a, a, a, b]
Key = [b, b, a, b, a, a, b, b, a, b]
Key = [b, b, b, b, a, a, b, a, a, b]

So there are actually four different keys that all explain our data. Does this mean the puzzle is broken and has multiple different answers?

Nope

The puzzle wasn't to find out what the answer key was, the point was to find the fourth student's score. And if we query for it, we see all four solutions give him the same score:

?- key(Key), score([b, b, a, a, a, b, b, a, a, a], Key, X).
X = 6
X = 6
X = 6
X = 6

Huh! I really like it when puzzles look like they're broken, but every "alternate" solution still gives the same puzzle answer.

Total program length: 15 lines of code, compared to the original's 80 lines. Suck it, Pablo.

(Incidentally, you can get all of the answers at once by writing findall(X, (key(Key), score($answer-array, Key, X)), L).)

I still don't like puzzles for teaching

The actual examples I'm using in the book are "analyzing a version control commit graph" and "planning a sequence of infrastructure changes", which are somewhat more likely to occur at work than needing to solve a puzzle. You'll see them in the next release!


  1. I found it because he wrote Gamer Games for Lite Gamers as a response to my Gamer Games for Non-Gamers

  2. These are better versions of the core Prolog expressions \+ (A = B) and A is B + 1, because they can defer unification

  3. Prolog-descendants have a convention of writing the arity of the function after its name, so score/3 means "score has three parameters". I think they do this because you can overload predicates with multiple different arities. Also Joe Armstrong used Prolog for prototyping, so Erlang and Elixir follow the same convention. 

  4. It still gets the right answers without this type restriction, but I had no idea it did until I checked for myself. Probably better not to rely on this! 

  5. We could make this even more compact by using a lambda function. First import module yall, then write maplist([X]>>member(X, [a,b]), Key). But (1) it's not a shorter program because you replace the extra definition with an extra module import, and (2) yall is SWI-Prolog specific and not an ISO-standard prolog module. Using contains is more portable. 

Betteridge's Law of Software Engineering Specialness

26 March 2025 at 18:48

Logic for Programmers v0.8 now out!

The new release has minor changes: new formatting for notes and a better introduction to predicates. I would have rolled it all into v0.9 next month but I like the monthly cadence. Get it here!

Betteridge's Law of Software Engineering Specialness

In There is No Automatic Reset in Engineering, Tim Ottinger asks:

Do the other people have to live with January 2013 for the rest of their lives? Or is it only engineering that has to deal with every dirty hack since the beginning of the organization?

Betteridge's Law of Headlines says that if a journalism headline ends with a question mark, the answer is probably "no". I propose a similar law relating to software engineering specialness:1

If someone asks if some aspect of software development is truly unique to just software development, the answer is probably "no".

Take the idea that "in software, hacks are forever." My favorite example of this comes from a different profession. The Dewey Decimal System hierarchically categorizes books by discipline. For example, Covered Bridges of Pennsylvania has Dewey number 624.37. 6-- is the technology discipline, 62- is engineering, 624 is civil engineering, and 624.3 is "special types of bridges". I have no idea what the last 0.07 means, but you get the picture.

Now if you look at the 6-- "technology" breakdown, you'll see that there's no "software" subdiscipline. This is because Dewey preallocated the whole technology block in 1876; new topics were instead added to the 00- "general-knowledge" catch-all. Eventually 005 was assigned to "software development", meaning The C Programming Language lives at 005.133.

Incidentally, another late addition to the general knowledge block is 001.9: "controversial knowledge".

And that's why my hometown library shelved the C++ books right next to The Mothman Prophecies.

How's that for technical debt?

If anything, fixing hacks in software is significantly easier than in other fields. This came up when I was interviewing classic engineers. Kludges happened all the time, but "refactoring" them out is expensive. Need to house a machine that's just two inches taller than the room? Guess what, you're cutting a hole in the ceiling.

(Even if we restrict the question to other departments in a software company, we can find kludges that are horrible to undo. I once worked for a company which landed an early contract by adding a bespoke support agreement for that one customer. That plagued them for years afterward.)

That's not to say that there aren't things that are different about software vs other fields!2 But I think that most of the time, when we say "software development is the only profession that deals with XYZ", it's only because we're ignorant of how those other professions work.


Short newsletter because I'm way behind on writing my April Cools. If you're interested in April Cools, you should try it out! I make it way harder on myself than it actually needs to be— everybody else who participates finds it pretty chill.


  1. Ottinger caveats it with "engineering, software or otherwise", so I think he knows that other branches of engineering, at least, have kludges. 

  2. The "software is different" idea that I'm most sympathetic to is that in software, the tools we use and the products we create are made from the same material. That's unusual at least in classic engineering. Then again, plenty of machinists have made their own lathes and mills! 

Verification-First Development

18 March 2025 at 16:22

A while back I argued on the Blue Site1 that "test-first development" (TFD) was different than "test-driven development" (TDD). The former is "write tests before you write code", the latter is a paradigm, culture, and collection of norms that's based on TFD. More broadly, TFD is a special case of Verification-First Development and TDD is not.

VFD: before writing code, put in place some means of verifying that the code is correct, or at least have an idea of what you'll do.

"Verifying" could mean writing tests, or figuring out how to encode invariants in types, or adding contracts, or making a formal model, or writing a separate script that checks the output of the program. Just have something appropriate in place that you can run as you go building the code. Ideally, we'd have verification in place for every interesting property, but that's rarely possible in practice.

Oftentimes we can't make the verification until the code is partially complete. In that case it still helps to figure out the verification we'll write later. The point is to have a plan and follow it promptly.

I'm using "code" as a standin for anything we programmers make, not just software programs. When using constraint solvers, I try to find representative problems I know the answers to. When writing formal specifications, I figure out the system's properties before the design that satisfies those properties. There's probably equivalents in security and other topics, too.

The Benefits of VFD

  1. Doing verification before coding makes it less likely we'll skip verification entirely. It's the professional equivalent of "No TV until you do your homework."
  2. It's easier to make sure a verifier works properly if we start by running it on code we know doesn't pass it. Bebugging working code takes more discipline.
  3. We can run checks earlier in the development process. It's better to realize that our code is broken five minutes after we broke it rather than two hours after.

That's it, those are the benefits of verification-first development. Those are also big benefits for relatively little investment. Specializations of VFD like test-first development can have more benefits, but also more drawbacks.

The drawbacks of VFD

  1. It slows us down. I know lots of people say that "no actually it makes you go faster in the long run," but that's the long run. Sometimes we do marathons, sometimes we sprint.
  2. Verification gets in the way of exploratory coding, where we don't know what exactly we want or how exactly to do something.
  3. Any specific form of verification exerts a pressure on our code to make it easier to verify with that method. For example, if we're mostly verifying via type invariants, we need to figure out how to express those things in our language's type system, which may not be suited for the specific invariants we need; see the sketch after this list.2
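For example, here's a sketch (not a recommendation) of encoding a "between 0 and 100" invariant as a type. The pressure is that every caller now has to construct and pass this type around:

from dataclasses import dataclass

@dataclass(frozen=True)
class Percentage:
    # Invariant: value is always between 0 and 100, checked at construction.
    value: float

    def __post_init__(self):
        if not 0 <= self.value <= 100:
            raise ValueError(f"{self.value} is not a valid percentage")

def apply_discount(price: float, discount: Percentage) -> float:
    # Callers are pressured to build a Percentage up front, so this
    # function never has to re-validate the range.
    return price * (1 - discount.value / 100)

print(apply_discount(80.0, Percentage(25)))  # 60.0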

Whether "pressure" is a real drawback is incredibly controversial

If I had to summarize what makes "test-driven development" different from VFD:3

  1. The form of verification should specifically be tests, and unit tests at that
  2. Testing pressure is invariably good. "Making your code easier to unit test" is the same as "making your code better".

This is something all of the various "drivens"— TDD, Type Driven Development, Design by Contract— have in common: the idea that the purpose of the paradigm is to exert pressure. Lots of TDD experts claim that "having a good test suite" is only a secondary benefit of TDD and the real benefit is how it improves code quality.4

Whether they're right or not is not something I want to argue: I've seen these approaches all improve my code structure, but also sometimes worsen it. Regardless, I consider pressure a drawback to VFD in general, for a somewhat idiosyncratic reason. If it weren't for pressure, VFD would be wholly independent of the code itself. It would just be about verification, and our decisions would exclusively be about how we want to verify. But the design pressure means that our means of verification affects the system we're checking. What if these conflict in some way?

VFD is a technique, not a paradigm

One of the main differences between "techniques" and "paradigms" is that paradigms don't play well with each other. If you tried to do both "proper" Test-Driven Development and "proper" Cleanroom, your head would explode. Whereas VFD being a "technique" means it works well with other techniques and even with many full paradigms.

It also doesn't take a whole lot of practice to start using. Using it well does take practice, both in thinking of verifications and in the particular verification method involved, but we can use it poorly and still benefit.


  1. LinkedIn, what did you think I meant? 

  2. This bit me in the butt when making my own sphinx extensions. The official guides do things in a highly dynamic way that Mypy can't statically check. I had to do things in a completely different way. Ended up being better though! 

  3. Someone's going to yell at me that I completely missed the point of TDD, which is XYZ. Well guess what, someone else already yelled at me that only dumb idiot babies think XYZ is important in TDD. Put in whatever you want for XYZ. 

  4. Another thing that weirdly all of the paradigms claim: that they lead to better documentation. I can see the argument, I just find it strange that every single one makes this claim! 

New Blog Post: "A Perplexing Javascript Parsing Puzzle"

12 March 2025 at 14:49

I know I said we'd be back to normal newsletters this week and in fact had 80% of one already written.

Then I unearthed something that was better left buried.

Blog post here, Patreon notes here (Mostly an explanation of how I found this horror in the first place). Next week I'll send what was supposed to be this week's piece.

(PS: April Cools in three weeks!)

Five Kinds of Nondeterminism

19 February 2025 at 19:37

No newsletter next week, I'm teaching a TLA+ workshop.

Speaking of which: I spend a lot of time thinking about formal methods (and TLA+ specifically) because it's the source of almost all my revenue. But I don't share most of the details because 90% of my readers don't use FM and never will. I think it's more interesting to talk about ideas from FM that would be useful to people outside that field. For example, the idea of "property strength" translates to the idea that some tests are stronger than others.

Another possible export is how FM approaches nondeterminism. A nondeterministic algorithm is one that, from the same starting conditions, has multiple possible outputs. This is nondeterministic:

# Pseudocode

def f() {
    return rand()+1;
}

When specifying systems, I may not encounter nondeterminism more often than in real systems, but I am definitely more aware of its presence. Modeling nondeterminism is a core part of formal specification. I mentally categorize nondeterminism into five buckets. Caveat, this is specifically about nondeterminism from the perspective of system modeling, not computer science as a whole. If I tried to include stuff on NFAs and amb operations this would be twice as long.1

1. True Randomness

Programs that literally make calls to a random function and then use the results. This is the simplest type of nondeterminism and one of the most ubiquitous.

Most of the time, random isn't truly nondeterministic: computer randomness is usually pseudorandom, meaning we seed a deterministic algorithm that behaves "randomly-enough" for some use. You could "lift" a nondeterministic random function into a deterministic one by adding a fixed seed to the starting state.

# Python

from random import random, seed
def f(x):
    seed(x)
    return random()

>>> f(3)
0.23796462709189137
>>> f(3)
0.23796462709189137

Often we don't do this because the point of randomness is to provide nondeterminism! We deliberately abstract out the starting state of the seed from our program, because it's easier to think about it as locally nondeterministic.

(There's also "true" randomness, like using thermal noise as an entropy source, which I think are mainly used for cryptography and seeding PRNGs.)

Most formal specification languages don't deal with randomness (though some deal with probability more broadly). Instead, we treat it as a nondeterministic choice:

# software
if rand > 0.001 then return a else crash

# specification
either return a or crash

This is because we're looking at worst-case scenarios, so it doesn't matter if crash happens 50% of the time or 0.0001% of the time, it's still possible.

2. Concurrency

# Pseudocode
global x = 1, y = 0;

def thread1() {
   x++;
   x++;
   x++;
}

def thread2() {
    y := x;
}

If thread1() and thread2() run sequentially, then (assuming the sequence is fixed) the final value of y is deterministic. If the two functions are started and run simultaneously, then depending on when thread2 executes y can be 1, 2, 3, or 4. Both functions are locally sequential, but running them concurrently leads to global nondeterminism.

Concurrency is arguably the most dramatic source of nondeterminism. Small amounts of concurrency lead to huge explosions in the state space. We have words for the specific kinds of nondeterminism caused by concurrency, like "race condition" and "dirty write". Often we think about it as a separate topic from nondeterminism. To some extent it "overshadows" the other kinds: I have a much easier time teaching students about concurrency in models than nondeterminism in models.

Many formal specification languages have special syntax/machinery for the concurrent aspects of a system, and generic syntax for other kinds of nondeterminism. In P that's choose. Others don't special-case concurrency, instead representing it as nondeterministic choices by a global coordinator. This is more flexible but also more inconvenient, as you have to implement process-local sequencing code yourself.

3. User Input

One of the most famous and influential programming books is The C Programming Language by Kernighan and Ritchie. The first example of a nondeterministic program appears on page 14:

Picture of the book page. Code reproduced below.

For the newsletter readers who get text only emails,2 here's the program:

#include <stdio.h>
/* copy input to output; 1st version */
main()
{
    int c;
    c = getchar();
    while (c != EOF) {
        putchar(c);
        c = getchar();
    }
}

Yup, that's nondeterministic. Because the user can enter any string, any call of main() could have any output, meaning the number of possible outcomes is infinite.

Okay that seems a little cheap, and I think it's because we tend to think of determinism in terms of how the user experiences the program. Yes, main() has an infinite number of user inputs, but for each input the user will experience only one possible output. It starts to feel more nondeterministic when modeling a long-standing system that's reacting to user input, for example a server that runs a script whenever the user uploads a file. This can be modeled with nondeterminism and concurrency: We have one execution that's the system, and one nondeterministic execution that represents the effects of our user.

(One intrusive thought I sometimes have: any "yes/no" dialogue actually has three outcomes: yes, no, or the user getting up and walking away without picking a choice, permanently stalling the execution.)

4. External forces

The more general version of "user input": anything where either 1) some part of the execution outcome depends on retrieving external information, or 2) the external world can change some state outside of your system. I call the distinction between internal and external components of the system the world and the machine. Simple examples: code that at some point reads an external temperature sensor. Unrelated code running on a system which quits programs if it gets too hot. API requests to a third party vendor. Code processing files but users can delete files before the script gets to them.

Like with PRNGs, some of these cases don't have to be nondeterministic; we can argue that "the temperature" should be a virtual input into the function. Like with PRNGs, we treat it as nondeterministic because it's useful to think in that way. Also, what if the temperature changes between starting a function and reading it?

External forces are also a source of nondeterminism as uncertainty. Measurements in the real world often come with errors, so repeating a measurement twice can give two different answers. Sometimes operations fail for no discernible reason, or for a non-programmatic reason (like something physically blocking the sensor).

All of these situations can be modeled in the same way as user input: a concurrent execution making nondeterministic choices.

5. Abstraction

This is where nondeterminism in system models and in "real software" differ the most. I said earlier that pseudorandomness is arguably deterministic, but we abstract it into nondeterminism. More generally, nondeterminism hides implementation details of deterministic processes.

In one consulting project, we had a machine that received a message, parsed a lot of data from the message, went into a complicated workflow, and then entered one of three states. The final state was totally deterministic on the content of the message, but the actual process of determining that final state took tons and tons of code. None of that mattered at the scope we were modeling, so we abstracted it all away: "on receiving message, nondeterministically enter state A, B, or C."

Doing this makes the system easier to model. It also makes the model more sensitive to possible errors. What if the workflow is bugged and sends us to the wrong state? That's already covered by the nondeterministic choice! Nondeterministic abstraction gives us the potential to pick the worst-case scenario for our system, so we can prove it's robust even under those conditions.

I know I beat the "nondeterminism as abstraction" drum a whole lot, but that's because it's the insight from formal methods I personally value the most: nondeterminism is a powerful tool for simplifying reasoning about things. You can see it in how I model users and external forces: complex realities black-boxed and simplified into nondeterministic forces on the system.


Anyway, I hope this collection of ideas I got from formal methods is useful to my broader readership. Lemme know if it somehow helps you out!


  1. I realized after writing this that I already wrote an essay about nondeterminism in formal specification just under a year ago. I hope this one covers enough new ground to be interesting! 

  2. There is a surprising number of you. 

Are Efficiency and Horizontal Scalability at odds?

12 February 2025 at 18:26

Sorry for missing the newsletter last week! I started writing on Monday as normal, and by Wednesday the piece (about the hierarchy of controls) was 2000 words and not close to done. So now it'll be a blog post sometime later this month.

I also just released a new version of Logic for Programmers! 0.7 adds a bunch of new content (type invariants, modeling access policies, rewrites of the first chapters) but more importantly has new fonts that are more legible than the old ones. Go check it out!

For this week's newsletter I want to brainstorm an idea I've been noodling over for a while. Say we have a computational task, like running a simulation or searching a very large graph, and it's taking too long to complete on a computer. There's generally three things that we can do to make it faster:

  1. Buy a faster computer ("vertical scaling")
  2. Modify the software to use the computer's resources better ("efficiency")
  3. Modify the software to use multiple computers ("horizontal scaling")

(Splitting single-threaded software across multiple threads/processes is sort of a blend of (2) and (3).)

The big benefit of (1) is that we (usually) don't have to make any changes to the software to get a speedup. The downside is that for the past couple of decades computers haven't gotten much faster, except in ways that require recoding (like GPUs and multicore). This means we rely on (2) and (3), and we can do both to a point. I've noticed, though, that horizontal scaling seems to conflict with efficiency. Software optimized to scale well tends to be worse for the N=1 case than software optimized to, um, be optimized.

Are there reasons to expect this? It seems reasonable that design goals of software are generally in conflict, purely because exclusively optimizing for one property means making decisions that impede other properties. But is there something in the nature of "efficiency" and "horizontal scalability" that make them especially disjoint?

This isn't me trying to explain a fully coherent idea, more me trying to figure this all out to myself. Also I'm probably getting some hardware stuff wrong

Amdahl's Law

According to Amdahl's Law, the maximum speedup by parallelization is constrained by the proportion of the work that can be parallelized. If 80% of algorithm X is parallelizable, the maximum speedup from horizontal scaling is 5x. If algorithm Y is 25% parallelizable, the maximum speedup is only 1.3x.

If you need horizontal scalability, you want to use algorithm X, even if Y is naturally 3x faster. But if Y was 4x faster, you'd prefer it to X. Maximal scalability means finding the optimal balance between baseline speed and parallelizability. Maximal efficiency means just optimizing baseline speed.
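As a sanity check, here's that arithmetic as a small Python helper. The function and its interface are mine, not from any library:

def max_speedup(parallel_fraction, n=float("inf")):
    # Amdahl's Law: overall speedup with n workers when only part of the
    # work parallelizes. n defaults to "infinitely many workers", which
    # gives the ceiling 1 / (1 - parallel_fraction).
    return 1 / ((1 - parallel_fraction) + parallel_fraction / n)

print(max_speedup(0.80))      # 5.0   (algorithm X's ceiling)
print(max_speedup(0.25))      # ~1.33 (algorithm Y's ceiling)
print(max_speedup(0.80, 8))   # ~3.33 with a more realistic 8 workers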

Coordination Overhead

Distributed algorithms require more coordination. To add a list of numbers in parallel via fork-join, we'd do something like this:

  1. Split the list into N sublists
  2. Fork a new thread/process for sublist
  3. Wait for each thread/process to finish
  4. Add the sums together.

(1), (2), and (3) all add overhead to the algorithm. At the very least, it's extra lines of code to execute, but it can also mean inter-process communication or network hops. Distribution also means you have fewer natural correctness guarantees, so you need more administrative overhead to avoid race conditions.
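For concreteness, here's a minimal fork-join sketch in Python; the structure is illustrative, not tuned for performance:

from concurrent.futures import ProcessPoolExecutor

def parallel_sum(numbers, n_workers=4):
    # 1. Split the list into N sublists
    chunk = max(1, (len(numbers) + n_workers - 1) // n_workers)
    sublists = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    # 2. Fork a worker per sublist, and 3. wait for each one to finish
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, sublists))
    # 4. Add the partial sums together
    return sum(partial_sums)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # 499999500000

Every line outside the two calls to sum is coordination overhead that the single-threaded version simply doesn't have.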

Real world example: Historically CPython has had a "global interpreter lock" (GIL). In multithreaded code, only one thread could execute Python code at a time (others could execute C code). The newest version supports disabling the GIL, which comes at a 40% overhead for single-threaded programs. Supposedly the difference is because the specializing adaptive interpreter optimization isn't thread-safe yet. The Python team is hoping to get it down to "only" 10%.

Scaling loses shared resources

I'd say that intra-machine scaling (multiple threads/processes) feels qualitatively different than inter-machine scaling. Part of that is that intra-machine scaling is "capped" while inter-machine is not. But there's also a difference in what assumptions you can make about shared resources. Starting from the baseline of single-threaded program:

  1. Threads have a much harder time sharing CPU caches (you have to manually mess with affinities)
  2. Processes have a much harder time sharing RAM (I think you have to use mmap?)
  3. Machines can't share cache, RAM, or disk, period.

It's a lot easier to solve a problem when the whole thing fits in RAM. But if you split a 50 GB problem across three machines, it doesn't fit in RAM by default, even if the machines have 64 GB each. Scaling also means that separate machines can't reuse resources like database connections.

Efficiency comes from limits

I think the two previous points tie together in the idea that maximal efficiency comes from being able to make assumptions about the system. If we know the exact sequence of computations, we can aim to minimize cache misses. If we don't have to worry about thread-safety, tracking references is dramatically simpler. If we have all of the data in a single database, our query planner has more room to work with. At various tiers of scaling these assumptions are no longer guaranteed and we lose the corresponding optimizations.

Sometimes these assumptions are implicit and crop up in odd places. Like if you're working at a scale where you need multiple synced databases, you might want to use UUIDs instead of numbers for keys. But then you lose the assumption "recently inserted rows are close together in the index", which I've read can lead to significant slowdowns.

This suggests that if you can find a limit somewhere else, you can get both high horizontal scaling and high efficiency. Supposedly the TigerBeetle database has both, but that could be because they limit all records to accounts and transfers. This means every record fits in exactly 128 bytes. [A TigerBeetle engineer reached out to tell me that they do not horizontally scale compute, they distribute across multiple nodes for redundancy. "You can't make it faster by adding more machines."]

Does this mean that "assumptions" could be both "assumptions about the computing environment" and "assumptions about the problem"? In the famous essay Scalability! But at what COST, Frank McSherry shows that his single-threaded laptop could outperform 128-node "big data systems" on PageRank and graph connectivity (via label propagation). Afterwards, he discusses how a different algorithm solves graph connectivity even faster:

[Union find] is more lines of code than label propagation, but it is 10x faster and 100x less embarrassing. … The union-find algorithm is fundamentally incompatible with the graph computation approaches Giraph, GraphLab, and GraphX put forward (the so-called “think like a vertex” model).

The interesting thing to me is that his alternative makes more "assumptions" than what he's comparing to. He can "assume" a fixed goal and optimize the code for that goal. The "big data systems" are trying to be general purpose compute platforms and have to pick a model that supports the widest range of possible problems.

A few years back I wrote clever vs insightful code, I think what I'm trying to say here is that efficiency comes from having insight into your problem and environment.

(Last thought to shove in here: to exploit assumptions, you need control. Carefully arranging your data to fit in L1 doesn't matter if your programming language doesn't let you control where things are stored!)

Is there a cultural aspect?

Maybe there's also a cultural element to this conflict. What if the engineers interested in "efficiency" are different from the engineers interested in "horizontal scaling"?

At my first job the data scientists set up a Hadoop cluster for their relatively small dataset, only a few dozen gigabytes or so. One of the senior software engineers saw this and said "big data is stupid." To prove it, he took one of their example queries, wrote a script in Go to compute the same thing, and optimized it to run faster on his machine.

At the time I was like "yeah, you're right, big data IS stupid!" But I think now that we both missed something obvious: with the "scalable" solution, the data scientists didn't have to write an optimized script for every single query. Optimizing code is hard, adding more machines is easy!

The highest-tier of horizontal scaling is usually something large businesses want, and large businesses like problems that can be solved purely with money. Maximizing efficiency requires a lot of knowledge-intensive human labour, so is less appealing as an investment. Then again, I've seen a lot of work on making the scalable systems more efficient, such as evenly balancing heterogeneous workloads. Maybe in the largest systems intra-machine efficiency is just too small-scale a problem.

I'm not sure where this fits in but scaling a volume of tasks conflicts less than scaling individual tasks

If you have 1,000 machines and need to crunch one big graph, you probably want the most scalable algorithm. If you instead have 50,000 small graphs, you probably want the most efficient algorithm, which you then run on all 1,000 machines. When we call a problem embarrassingly parallel, we usually mean it's easy to horizontally scale. But it's also one that's easy to make more efficient, because local optimizations don't affect the scaling!


Okay that's enough brainstorming for one week.

Blog Rec

Whenever I think about optimization as a skill, the first article that comes to mind is matklad's Push Ifs Up And Fors Down. I'd never have considered on my own that inlining loops into functions could be such a huge performance win. The blog has a lot of other posts on the nuts-and-bolts of systems languages, optimization, and concurrency.

What hard thing does your tech make easy?

29 January 2025 at 18:09

I occasionally receive emails asking me to look at the writer's new language/library/tool. Sometimes it's in an area I know well, like formal methods. Other times, I'm a complete stranger to the field. Regardless, I'm generally happy to check it out.

When starting out, this is the biggest question I'm looking to answer:

What does this technology make easy that's normally hard?

What justifies me learning and migrating to a new thing as opposed to fighting through my problems with the tools I already know? The new thing has to have some sort of value proposition, which could be something like "better performance" or "more secure". The most universal value and the most direct to show is "takes less time and mental effort to do something". I can't accurately judge two benchmarks, but I can see two demos or code samples and compare which one feels easier to me.

Examples

Functional programming

What drew me originally to functional programming was higher order functions.

# Without HOFs

out = []
for x in input {
  if test(x) {
    out.append(x)
 }
}

# With HOFs

filter(test, input)

We can also compare the easiness of various tasks between examples within the same paradigm. If I know FP via Clojure, what could be appealing about Haskell or F#? For one, null safety is a lot easier when I've got option types.

Array Programming

Array programming languages like APL or J make certain classes of computation easier. For example, finding all of the indices where two arrays match. Here it is in Python:

x = [1, 4, 2, 3, 4, 1, 0, 0, 0, 4]
y = [2, 3, 1, 1, 2, 3, 2, 0, 2, 4]

>>> [i for i, (a, b) in enumerate(zip(x, y)) if a == b]
[7, 9]

And here it is in J:

  x =: 1 4 2 3 4 1 0 0 0 4
  y =: 2 3 1 1 2 3 2 0 2 4

  I. x = y
7 9

Not every tool is meant for every programmer, because you might not have any of the problems a tool makes easier. What comes up more often for you: filtering a list or finding all the indices where two lists match? Statistically speaking, functional programming is more useful to you than array programming.

But I have this problem enough to justify learning array programming.

LLMs

I think a lot of the appeal of LLMs is they make a lot of specialist tasks easy for nonspecialists. One thing I recently did was convert some rst list tables to csv tables. Normally I'd have to write some tricky parsing and serialization code to automatically convert between the two. With LLMs, it's just

Convert the following rst list-table into a csv-table: [table]

"Easy" can trump "correct" as a value. The LLM might get some translations wrong, but it's so convenient I'd rather manually review all the translations for errors than write specialized script that is correct 100% of the time.

Let's not take this too far

A college friend once claimed that he cracked the secret of human behavior: humans do whatever makes them happiest. "What about the martyr who dies for their beliefs?" "Well, in their last second of life they get REALLY happy."

We can do the same here, fitting every value proposition into the frame of "easy". CUDA makes it easier to do matrix multiplication. Rust makes it easier to write low-level code without memory bugs. TLA+ makes it easier to find errors in your design. Monads make it easier to sequence computations in a lazy environment. Making everything about "easy" obscures other reasons for adopting new things.

That whole "simple vs easy" thing

Sometimes people think that "simple" is better than "easy", because "simple" is objective and "easy" is subjective. This comes from the famous talk Simple Made Easy. I'm not sure I agree that simple is better or more objective: the speaker claims that polymorphism and typeclasses are "simpler" than conditionals, and I doubt everybody would agree with that.

The problem is that "simple" is used to mean both "not complicated" and "not complex". And everybody agrees that "complicated" and "complex" are different, even if they can't agree what the difference is. This idea should probably expanded be expanded into its own newsletter.

It's also a lot harder to pitch a technology on being "simpler". Simplicity by itself doesn't make a tool better equipped to solve problems. Simplicity can unlock other benefits, like compositionality or tractability, that provide the actual value. And often that value is in the form of "makes some tasks easier".

The Juggler's Curse

22 January 2025 at 18:50

I'm making a more focused effort to juggle this year. Mostly boxes, but classic balls too.1 I've gotten to the point where I can almost consistently do a five-ball cascade, which I thought was the cutoff to being a "good juggler". "Thought" because I now know a "good juggler" is one who can do the five-ball cascade with outside throws.

I know this because I can't do the outside five-ball cascade... yet. But it's something I can see myself eventually mastering, unlike the slightly more difficult trick of the five-ball mess, which is impossible for mere mortals like me.

In theory there is a spectrum of trick difficulties and skill levels. I could place myself on the axis like this:

A crudely-drawn scale with 10 even ticks, I'm between 5 and 6

In practice, there are three tiers:

  1. Toddlers
  2. Good jugglers who practice hard
  3. Genetic freaks and actual wizards

And the graph always, always looks like this:

The same graph, with the top compressed

This is the jugglers curse, and it's a three-parter:

  1. The threshold between you and "good" is the next trick you cannot do.
  2. Everything below that level is trivial. Once you've gotten a trick down, you can never go back to not knowing it, to appreciating how difficult it was to learn in the first place.2
  3. Everything above that level is just "impossible". You don't have the knowledge needed to recognize the different tiers.3

So as you get better, the stuff that was impossible becomes differentiable, and you can see that some of it is possible. And everything you learned becomes trivial. So you're never a good juggler until you learn "just one more hard trick".

The more you know, the more you know you don't know and the less you know you know.

This is supposed to be a software newsletter

A monad is a monoid in the category of endofunctors, what's the problem? (src)

I think this applies to any difficult topic? Most fields don't have the same stark spectral lines as juggling, but there's still tiers of difficulty to techniques, which get compressed the further in either direction they are from your current level.

Like, I'm not good at formal methods. I've written two books on it but I've never mastered a dependently-typed language or a theorem prover. Those are equally hard. And I'm not good at modeling concurrent systems because I don't understand the formal definition of bisimulation and haven't implemented a Raft. Those are also equally hard, in fact exactly as hard as mastering a theorem prover.

At the same time, the skills I've already developed are easy: properly using refinement is exactly as easy as writing a wrapped counter. Then I get surprised when I try to explain strong fairness to someone and they just don't get how □◇(ENABLED〈A〉ᵥ) is obviously different from ◇□(ENABLED 〈A〉ᵥ).

Juggler's curse!

Now I don't actually know if this is actually how everybody experiences expertise or if it's just my particular personality— I was a juggler long before I was a software developer. Then again, I'd argue that lots of people talk about one consequence of the juggler's curse: imposter syndrome. If you constantly think what you know is "trivial" and what you don't know is "impossible", then yeah, you'd start feeling like an imposter at work real quick.

I wonder if part of the cause is that a lot of skills you have to learn are invisible. One of my favorite blog posts ever is In Defense of Blub Studies, which argues that software expertise comes through understanding "boring" topics like "what all of the error messages mean" and "how to use a debugger well". Blub is a critical part of expertise and takes a lot of hard work to learn, but it feels like trivia. So looking back on a skill I mastered, I might think it was "easy" because I'm not including all of the blub that I had to learn, too.

The takeaway, of course, is that the outside five-ball cascade is objectively the cutoff between good jugglers and toddlers.


  1. Rant time: I love cigar box juggling. It's fun, it's creative, it's totally unlike any other kind of juggling. And it's so niche I straight up cannot find anybody in Chicago to practice with. I once went to a juggling convention and was the only person with a cigar box set there. 

  2. This particular part of the juggler's curse is also called the curse of knowledge or "expert blindness". 

  3. This isn't Dunning-Kruger, because DK says that people think they are better than they actually are, and also may not actually be real

What are the Rosettas of formal specification?

15 January 2025 at 17:34

First of all, I just released version 0.6 of Logic for Programmers! You can get it here. Release notes in the footnote.1

I've been thinking about my next project after the book's done. One idea is to do a survey of new formal specification languages. There's been a lot of new ones in the past few years (P, Quint, etc), plus some old ones I haven't critically examined (SPIN, mcrl2). I'm thinking of a brief overview of each, what's interesting about it, and some examples of the corresponding models.

For this I'd want a set of "Rosetta" examples. Rosetta Code is a collection of programming tasks done in different languages. For example, "99 bottles of beer on the wall" in over 300 languages. If I wanted to make a Rosetta Code for specifications of concurrent systems, what examples would I use?

What makes a good Rosetta example?

A good Rosetta example would be simple enough to understand and implement but also showcase the differences between the languages.

A good example of a Rosetta example is leftpad for code verification. Proving leftpad correct is short in whatever verification language you use. But the proofs themselves are different enough that you can compare what it's like to use code contracts vs dependent types, etc.

A bad Rosetta example is "hello world". While it's good for showing how to run a language, it doesn't clearly differentiate languages. Haskell's "hello world" is almost identical to BASIC's "hello world".

Rosetta examples don't have to be flashy, but I want mine to be flashy. Formal specification is niche enough that regardless of my medium, most of my audience hasn't used it and may be skeptical. I always have to be selling. This biases me away from using things like dining philosophers or two-phase commit.

So with that in mind, three ideas:

1. Wrapped Counter

A counter that starts at 1 and counts to N, after which it wraps around to 1 again.
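If it helps to see that concretely, here's a minimal Python sketch of the counter as a state machine (just an illustration, not a spec; the class and names are mine):

# Python: the wrapped counter as a tiny state machine
class Counter:
    def __init__(self, N):
        self.N = N
        self.i = 1          # starts at 1

    def step(self):
        self.i = 1 if self.i == self.N else self.i + 1
        return self.i

c = Counter(3)                         # N=3: a traffic light
print([c.step() for _ in range(7)])    # [2, 3, 1, 2, 3, 1, 2]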

Why it's good

This is a good introductory formal specification: it's a minimal possible stateful system without concurrency or nondeterminism. You can use it to talk about the basic structure of a spec, how a verifier works, etc. It's also a good way of introducing "boring" semantics, like conditionals and arithmetic, and checking if the language does anything unusual with them. Alloy, for example, defaults to 4-bit signed integers, so you run into problems if you set N too high.2

At the same time, wrapped counters are a common building block of complex systems. Lots of things can be represented this way: N=1 is a flag or blinker, N=3 is a traffic light, N=24 is a clock, etc.

The next example is better for showing basic safety and liveness properties, but this will do in a pinch.

2. Threads

A counter starts at 0. N threads each simultaneously try to update the counter. They do this nonatomically: first they read the value of the counter and store that in a thread-local tmp, then they increment tmp, then they set the counter to tmp. The expected behavior is that the final value of the counter will be N.
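For concreteness, here's roughly what that buggy protocol looks like as runnable Python (the sleep is my addition, just to widen the race window so the clobber shows up reliably):

# Python: the same (buggy) protocol
import threading, time

counter = 0
N = 8

def worker():
    global counter
    tmp = counter        # read the counter into a thread-local tmp
    time.sleep(0.001)    # other threads can sneak in here
    tmp += 1             # increment tmp
    counter = tmp        # set the counter to tmp, possibly clobbering another update

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)           # expected 8, but usually less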

Why it's good

The system as described is bugged. If two threads interleave their read and write steps, one thread's update can "clobber" the other's and the counter can go backwards. To my surprise, most people do not see this error. So it's a good showcase of how the language actually finds real bugs, and how it can verify fixes.

As to actual language topics: the spec covers concurrency and tracks process-local state. A good spec language should make it possible to adjust N without having to add any new variables. And it "naturally" introduces safety, liveness, and action properties.

Finally, the thread spec is endlessly adaptable. I've used variations of it to teach refinement, resource starvation, fairness, livelocks, and hyperproperties. Tweak it a bit and you get dining philosophers.

3. Bounded buffer

We have a bounded buffer with maximum length X. We have R reader and W writer processes. Before writing, writers first check if the buffer is full. If full, the writer goes to sleep. Otherwise, the writer wakes up a random sleeping process, then pushes an arbitrary value. Readers work the same way, except they pop from the buffer (and go to sleep if the buffer is empty).

The only way for a sleeping process to wake up is if another process successfully performs a read or write.
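Here's a rough Python simulation of that protocol under a random scheduler (an illustration only: unlike a model checker, a random run can easily miss the deadlock):

# Python: a random-scheduler simulation of the protocol
import random

def deadlocks(X, R, W, steps=100_000):
    buf = []
    awake = {f"r{i}" for i in range(R)} | {f"w{i}" for i in range(W)}
    asleep = set()
    for _ in range(steps):
        if not awake:
            return True                        # every process asleep: deadlock
        p = random.choice(sorted(awake))
        blocked = (len(buf) == X) if p[0] == "w" else (not buf)
        if blocked:
            awake.remove(p); asleep.add(p)     # go to sleep
        else:
            if asleep:                         # wake one random sleeping process
                q = random.choice(sorted(asleep))
                asleep.remove(q); awake.add(q)
            if p[0] == "w":
                buf.append(0)                  # push an arbitrary value
            else:
                buf.pop()
    return False

print(deadlocks(X=1, R=2, W=2))   # 1 < 2*(2+2), so a deadlock is reachable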

Why it's good

This shows process-local nondeterminism (in choosing which sleeping process to wake up), different behavior for different types of processes, and deadlocks: it's possible for every reader and writer to be asleep at the same time.

The beautiful thing about this example: the spec can only deadlock if X < 2*(R+W). This is the kind of bug you'd struggle to debug in real code. And in fact, people did struggle: even when presented with a minimal code sample and told there was a bug, many testing experts couldn't find it. Whereas a formal model of the same code finds the bug in seconds.

If a spec language can model the bounded buffer, then it's good enough for production systems.

On top of that, the bug happens regardless of what writers actually put in the buffer, so you can abstract that all away. This example can demonstrate that you can leave implementation details out of a spec and still find critical errors.

Caveat

This is all with a heavy TLA+ bias. I've modeled all of these systems in TLA+ and it works pretty well for them. That is to say, none of these do things TLA+ is bad at: reachability, subtyping, transitive closures, unbounded spaces, etc. I imagine that as I cover more specification languages I'll find new Rosettas.


    • Exercises are more compact, answers now show name of exercise in title
    • "Conditionals" chapter has new section on nested conditionals
    • "Crash course" chapter significantly rewritten
    • Started migrating to consistently use == for equality and = for definition. Not everything is migrated yet
    • "Beyond Logic" appendix does a slightly better job of covering HOL and constructive logic
    • Addressed various reader feedback
    • Two new exercises

  2. You can change the int size in a model run, so this is more "surprising footgun and inconvenience" than "fundamental limit of the specification language." Something still good to know! 

"Logic for Programmers" Project Update

7 January 2025 at 18:49

Happy new year everyone!

I released the first Logic for Programmers alpha six months ago. There have been four new versions since then, with the November release putting us in beta. Between work and holidays I didn't make much progress in December, but there will be a 0.6 release in the next week or two.

People have asked me if the book will ever be available in print, and my answer to that is "when it's done". To keep "when it's done" from being "never", I'm committing myself to have the book finished by July. That means roughly six more releases between now and the official First Edition. Then I will start looking for a way to get it printed.

The Current State and What Needs to be Done

Right now the book is 26,000 words. For the most part, the structure is set— I don't plan to reorganize the chapters much. But I still need to fix shortcomings identified by the reader feedback. In particular, a few topics need more on real world applications, and the Alloy chapter is pretty weak. There's also a bunch of notes and todos and "fix this"s I need to go over.

I also need to rewrite the introduction and predicate logic chapters. Those haven't changed much since 0.1 and I need to go over them very carefully.

After that comes copyediting.

Ugh, Copyediting

Copyediting means going through the entire book to make word- and sentence-level changes to the flow. An example would be changing

From: I said predicates are just “boolean functions”. That isn’t quite true.
To:   It's easy to think of predicates as just "boolean" functions, but there is a subtle and important difference.

It's a tiny difference but it reads slightly better to me and makes the book slightly better. Now repeat that for all 3000-odd sentences in the book and I'm done with copyediting!

For the first pass, anyway. Copyediting is miserable.

Some of the changes I need to make come from reader feedback, but most will come from going through it line-by-line with a copyeditor. Someone's kindly offered to do some of this for free, but I want to find a professional too. If you know anybody, let me know.

Formatting

The book, if I'm being honest, looks ugly. I'm using the default sphinx/latex combination for layout and typesetting. My thinking is it's not worth making the book pretty until it's worth reading. But I also want the book, when it's eventually printed, to look nice. At the very least it shouldn't have "self-published" vibes.

I've found someone who's been giving me excellent advice on layout and I'm slowly mastering the LaTeX formatting arcana. It's gonna take a few iterations to get things right.

Front cover

Currently the front cover is this:

Front cover

It works but gives "programmer spent ten minutes in Inkscape" vibes. I have a vision in my head for what would be nicer. A few people have recommended using Fiverr. So far the results haven't been that good.

Fixing Epub

Ugh

I thought making an epub version would be kinder for phone reading, but it's such a painful format to develop for. Did you know that epub backlinks work totally different on kindle vs other ereaders? Did you know the only way to test if you got em working right is to load them up in a virtual kindle? The feedback loops are miserable. So I've been treating epub as a second-class citizen for now and only fixing the worst errors (like math not rendering properly), but that'll have to change as the book finalizes.

What comes next?

After 1.0, I get my book an ISBN and figure out how to make print copies. The margin on print is way lower than ebooks, especially if it's on-demand: the net royalties for Amazon direct publishing would be 7 dollars on a 20-dollar book (as opposed to Leanpub's 16 dollars). Would having a print version double the sales? I hope so! Either way, a lot of people have been asking about print version so I want to make that possible.

(I also want to figure out how to give people who already have the ebook a discount on print, but I don't know if that's feasible.)

Then, I dunno, maybe make a talk or a workshop I can pitch to conferences. Once I have that I think I can call LfP complete... at least until the second edition.


Anyway none of that is actually technical so here's a quick fun thing. I spent a good chunk of my break reading the mCRL2 book. mCRL2 defines an "algebra" for "communicating processes". As a very broad explanation, that's defining what it means to "add" and "multiply" two processes. What's interesting is that according to their definition, the algebra follows the distributive law, but only if you multiply on the right. eg

// VALID
(a+b)*c = a*c + b*c

// INVALID
a*(b+c) = a*b + a*c

This is the first time I've ever seen this in practice! Jury's still out on the rest of the language.


Videos and Stuff

  • My DDD Europe talk is now out! What We Know We Don't Know is about empirical software engineering in general, and software engineering research on Domain Driven Design in particular.
  • I was interviewed in the last video on Craft vs Cruft's "Year of Formal Methods". Check it out!

Formally modeling dreidel, the sequel

18 December 2024 at 16:58

Channukah's next week and that means my favorite pastime, complaining about how Dreidel is a bad game. Last year I formally modeled it in PRISM to prove the game's not fun. But because I limited the model to only a small case, I couldn't prove the game was truly bad.

It's time to finish the job.

A flaming dreidel, from https://pixelsmerch.com/featured/flaming-dreidel-ilan-rosen.html

The Story so far

You can read last year's newsletter here, but here are the high-level notes.

The Game of Dreidel

  1. Every player starts with N pieces (usually chocolate coins). This is usually 10-15 pieces per player.
  2. At the beginning of the game, and whenever the pot is empty, every player antes one coin into the pot.
  3. Turns consist of spinning the dreidel. Outcomes are:

    • נ (Nun): nothing happens.
    • ה (He): player takes half the pot, rounded up.
    • ג (Gimmel): player takes the whole pot, everybody antes.
    • ש (Shin): player adds one of their coins to the pot.
  4. If a player ever has zero coins, they are eliminated. Play continues until only one player remains.

If you don't have a dreidel, you can instead use a four-sided die, but for the authentic experience you should wait eight seconds before looking at your roll.
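If you just want a feel for the rules, here's a rough Monte Carlo sketch in Python. It's a plain simulation, not the PRISM model, so it only gives noisy estimates rather than exact probabilities:

# Python: one random game of dreidel
import math, random

def play(num_players=4, coins=10):
    players = [coins] * num_players
    pot, spins, turn = 0, 0, 0

    def ante():
        nonlocal pot
        for i, c in enumerate(players):
            if c > 0:
                players[i] -= 1
                pot += 1

    ante()
    while sum(c > 0 for c in players) > 1:
        if players[turn] > 0:                  # skip eliminated players
            spins += 1
            roll = random.choice("NHGS")       # Nun, He, Gimmel, Shin
            if roll == "H":                    # take half the pot, rounded up
                take = math.ceil(pot / 2)
                pot -= take; players[turn] += take
            elif roll == "G":                  # take the whole pot
                players[turn] += pot; pot = 0
            elif roll == "S":                  # add one coin to the pot
                players[turn] -= 1; pot += 1
            if pot == 0:                       # empty pot: everyone still in antes
                ante()
        turn = (turn + 1) % num_players
    return spins

Averaging play() over many runs gives ballpark numbers; the PRISM model computes the expectations exactly.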

PRISM

PRISM is a probabilistic modeling language, meaning you can encode a system with random chances of doing things and it can answer questions like "on average, how many spins does it take before one player loses" (64, for 4 players/10 coins) and "what's more likely to knock the first player out, shin or ante" (ante is 2.4x more likely). You can see last year's model here.

The problem with PRISM is that it is absurdly inexpressive: it's a thin abstraction for writing giant stochastic matrices and lacks basic affordances like lists or functions. I had to hardcode every possible roll for every player. This meant last year's model had two limits. First, it only handles four players, and I would have to write a new model for three or five players. Second, I made the game end as soon as one player lost:

formula done = (p1=0) | (p2=0) | (p3=0) | (p4=0);

To fix both of these things, I thought I'd have to treat PRISM as a compilation target, writing a program that took a player count and output the corresponding model. But then December got super busy and I ran out of time to write a program. Instead, I stuck with four hardcoded players and extended the old model to run until victory.

The new model

These are all changes to last year's model.

First, instead of running until one player is out of money, we run until three players are out of money.

- formula done = (p1=0) | (p2=0) | (p3=0) | (p4=0);
+ formula done = 
+  ((p1=0) & (p2=0) & (p3=0)) |
+  ((p1=0) & (p2=0) & (p4=0)) |
+  ((p1=0) & (p3=0) & (p4=0)) |
+  ((p2=0) & (p3=0) & (p4=0));

Next, we change the ante formula. Instead of adding four coins to the pot and subtracting a coin from each player, we add one coin for each player left. min(p1, 1) is 1 if player 1 is still in the game, and 0 otherwise.

+ formula ante_left = min(p1, 1) + min(p2, 1) + min(p3, 1) + min(p4, 1);

We also have to make sure anteing doesn't end a player with negative money.

- [ante] (pot = 0) & !done -> (pot'=pot+4) & (p1' = p1-1) & (p2' = p2-1) & (p3' = p3-1) & (p4' = p4-1);
+ [ante] (pot = 0) & !done -> (pot'=pot+ante_left) & (p1' = max(p1-1, 0)) & (p2' = max(p2-1, 0)) & (p3' = max(p3-1, 0)) & (p4' = max(p4-1, 0));

Finally, we have to add logic for a player being "out". Instead of moving to the next player after each turn, we move to the next player still in the game. Also, if someone starts their turn without any coins (f.ex if they just anted their last coin), we just skip their turn.

+ formula p1n = (p2 > 0 ? 2 : p3 > 0 ? 3 : 4);

+ [lost] ((pot != 0) & !done & (turn = 1) & (p1 = 0)) -> (turn' = p1n);
- [spin] ((pot != 0) & !done & (turn = 1)) ->
+ [spin] ((pot != 0) & !done & (turn = 1) & (p1 != 0)) ->
    0.25: (p1' = p1-1) 
           & (pot' = min(pot+1, maxval)) 
-          & (turn' = 2) //shin
+          & (turn' = p1n) //shin

We make similar changes for all of the other players. You can see the final model here.

Querying the model

So now we have a full game of Dreidel that runs until the game ends. And now, finally, we can see the average number of spins a 4 player game will last.

./prism dreidel.prism -const M=10 -pf 'R=? [F done]' 

In English: each player starts with ten coins. R=? means "expected value of the 'reward'", where 'reward' in this case means number of spins. [F done] weights the reward over all behaviors that reach ("Finally") the done state.

Result: 760.5607582661091
Time for model checking: 384.17 seconds.

So there's the number: 760 spins.1 At 8 seconds a spin, that's almost two hours for one game.

…Jesus, look at that runtime. Six minutes to test one query.

PRISM has over a hundred settings that affect model checking, with descriptions like "Pareto curve threshold" and "Use Backwards Pseudo SOR". After looking through them all, I found this perfect combination of configurations that gets the runtime to a more manageable level:

./prism dreidel.prism 
    -const M=10 
    -pf 'R=? [F done]' 
+   -heuristic speed

Result: 760.816255997373
Time for model checking: 13.44 seconds.

Yes, that's a literal "make it faster" flag.

Anyway, that's only the "average" number of spins, weighted across all games. Dreidel has a very long tail. To see how long, we'll use a variation on our query:

const C0; P=? [F <=C0 done]

P=? is the Probability something happens. F <=C0 done means we Finally reach state done in at most C0 steps. By passing in different values of C0 we can get a sense of how long a game takes. Since "steps" includes passes and antes, this will overestimate the length of the game. But antes take time too and it should only "pass" on a player once per player, so this should still be a good metric for game length.

./prism dreidel.prism 
    -const M=10 
    -const C0=1000:1000:5000
    -pf 'const C0; P=? [F <=C0 done]' 
    -heuristic speed

C0      Result
1000    0.6259953274918795
2000    0.9098575028069353
3000    0.9783122218576754
4000    0.994782069562932
5000    0.9987446018004976

A full 10% of games don't finish in 2000 steps, and 2% pass the 3000 step barrier. At 8 seconds a roll/ante, 3000 steps is over six hours.

Dreidel is a bad game.

More fun properties

As a sanity check, let's confirm last year's result, that it takes an average of 64ish spins before one player is out. In that model, we just needed to get the total reward. Now we instead want to get the reward until the first state where any of the players have zero coins. 2

./prism dreidel.prism 
    -const M=10 
    -pf 'R=? [F (p1=0 | p2=0 | p3=0 | p4=0)]' 
    -heuristic speed

Result: 63.71310116083396
Time for model checking: 2.017 seconds.

Yep, looks good. With our new model we can also get the average point where two players are out and two players are left. PRISM's lack of abstraction makes expressing the condition directly a little painful, but we can cheat and look for the first state where ante_left <= 2.3

./prism dreidel.prism 
    -const M=10 
    -pf 'R=? [F (ante_left <= 2)]' 
    -heuristic speed

Result: 181.92839196680023

It takes twice as long to eliminate the second player as it takes to eliminate the first, and the remaining two players have to go for another 600 spins.

Dreidel is a bad game.

The future

There's two things I want to do next with this model. The first is to script up something that can generate the PRISM model for me, so I can easily adjust the number of players to 3 or 5. The second is that PRISM has a filter-query feature I don't understand but I think it could be used for things like "if a player gets 75% of the coins, what's the probability they lose anyway". Otherwise you have to write wonky queries like (P =? [F p1 = 30 & (F p1 = 0)]) / (P =? [F p1 = 0]).4 But I'm out of time again, so this saga will have to conclude next year.

I'm also faced with the terrible revelation that I might be the biggest non-academic user of PRISM.


Logic for Programmers Khanukah Sale

Still going on! You can get LFP for 40% off here from now until the end of Xannukkah (Jan 2).5

I'm in the Raku Advent Calendar!

My piece is called counting up concurrencies. It's about using Raku to do some combinatorics! Read the rest of the blog too, it's great


  1. This is different from the original anti-Dreidel article: Ben got 860 spins. That's the average spins if you round down on He, not up. Rounding up on He leads to a shorter game because it means He can empty the pot, which means more antes, and antes are what knocks most players out. 

  2. PRISM calls this "co-safe LTL reward" and does not explain what that means, nor do most of the papers I found referencing "co-safe LTL". Eventually I found one that defined it as "any property that only uses X, U, F". 

  3. Here's the exact point where I realize I could have defined done as ante_left = 1. Also checking for F (ante_left = 2) gives an expected number of spins as "infinity". I have no idea why. 

  4. 10% chances at 4 players / 10 coins. And it takes a minute even with fast mode enabled. 

  5. This joke was funnier before I made the whole newsletter about Chanukahh. 

Stroustrup's Rule

11 December 2024 at 17:32

Just finished two weeks of workshops and am exhausted, so this one will be light.

Hanuka Sale

Logic for Programmers is on sale until the end of Chanukah! That's Jan 2nd if you're not Jewish. Get it for 40% off here.

Stroustrup's Rule

I first encountered Stroustrup's Rule on this defunct webpage:

One of my favorite insights about syntax design appeared in a retrospective on C++1 by Bjarne Stroustrup:

  • For new features, people insist on LOUD explicit syntax.
  • For established features, people want terse notation.

The blogger gives the example of option types in Rust. Originally, the idea of using option types to store errors was new for programmers, so the syntax for passing an error was very explicit:

let file = match File::open("file.txt") {
    Ok(file) => file,
    Err(err) => return Err(err),
};

Once people were more familiar with it, Rust added the try! macro to reduce boilerplate, and finally the ? operator to streamline error handling further.

I see this as a special case of mental model development: when a feature is new to you, you don't have an internal mental model so need all of the explicit information you can get. Once you're familiar with it, explicit syntax is visual clutter and hinders how quickly you can parse out information.

(One example I like: which is more explicit, user_id or user_identifier? Which do experienced programmers prefer?)

What's interesting is that it's often the same people on both sides of the spectrum. Beginners need explicit syntax, and as they become experts, they prefer terse syntax.

The rule applies to the overall community, too. At the beginning of a language's life, everybody's a beginner. Over time the ratio of experts to beginners changes, and this leads to more focus on "expert-friendly" features, like terser syntax.

This can make it harder for beginners to learn the language. There was a lot of drama in Python over the "walrus" assignment operator:

# Without walrus
val = dict.get(key) # `None` if key absent
if val:
    print(val)


# With walrus
if val := dict.get(key):
    print(val)

Experts supported it because it made code more elegant, teachers and beginners opposed it because it made the language harder to learn. Explicit syntax vs terse notation.

Does this lead to languages bloating over time?

In Teaching

I find that when I teach language workshops I have to actively work against Stroustrup's Rule. The terse notation that's easiest for me to read is bad for beginners, who need the explicit syntax that I find grating.

One good example is type invariants in TLA+. Say you have a set of workers, and each worker has a counter. Here's two ways to say that every worker's counter is a non-negative integer:

\* Bad
\A w \in Workers: counter[w] >= 0

\* Good
counter \in [Workers -> Nat]

The first way literally tests that for every worker, counter[w] is non-negative. The second way tests that the counter mapping as a whole is an element of the appropriate "function set"— all functions between workers and natural numbers.

The function set approach is terser, more elegant, and preferred by TLA+ experts. But I teach the "bad" way because it makes more sense to beginners.


  1. Starts minute 23. 

Hyperproperties

19 November 2024 at 19:34

I wrote about hyperproperties on my blog four years ago, but now an intriguing client problem got me thinking about them again.1

We're using TLA+ to model a system that starts in state A, and under certain complicated conditions P, transitions to state B. The system also has a flag f that, when set, uses a different complicated condition Q to check the transitions. As a quick decision table (from state A):

f P Q state'
F F - A
F T - B
T F F A
T F T B
T T F impossible
T T T B

The interesting bit is the second-to-last row: Q has to be strictly more permissive than P. The client wanted to verify the property that "the system more aggressively transitions when f is set", ie there is no case where the machine transitions only if f is false.

Regular system properties are specified over a single sequence of states (a behavior). Hyperproperties can hold over sets of sequences of states. Here the hyperproperties are:

  1. For any two states X and Y in separate behaviors, if the only difference in variable-state between X and Y is that X.f = TRUE, then whenever Y transitions to B, so does X.
  2. There is at least one such case where X transitions and Y does not.

That's pretty convoluted, which is par for the course with hyperproperties! It makes a little more sense if you have all of the domain knowledge and specifics.

The key thing that makes this a hyperproperty is that you can't just look at individual behaviors to verify it. Imagine if, when f is true, we never transition to state B. Is that a violation of (1)? Not if we never transition when f is false either! To prove a violation, you need to find a behavior where f is false and the state is otherwise the same and we transition to B anyway.

Aside: states in states in states

I dislike how "state" refers to three things:

  1. The high-level "transition state" of a state-machine
  2. A single point in time of a system (the "state space")
  3. The mutable data inside your system's machine.

These are all "close" to each other but just different enough to make conversations confusing. Software is pretty bad about reusing colloquial words like this; don't even get me started on the word "design".

There's a reason we don't talk about hyperproperties

Or three reasons. First of all, hyperproperties make up a vanishingly small percentage of the stuff in a system we care about. We only got to "f makes the system more aggressive" after checking at least a dozen other simpler and more important not-hyper properties.

Second, most formal specification languages can't express hyperproperties, and the ones that can are all academic research projects. Modeling systems is hard enough without a generalized behavior notation!

Third, hyperproperties are astoundingly expensive to check. As an informal estimation, for a state space of size N, regular properties are checked across N individual states and 2-behavior hyperproperties (2-props) are checked across pairs. So for a small state space of just a million states, the 2-prop needs to be checked across a trillion pairs.

These problems don't apply to "hyperproperties" of functions, just systems. Functions have a lot of interesting hyperproperties, there's an easy way to represent them (call the function twice in a test), and quadratic scaling isn't so bad if you're only testing 100 inputs or so. That's why so-called metamorphic testing of functions can be useful.
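As a concrete (entirely made-up) example of the "call it twice" idea:

# Python: a made-up "2-prop" of a function, checked by calling it twice per case
def shipping_cost(weight_kg):
    return 5.0 + 2.5 * weight_kg

def test_monotone_in_weight():
    # monotonicity: a heavier package never costs less to ship
    for lighter, heavier in [(0, 1), (1, 1), (2.5, 10), (10, 100)]:
        assert shipping_cost(lighter) <= shipping_cost(heavier)

test_monotone_in_weight()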

Checking Hyperproperties Anyway

If we do need to check a hyperproperty, there's a few ways we can approach it.

The easiest way is to cheat and find a regular prop that implies the hyperproperty. In the client's case, we can abstract P and Q into pure functions and then test that there's no input where P is true and Q is false. In TLA+, this would look something like

\* TLA+
QLooserThanP ==
  \A i1 \in InputSet1, i2 \in Set2: \* ...
    P(i1, i2, …) => Q(i1, i2, …)

Of course we can't always encapsulate this way, and this can't catch bugs like "we accidentally use P even if f is true". But it gets the job done.

Another way is something I talked about in the original hyperproperty post: lifting specs into hyperspecs. We create a new spec that initializes two copies of our main spec, runs them in parallel, and then compares their behaviors. See the post for an example. Writing a hyperspec keeps us entirely in TLA+ but takes a lot of work and is very expensive to check. Depending on the property we want to check, we can sometimes find simple optimizations.

The last way is something I explored last year: dump the state graph to disk and treat the hyperproperty as a graph property. In this case, the graph property would be something like

Find all graph edges representing an A → B transition. Take all the source nodes of each where f = false. For each such source node, find the corresponding node that's identical except for f = true. That node should be the source of an A → B edge.

Upside is you don't have to make any changes to the original spec. Downside is you have to use another programming language for analysis. Also, analyzing graphs is terrible. But I think this is overall the most robust approach to handling hyperproperties, to be used when "cheating" fails.
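As a toy sketch of that kind of graph check (a made-up state encoding and a tiny hand-written graph, nothing like a real state dump):

# Python: each node is a (machine_state, f) pair; a real dump would carry more variables
edges = {
    (("A", False), ("A", False)),
    (("A", False), ("B", False)),
    (("A", True),  ("B", True)),
}

def violations(edges):
    bad = []
    for src, dst in edges:
        if src == ("A", False) and dst[0] == "B":       # an A -> B edge with f unset
            twin = ("A", True)                          # identical node except f is set
            if not any(s == twin and d[0] == "B" for s, d in edges):
                bad.append(src)
    return bad

print(violations(edges))   # [] -- the f=True twin also has an A -> B edge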


What fascinates me most about this is the four-year gap between "I learned and wrote about hyperproperties" and "I have to deal with hyperproperties in my job." This is one reason learning for the sake of learning can have a lot of long-term benefits.


Blog Rec

This week's rec is Robert Heaton. It's a "general interest" software engineering blog with a focus on math, algorithms, and security. Some of my favorites are PySkyWiFi: completely free, unbelievably stupid wi-fi on long-haul flights and How to pass a coding interview with me. The guy's got breadth.


  1. I do formal methods consulting btw. Hire me! 

Five Unusual Raku Features

12 November 2024 at 20:06

Logic for Programmers is now in Beta!

v0.5 marks the official end of alpha! With the new version, all of the content I wanted to put in the book is now present, and all that's left is copyediting, proofreading, and formatting. Which will probably take as long as it took to actually write the book. You can see the release notes in the footnote.1

And I've got a snazzy new cover:

The logic for programmers cover, a 40x zoom of a bird feather

(I don't actually like the cover that much but it looks official enough until I can pay an actual cover designer.)

"Five" Unusual Raku Features

Last year I started learning Raku, and the sheer bizarreness of the language left me describing it as a language for gremlins. Now that I've used it in anger for over a year, I have a better way of describing it:

Raku is a laboratory for language features.

This is why it has five different models of concurrency and eighteen ways of doing anything else, because the point is to see what happens. It also explains why many of the features interact so strangely and why there's all that odd edge-case behavior. Getting 100 experiments polished and playing nicely with each other is much harder than running 100 experiments; we can sort out the polish after we figure out which ideas are good ones.

So here are "five" Raku experiments you could imagine seeing in another programming language. If you squint.

Junctions

Junctions are "superpositions of possible values". Applying an operation to a junction instead applies it to every value inside the junction.

> 2|10
any(2, 10)

> 2&10 + 3
all(5, 13)

> (1&2) + (10^20)
all(one(11, 21), one(12, 22))

As you can probably tell from the alls and anys, junctions are a feature meant for representing boolean formulas. There's no way to destructure a junction, and the only way to use it is to collapse it to a boolean first.

> (1&2) + (10^20) < 15
all(one(True, False), one(True, False))

# so coerces junctions to booleans
> so (1&2) + (10^20) < 15
True

> so (1&2) + (10^20) > 0
False

> 16 %% (3&5) ?? "fizzbuzz" !! *
*

The real interesting thing for me is how Raku elegantly uses junctions to represent quantifiers. In most languages, you either have the function all(list[T], T -> bool) or the method [T].all(T -> bool), both of which apply the test to every element of the list. In Raku, though, list.all doesn't take anything, it's just a niladic method that turns the list into a junction.

> my $x = <1 2 3>.all
all(1, 2, 3)
> is-prime($x)
all(False, True, True)

This means we can combine junctions. If Raku didn't already have a unique method, we could build it by saying "are all elements equal to exactly one element?"

> so {.all == .one}(<1 2 3 7>)
True

> so {.all == .one}(<1 2 3 7 2>)
False

Whatevers

* is the "whatever" symbol and has a lot of different roles in Raku.2 Some functions and operators have special behavior when passed a *. In a range or sequence, * means "unbound".

> 1..*
1..Inf

> (2,4,8...*)[17]
262144

The main built-in use, though, is that expressions with * are lifted into anonymous functions. This is called "whatever-priming" and produces a WhateverCode, which is indistinguishable from other functions except for the type.

> {$_ + 10}(2)
12

> (* + 10)(2)
12

> (^10).map(* % 2)
(0 1 0 1 0 1 0 1 0 1)

There's actually a bit of weird behavior here: if two whatevers appear in the expression, they become separate positional variables. (2, 30, 4, 50).map(* + *) returns (32, 54). This makes it easy to express a tricky Fibonacci definition but otherwise I don't see how it's better than making each * the same value.

Regardless, priming is useful because so many Raku methods are overloaded to take functions. You get the last element of a list with l[*-1]. This looks like standard negative-index syntax, but what actually happens is that when [] is passed a function, it passes in list length and looks up the result. So if the list has 10 elements, l[*-1] = l[10-1] = l[9], aka the last element. Similarly, l.head(2) is the first two elements of a list, l.head(*-2) is all-but-the-last-two.

We can pass other functions to [], which e.g. makes implementing ring buffers easy.

> my @x = ^10
[0 1 2 3 4 5 6 7 8 9]

> @x[95 % *]--; @x
[0 1 2 3 4 4 6 7 8 9]

Regular Expressions

There are two basic standards for regexes: POSIX regexes and Perl-compatible regexes (PCRE). POSIX regexes are a terrible mess of backslashes and punctuation. PCRE is backwards compatible with POSIX and is a more terrible mess of backslashes and punctuation. Most languages follow the PCRE standard, but Perl 6 breaks backwards compatibility with an entirely new regex syntax.

The most obvious improvement: composability. In most languages you "combine" two regexes by concatenating their strings together, which is terrible for many, many reasons. Raku has the standard "embed another regex" syntax: /< foo >+/ matches one-or-more of the foo regex without foo "leaking" into the top regex.

This already does a lot to make regexes more tractable: you can break a complicated regular expression down into simpler and more legible parts. And in fact this is how Raku supports parsing grammars as a builtin language feature. I've only used grammars once but it was quite helpful.

Since we're breaking backwards compatibility anyway, we can now add lots of small QOLs. There's a value separator modifier: \d+ % ',' matches 1 / 1,2 / 1,1,4 but not 1, or 12. Lookaheads and non-capturing groups aren't nonsense glyphs. r1 && r2 only matches strings that match both r1 and r2. Backtracking can be stopped with :. Whitespace is ignored by default and has to be explicitly enabled in match patterns.

There's more stuff Raku does with actually processing regular expressions, but the regex notation is something that might actually appear in another language someday.

Hyperoperators

This is a small one compared to the other features, but it's also the thing I miss most often in other languages. The most basic form l>>.method is basically equivalent to map, except it also recursively descends into sublists.

> [1, [2, 3], 4]>>.succ
[2 [3 4] 5]

This is more useful than it looks because any function call f(list, *args) can be rewritten in "method form" list.&f(*args), so >>. becomes the generalized mapping operator. You can use it with whatevers, too.

> [1, [2, 3], 4]>>.&(*+1)
[2 [3 4] 5]

Anyway, the more generalized binary hyperoperator l1 << op >> l23 applies op elementwise to the two lists, looping the shorter list until the longer list is exhausted. >>op>> / << op<< are the same except they instead loop until the lhs/rhs list is exhausted. Whew!

> [1, 2, 3, 4, 5] <<+>> [10, 20]
[11 22 13 24 15]

> [1, 2, 3, 4, 5] <<+<< [10, 20]
[11 22]

> [1, 2, 3, 4, 5] >>+>> [10, 20]
[11 22 13 24 15]

# Also works with single values
> [1, 2, 3, 4, 5] <<+>> 10
[11 12 13 14 15]

# Does weird things with nested lists too
> [1, [2, 3], 4, 5] <<+>> [10, 20]
[11 [22 23] 14 25]

Also for some reason the hyperoperators have separate behaviors on two hashes, applying op to either the union, the intersection, or the difference of the hashes.

Anyway it's a super weird (meta)operator but it's also quite useful! It's the closest thing I've seen to J verbs outside an APL. I like using it to run the same formula on multiple possible inputs at once.

(20 * 10 <<->> (21, 24)) <<*>> (10, 100)
(1790 17600)

Incidentally, it's called the hyperoperator because it evaluates all of the operations in parallel. Explicit loops can be parallelized by prefixing them with hyper.

Pair Syntax

I've talked about pairs a little in this newsletter, but the gist is that Raku hashes are composed of a set of pairs key => value. The pair is the basis type, the hash is the collection of pairs. There's also a ton of syntactic sugar for concisely specifying pairs via "colon syntax":

> my $x = 3; :$x
x => 3

> :a<$x>
a => "$x"

> :a($x)
a => 3

> :3a
a => 3

The most important sugars are :key and :!key, which map to key => True and key => False. This is a really elegant way to add flags to a method! Take the definition of match:

method match($pat, 
    :continue(:$c), :pos(:$p), :global(:$g), 
    :overlap(:$ov), :exhaustive(:$ex), 
    :st(:$nd), :rd(:$th), :$nth, :$x --> Match)

Probably should also mention that in a definition, :f(:$foo) defines the parameter $foo but also aliases it to :f, so you can set the flag with :f or :foo. Colon-pairs defined in the signature can be passed in anywhere, or even stuck together:

> "abab".match(/../)
「ab」
> "abab".match(/../, :g)
(「ab」 「ab」)
> "abab".match(/../, :g, :ov)
(「ab」 「ba」 「ab」)

# Out of order stuck together
> "abab".match(:g:ov, /../)
(「ab」 「ba」 「ab」)

So that leads to extremely concise method configuration. Definitely beats match(global=True, overlap=True)!

And for some reason you can place keyword arguments after the function call:

> "abab".match(:g, /../):ov:2ndba

The next-gen lab: Slangs and RakuAST

These are features I have no experience in and certainly are not making their way into other languages, but they really expand the explorable space of new features. Slangs are modifications to the Raku syntax. This can be used for things like modifying loop syntax, changing identifiers, or adding actors or DNA sequences to the base language.

I barely understand RakuAST. I think the idea is that all Raku expressions can be parsed as an AST from inside Raku itself.

> Q/my $x; $x++/.AST
RakuAST::StatementList.new(
  RakuAST::Statement::Expression.new(
    expression => RakuAST::VarDeclaration::Simple.new(
      sigil       => "\$",
      desigilname => RakuAST::Name.from-identifier("x")
    )
  ),
  RakuAST::Statement::Expression.new(
    expression => RakuAST::ApplyPostfix.new(
      operand => RakuAST::Var::Lexical.new("\$x"),
      postfix => RakuAST::Postfix.new("++")
    )
  )
)

This allows for things like writing Raku in different languages:

say Q/my $x; put $x/.AST.DEPARSE("NL")
mijn $x;
zeg-het $x

Bonus experiment

Raku comes with a "Rakudo Star" installation, which comes with a set of blessed third party modules preinstalled. I love this! It's a great compromise between the maintainer burdens of a large standard library and the user burdens of making everybody find the right packages in the ecosystem.


Blog Rec

Feel obligated to recommend some Raku blogs! Elizabeth Mattijsen posts a ton of stuff to dev.to about Raku internals. Codesections has a pretty good blog; he's the person who eventually got me to try out Raku. Finally, the Raku Advent Calendar is a great dive into advanced Raku techniques. Bad news is it only updates once a year, good news is it's 25 updates that once a year.


    • All techniques chapters now have a "Further Reading" section
    • "System modeling" chapter significantly rewritten
    • "Conditionals" chapter expanded, now a real chapter
    • "Logic Programming" chapter now covers datalog, deductive databases
    • "Solvers" chapter has diagram explaining problem
    • Eight new exercises
    • Tentative front cover (will probably change)
    • Fixed some epub issues with math rendering

  2. Analogues are Scala's underscore, except unlike Scala it's a value and not syntax, and like Python's Ellipsis, except it has additional semantics. 

  3. Spaces added so buttondown doesn't think they're tags 

A list of ternary operators

5 November 2024 at 18:40

Sup nerds, I'm back from SREcon! I had a blast, despite knowing nothing about site reliability engineering and being way over my head in half the talks. I'm trying to catch up on The Book and contract work now so I'll do something silly here: ternary operators.

Almost all operations on values in programming languages fall into one of three buckets:

  1. Unary operators, where the operator goes before or after exactly one argument. Examples are x++ and -y and !bool. Most languages have a few critical unary operators hardcoded into the grammar. They are almost always symbols, but sometimes are string-identifiers (not).
  2. Binary operators, which are placed between exactly two arguments. Things like + or && or >=. Languages have a lot more of these than unary operators, because there's more fundamental things we want to do with two values than one value. These can be symbols or identifiers (and).
  3. Functions/methods that prefix any number of arguments. func(a, b, c), obj.method(a, b, c, d), anything in a lisp. These are how we extend the language, and they almost-exclusively use identifiers and not symbols.1

There's one widespread exception to this categorization: the ternary operator bool ? x : y.2 It's an infix operator that takes exactly three arguments and can't be decomposed into two sequential binary operators. bool ? x makes no sense on its own, nor does x : y.

Other ternary operators are extremely rare, which is why conditional expressions got to monopolize the name "ternary". But I like how exceptional they are and want to compile some of them. A long long time ago I asked Twitter for other ternary operators; this is a compilation of some applicable responses plus my own research.

(Most of these are a bit of a stretch.)

Stepped Ranges

Many languages have some kind of "stepped range" function:

# Python
>>> list(range(1, 10, 2))
[1, 3, 5, 7, 9]

There's the "base case" of start and endpoints, and an optional step. Many languages have a binary infix op for the base case, but a few also have a ternary for the optional step:

# Frink
> map[{|a| a*2}, (1 to 100 step 15) ] 
[2, 32, 62, 92, 122, 152, 182]

# Elixir
> IO.puts Enum.join(1..10//2, " ")
1 3 5 7 9

This isn't decomposable into two binary ops because you can't assign the range to a value and then step the value later.

Graph ops

In Graphviz, a basic edge between two nodes is either the binary node1 -> node2 or the ternary node1 -> node2 [edge_props]:

digraph G {
  a1 -> a2 [color="green"]
}

Output of the above graphviz

Graphs seem ternary-friendly because there are three elements involved with any graph connection: the two nodes and the connecting edge. So you also see ternaries in some graph database query languages, with separate places to specify each node and the edge.

# GSQL (https://docs.tigergraph.com/gsql-ref/4.1/tutorials/gsql-101/parameterized-gsql-query)
SELECT tgt
    FROM start:s -(Friendship:e)- Person:tgt;

# Cypher (https://neo4j.com/docs/cypher-manual/current/introduction/cypher-overview/)
MATCH (actor:Actor)-[:ACTED_IN]->(movie:Movie {title: 'The Matrix'})

Obligatory plug for my graph datatype essay.

Metaoperators

Both Raku and J have special higher-order functions that apply to binary infixes. Raku calls them metaoperators, while J calls them adverbs and conjunctions.

# Raku

# `a «op» b` is map, "cycling" shorter list
say <10 20 30> «+» <4 5>
(14 25 34)

# `a Rop b` is `b op a`
say 2 R- 3
1
NB. J

NB. x f/ y creates a "table" of x f y
   1 2 +/ 10 20
11 21
12 22

The Raku metaoperators are closer to what I'm looking for, since I don't think you can assign the "created operator" directly to a callable variable. J lets you, though!

   h =: +/
   1 2 h 3 4
4 5
5 6

That said, J has some "decomposable" ternaries that feel spiritually like ternaries, like amend and fold. It also has a special ternary-ish construct called the "fork".3 x (f g h) y is parsed as (x f y) g (x h y):

NB. Max - min
   5 (>. - <.) 2
3
   2 (>. - <.) 5
3

So at the top level that's just a binary operator, but the binary op is constructed via a ternary op. That's pretty cool IMO.

Assignment Ternaries

Bob Nystrom points out that in many languages, a[b] = c is a ternary operation: it is not the same as x = a[b]; x = c.

A weirder case shows up in Noulith and Raku (again): update operators. Most languages have the += binary operator, these two have the f= ternary operator. a f= b is the same as a = f(a, b).

# Raku
> my $x = 2; $x max= 3; say $x
3

Arguably this is just syntactic sugar, but I don't think it's decomposable into binary operations.

Custom user ternaries

Tikhon Jelvis pointed out that Agda lets you define custom mixfix operators, which can be ternary or even tetranary or pentanary. I later found out that Racket has this, too. Objective-C looks like this, too, but feels different somehow.

Near Misses

All of these are arguable, I've just got to draw a line in the sand somewhere.

  • Regular expression substitutions: s/from/to/flags seems like a ternary, but I'd argue it's a datatype constructor, not an expression operator.
  • Comprehensions like [x + 1 | x <- list]: looks like the ternary [expr1 | expr2 <- expr3], but expr2 is only binding a name. Arguably a ternary if you can map and filter in the same expression a la Python or Haskell, but should that be considered sugar for a map plus a filter?
  • Python's operator chaining (1 < x < 5): syntactic sugar for 1 < x and x < 5.
  • Someone suggested glsl swizzles, which are very cool but binary operators.

Why are ternaries so rare?

Ternaries are somewhat more common in math and physics, f.ex in integrals and sums. That's because they were historically done on paper, where you have a 2D canvas, so you can do stuff like this easily:

10
Σ    n
n=0

We express the ternary by putting arguments above and below the operator. All mainstream programming languages are linear, though, so any given symbol has only two sides. Plus functions are more regular and universal than infix operators so you might as well write Sum(n=0, 10, n). The conditional ternary slips through purely because it's just so darn useful. Though now I'm wondering where it comes from in the first place. Different newsletter, maybe.

But I still find ternary operators super interesting, please let me know if you know any I haven't covered!


Blog Rec

This week's blog rec is Alexis King! Generally, Alexis's work spans the theory, practice, and implementation of programming languages, aimed at a popular audience and not an academic one. If you know her for one thing, it's probably Parse, don't validate, which is now so mainstream most people haven't read the original post. Another good one is about modeling open-world systems with static types.

Nowadays she is far more active on Programming Languages Stack Exchange, where she has blog-length answers on reading type notations, compiler design, and why arrows.


  1. Unless it's a lisp. 

  2. Or x if bool else y, same thing. 

  3. I say "ish" because trains can be arbitrarily long: x (f1 f2 f3 f4 f5) y is something I have no idea how to parse

TLA from first principles

22 October 2024 at 17:14

No Newsletter next week

I'll be speaking at USENIX SRECon!

TLA from first principles

I'm working on v0.5 of Logic for Programmers. In the process of revising the "System Modeling" chapter, I stumbled on a great way to explain the Temporal Logic of Actions that TLA+ is based on. I'm reproducing that bit here with some changes to fit the newsletter format.

Note that by this point the reader has already encountered property testing, formal verification, decision tables, and nontemporal specifications, and should already have a lot of practice expressing things as predicates.


The intro

We have some bank users, each with an account balance. Bank users can wire money to each other. We have overdraft protection, so wires cannot reduce an account value below zero.

For the purposes of introducing the ideas, we'll assume an extremely simple system: two hardcoded variables alice and bob, both start with 10 dollars, and transfers are only from Alice to Bob. Also, the transfer is totally atomic: we check for adequate funds, withdraw, and deposit all in a single moment of time. Later [in the chapter] we'll allow for multiple nonatomic transfers at the same time.

First, let's look at a valid behavior of the system, or possible way it can evolve.

alice   10 ->  5 -> 3  -> 3  -> ...
bob     10 -> 15 -> 17 -> 17 -> ...

In programming, we'd think of alice and bob as variables that change. How do we represent those variables purely in terms of predicate logic? One way is to instead think of them as arrays of values. alice[0] is the initial state of alice, alice[1] is after the first time step, etc. Time, then, is "just" the set of natural numbers.

Time  = {0, 1, 2, 3, ...}
alice = [10, 5, 3, 3, ...]
bob   = [10, 15, 17, 17, ...]

In comparison to our valid behavior, here are some invalid behaviors:

alice = [10, 3,  ...]
bob   = [10, 15, ...]

alice = [10, -1, ...]
bob   = [10, 21, ...]

The first is invalid because Bob received a different amount than Alice lost. The second is invalid because it violates our proposed invariant, that accounts cannot go negative. Can we write a predicate that is true for valid transitions and false for our two invalid behaviors?

Here's one way:

Time = Nat // {0, 1, 2, etc}

Transfer(t: Time) =
  some value in 0..=alice[t]:
    1. alice[t+1] = alice[t] - value
    2. bob[t+1] = bob[t] + value

Go through and check that this is true for every t in the valid behavior and false for at least one t in each invalid behavior. Note that the steps where Alice doesn't send a transfer also pass Transfer; we just pick value = 0.
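If you'd rather check it mechanically, the quantifier translates directly into code. A quick Python sketch of exactly that check (an illustration of the math, nothing more):

# Python: brute-force checking Transfer over the valid behavior
alice = [10, 5, 3, 3]
bob   = [10, 15, 17, 17]

def transfer(t):
    return any(alice[t + 1] == alice[t] - value and bob[t + 1] == bob[t] + value
               for value in range(0, alice[t] + 1))   # some value in 0..=alice[t]

print(all(transfer(t) for t in range(len(alice) - 1)))   # True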

I can now write a predicate that perfectly describes a valid behavior:

Spec = 
  1. alice[0] = 10
  2. bob[0]   = 10
  3. all t in Time:
    Transfer(t)

Now allowing "nothing happens" as "Alice sends an empty transfer" is a little bit weird. In the real system, we probably don't want people to constantly be sending each other zero dollars:

Transfer(t: Time) =
- some value in 0..=alice[t]:
+ some value in 1..=alice[t]:
    1. alice[t+1] = alice[t] - value
    2. bob[t+1] = bob[t] + value

But now there can't be a timestep where nothing happens. And that means no behavior is valid! At every step, Alice must transfer at least one dollar to Bob. Eventually there is some t where alice[t] = 0 && bob[t] = 20. Then Alice can't make a transfer, Transfer(t) is false, and so Spec is false.1

So typically when modeling we add a stutter step, like this:

Spec =
  1. alice[0] = 10
  2. bob[0]   = 10
  3. all t in Time:
    || Transfer(t)
    || 1. alice[t+1] = alice[t]
       2. bob[t+1] = bob[t]

(This is also why we can use infinite behaviors to model a finite algorithm. If the algorithm completes at t=21, t=22,23,24... are all stutter steps.)

There's enough moving parts here that I'd want to break it into subpredicates.

Init =
  1. alice[0] = 10
  2. bob[0]   = 10

Stutter(t) =
  1. alice[t+1] = alice[t]
  2. bob[t+1] = bob[t]

Next(t) = Transfer(t) // foreshadowing

Spec =
  1. Init
  2. all t in Time:
    Next(t) || Stutter(t)

Now finally, how do we represent the property NoOverdrafts? It's an invariant that has to be true at all times. So we do the same thing we did in Spec, write a predicate over all times.

property NoOverdrafts =
  all t in Time:
    alice[t] >= 0 && bob[t] >= 0

We can even say that Spec => NoOverdrafts, ie if a behavior is valid under Spec, it satisfies NoOverdrafts.

One of the exercises

Modify the Next so that Bob can send Alice transfers, too. Don't try to be too clever, just do this in the most direct way possible.

Bonus: can Alice and Bob transfer to each other in the same step?

Solution [in back of book]: We can rename Transfer(t) to TransferAliceToBob(t), write the converse as a new predicate, and then add it to Next. Like this:

TransferBobToAlice(t: Time) =
  some value in 1..=bob[t]:
    1. alice[t+1] = alice[t] + value
    2. bob[t+1] = bob[t] - value

Next(t) =
  || TransferAliceToBob(t)
  || TransferBobToAlice(t)

Now, can Alice and Bob transfer to each other in the same step? No. Let's say they both start with 10 dollars and each try to transfer five dollars to each other. By TransferAliceToBob we have:

1. alice[1] = alice[0] - 5 = 5
2. bob[1] = bob[0] + 5 = 15

And by TransferBobToAlice, we have:

1. bob[1] = bob[0] - 5 = 5
2. alice[1] = alice[0] + 5 = 15

So now we have alice[1] = 5 && alice[1] = 15, which is always false.

Temporal Logic

This is good and all, but in practice, there's two downsides to treating time as a set we can quantify over:

  1. It's cumbersome. We have to write var[t] and var[t+1] all over the place.
  2. It's too powerful. We can write expressions like alice[t^2-5] = alice[t] + t.

Problem (2) might seem like a good thing; isn't the whole point of logic to be expressive? But we have a long-term goal in mind: getting a computer to check our formal specification. We need to limit the expressivity of our model so that we can make it checkable.

In practice, this will mean making time implicit to our model, instead of explicitly quantifying over it.

The first thing we need to do is limit how we can use time. At a given point in time, all we can look at is the current value of a variable (var[t]) and the next value (var[t+1]). No var[t+16] or var[t-1] or anything else complicated.

And it turns out we've already seen a mathematical convention for expressing this: priming!2 For a given time t, we can define var to mean var[t] and var' to mean var[t+1]. Then Transfer(t) becomes

Transfer =
  some value in 1..=alice:
    1. alice' = alice - value
    2. bob' = bob + value

Next we have the construct all t in Time: P(t) in both Spec and NoOverdrafts. In other words, "P is always true". So we can add always as a new term. Logicians conventionally use □ or [] to mean the same thing.3

property NoOverdrafts =
  always (alice >= 0 && bob >= 0)
  // or [](alice >= 0 && bob >= 0)

Spec =
  Init && always (Next || Stutter)

Now time is almost completely implicit in our spec, with just one exception: Init has alice[0] and bob[0]. We just need one more convention: if a variable is referenced outside of the scope of a temporal operator, it means var[0]. Since Init is outside of the [], it becomes

Init =
  1. alice = 10
  2. bob = 10

And with that, we've removed Time as an explicit value in our model.

The addition of primes and always makes this a temporal logic, one that can model how things change over time. And that makes it ideal for modeling software systems.

Modeling with TLA+

One of the most popular specification languages for modeling these kinds of concurrent systems is TLA+. TLA+ was invented by the Turing award-winner Leslie Lamport, who also invented a wide variety of concurrency algorithms and LaTeX. Here's our current spec in TLA+:

---- MODULE transfers ----
EXTENDS Integers

VARIABLES alice, bob
vars == <<alice, bob>>

Init ==
  alice = 10 
  /\ bob = 10

AliceToBob ==
  \E amnt \in 1..alice:
    alice' = alice - amnt
    /\ bob' = bob + amnt

BobToAlice ==
  \E amnt \in 1..bob:
    alice' = alice + amnt
    /\ bob' = bob - amnt

Next ==
  AliceToBob
  \/ BobToAlice

Spec == Init /\ [][Next]_vars

NoOverdrafts ==
  [](alice >= 0 /\ bob >= 0)

====

TLA+ uses ASCII versions of mathematicians' notation: /\ and \/ for && and ||, \A and \E for all and some, etc. The only thing that's "unusual" (besides == for definition) is the [][Next]_vars bit. That's TLA+ notation for [](Next || Stutter): Next or Stutter always happens.


The rest of the chapter goes on to explain model checking, PlusCal (for modeling nonatomic transactions without needing to explain the exotic TLA+ function syntax), and liveness properties. But this is the intuition behind the "temporal logic of actions": temporal operators are operations on the set of points of time, and we restrict what we can do with those operators to make reasoning about the specification feasible.

Honestly I like it enough that I'm thinking of redesigning my TLA+ workshop to start with this explanation. Then again, maybe it only seems good to me because I already know TLA+. Please let me know what you think about it!

Anyway, the new version of the chapter will be in v0.5, which should be out mid-November.


Blog Rec

This one is really dear to me: Metadata, by Murat Demirbas. When I was first trying to learn TLA+ back in 2016, his post on using TLA+ in a distributed systems class was one of, like... three public posts on TLA+. I must have spent hours rereading that post and puzzling out this weird language I stumbled into. Later I emailed Murat with some questions and he was super nice in answering them. Don't think I would have ever grokked TLA+ without him.

In addition to TLA+ content, a lot of the blog is also breakdowns of papers he read— like the morning paper, except with a focus on distributed systems (and still active). If you're interested in learning more about the science of distributed systems, he has an excellent page on foundational distributed systems papers. But definitely check out his deep readings, too!


  1. In the book this is presented as an exercise (with the solution in back). The exercise also clarifies that since Time = Nat, all behaviors have an infinite number of steps. 

  2. Priming is introduced in the chapter on decision tables, and again in the chapter on database invariants. x' is "the next value of x", so you can use it to express database invariants like "jobs only move from ready to started or aborted." 

  3. I'm still vacillating on whether I want a "beyond logic" appendix that covers higher order logic, constructive logic, and modal logic (which is what we're sneakily doing right now!)

    While I'm here, this explanation of always as all t in Time isn't 100% accurate, since it doesn't explain why things like [](P => []Q) or <>[]P make sense. But it's accurate in most cases and is a great intuition pump. 

My patented Miracle Tonic would have prevented the CrowdStrike meltdown

23 July 2024 at 15:22

Last Friday CrowdStrike did something really bad and it destroyed every airport in the world. I didn't bother to learn anything else about it because I was too busy writing my 10k whitepaper about how all the problems were caused by one simple mistake: not drinking my patented Miracle Tonic™®.

Developers who drink my Miracle Tonic write code 100% faster with 100% fewer bugs. This would have prevented the CrowdStrike outage, the 2016 DNC hack, Ariane 5, Therac-25, and that one moth caught in the Harvard Mark II. Developers are also happier at work, suffer no burnout, and keep all standups to exactly five minutes.

The Miracle Tonic is so effective that it should be immoral not to drink it. It's like if surgeons didn't wash their hands. If you write code for a living and you don't drink my Miracle Tonic, do us all a favor and never touch a computer again. You idiot. You absolute moron. I can't believe you call yourself a "professional".

Frequently Asked Questions:

Do you have any actual evidence that Miracle Tonic actually helps programmers?

Yes I do! All sorts of studies prove the effectiveness of Miracle Tonic:

  1. One survey found that 68% of devs drinking miracle tonic say their code is "awesome". That means it works!
  2. A double-blind clinical trial found that 7 undergraduates who were given Miracle Tonic wrote 10% better code (using the Hills-Bourne-Davidio Corrected Dolorimetry metric) than the control group (6 undergraduates who were punched in the face).
  3. Someone found twelve projects on GitHub that didn't use Miracle Tonic and they had 268% worse failure rates than this one project I did with it. That's a P value of 0.00004!

That's so many studies! I can say with 100% confidence that Miracle Tonic is proven to work beyond a shadow of a doubt.

I read a study saying that Miracle Tonic gives people headaches.

Why are you trusting studies? Are you really gonna listen to some graduate student dweeb who's never been a real programmer?! If you're curious about Miracle Tonic, just try it and see for yourself.

Are there any downsides to drinking Miracle Tonic?

Of course, there is no silver bullet, everything is tradeoffs, etc etc etc. The downside of Miracle Tonic is it doesn't work if your company is a toxic feature factory that cares more about micromanaging mindless worker drones than cultivating good engineers. And don't drink it if you're allergic to peanuts.

This tastes revolting.

I've trained five Miracle Tonic Brand Ambassadors and they all tell me it's delicious. Your taste buds must be wrong.

I tried drinking your Miracle Tonic and it didn't make me a better programmer.

How dare you. How dare you spread FUD about the most important programming technology ever made. Were you drinking exactly 12.063 mL at 67° C every 75 minutes? Yeah, thought not. Of course it's not going to work if you didn't follow it properly. And you'd know this if you took my $1500 three-day "how to drink Miracle Tonic" course. Which you didn't. Get out of my sight.

How does Miracle Tonic compare to competing products, like toad oil, Pirelli's Miracle Elixir, or Design-by-Contract?

Fucking charlatans, all of them.


This is the part of my job I dread.

I do formal methods for a living. Last year 100% of my income came from either writing specifications or training others how to write specifications. This year, due to a lot of work in diversifying my revenue streams, it will only be 90% of my income. This creates an immense pressure to become an ambulance chaser, to see ten billion dollars in economic damage as a chance to say "look at me!", a longshot chance to find the next source who'll pay my next month's rent.

I'm also a True Believer. Of course formal methods would have prevented the CrowdStrike outage! It also would have prevented COVID-19, the Cuban Missile Crisis and the Lincoln assassination. Years of being an advocate have shaped my worldview to always see how it could have helped, how everything is just the right shape nail for the hammer I'm selling you.

None of this depends on anything particular to formal methods, which is why advocates of every stripe are telling us "what they should have done", because every advocate is a true believer, and every advocate wants to get paid. I understand this all, and I get why people do this, but I hate feeling this way, that all this misery begets opportunity. I hate the pressure to say "they needed X" as fast as possible, before people lose interest, regardless of whether X would help or even if they were already using X. I hate the feeling that capitalizing on this might compromise my principles.

Most of all though, I fear the slippery slope from advocating to grifting. There are people out there who saw the outage and got excited. Please, God, let me never be that.

Keep perfecting your config

16 July 2024 at 19:18

First of all, I wanted to extend a huge and heartfelt thank you to all of the people who bought Logic for Programmers. Seeing the interest is incredible motivation to continue improving it. If you read it and have feedback, please share it with me!

Second, I have a new blogpost out: Toolbox Languages. It's about five obscure languages I use that make certain hard problems I have much easier. I hope it encourages other people to talk about their toolbox languages (or share new ones with me!). Patreon blog notes (and the cut sixth language) here.

Third, actual newsletter.

Keep perfecting your config

Last week I read the article stop perfecting your config, which argues that people spend too much time tinkering with "editors, IDEs, terminals [and] other applications". I firmly believe the opposite, that people should spend more time tinkering. To be fair, the OP says

Before I continue, let’s make one thing clear. By no means am I saying that there is something wrong with trying to master your tools or attempting to adjust them to suit your specific needs. It’s actually the opposite.

So I read this more as "don't make my mistakes" than "don't do what I don't like."1 At the same time, it's worth noting what "his mistakes" actually were— what he counted as perfecting his configs. Reading the article, he mentions Neovim keymaps once, plugins once, and color schemes four times.

There's an idea in programming language design called Wadler's Law, which is that the majority of discussion is on syntax and not semantics. I imagine this applies to configuration too: the majority of time spent "tinkering" is on layout and coloring and not on changes to the tool's behavior. This is a trap. You get the most value out of "configuration" when you use it to improve your workflow.

Here's a Neovim script I wrote a while back:

function LoadLocal(local_task)
  vim.b.local_task = local_task
end

function RunLocal()
  vim.cmd(vim.b.local_task)
end

vim.cmd [[command! -nargs=1 LoadLocal call v:lua.LoadLocal(<f-args>)]]
vim.keymap.set('n', 'gxl', RunLocal, {desc="Buffer Task"})

On calling :LoadLocal foo it loads foo as the local command for that specific buffer. Then, typing gxl executes foo. The command can be anything, up to and including shelling out to !rm *.

Oh, also any other script can set vim.b.local_task.2 LoadLocal can be used by the rest of my configs. When I open an XML file it sets the local task to the appropriate transformation script.

Those eight lines have probably saved me at least eight hours of drudgework over their lifespan. The full taskrunner has saved even more. That's the kind of customization that's most useful.

Anything that goes for an IDE goes double for the shell. I use a ton of shell scripts.

Customization teaches you

A common argument against customization is that you shouldn't do it until you know the tooling basics well. For example, you shouldn't install a filetree plugin for neovim because it comes with Netrw. In other words, there's a difference between the skill of using your tools and the skill of customizing them, and the first is more important than the second.

But just as often customization promotes mastery, by smoothing out the starting friction with using a feature. Say you're trying to get more comfortable with vim macros. Here's a keymapping that could help:

vim.keymap.set('n', "<leader>Q", [[:let @q = input("Edit macro:", @q)<CR>]])

Vim macros are stored in a text register: recording the @q macro will overwrite whatever you copied to the "q register. This means that you can treat a macro as text. input will fill a prompt with the @q macro text, let you edit it as text, and then save your changes back to the register.

This makes it so much easier to experiment with macros! If you screw up a macro while recording, you don't need to start over, you can just finish and make an edit afterwards. If you try a finished macro and it does something wrong, same thing, just undo and edit it.

Reasons not to customize

I've heard two other reasons why not to heavily customize your system:

  1. It'll be harder for other people to pair with you on your machine
  2. It'll be harder for you to work on machines without the customizations: servers you SSHed into, VMs, other people's computers, etc.

(1) is only a problem if you're modifying default keybindings or adding surprising new ones. I deal with it by having two editors: a heavily customized Neovim for solo work, a vanilla VSCode (plus plugins) for collaboration.

(2) is a bigger problem. I find the problem is how it messes with muscle memory. In vim I've mapped H/L to beginning/end of line when normally they mean "first/last line in view", and it always screws me up. It is so immensely frustrating. Other than that, though, I generally can tolerate losing my config for a bit. Most of the stuff that's most useful to me are things that aren't useful when I'm remoting, so I don't miss them.

One more reason not to customize, which I haven't heard from others: you can easily break stuff that you don't know how to fix. This is more true with stuff like vim and shell that allow arbitrary scripts as config, less so with VSCode or the new crop of modal editors.


  1. Also he wrote positively about configuration later

  2. I swear to God if this leads to someone targeting an exploit at me I will quit programming forever 

Logic for Programmers now in early access!

8 July 2024 at 17:33

I am delighted to announce that Logic for Programmers is now available for purchase! While still in early access, it's almost 20,000 words, has 30 exercises, and covers a wide variety of logic applications:

  • Property testing
  • Functional correctness and contracts
  • Formal verification
  • Decision tables
  • Database constraints
  • Data modeling and Alloy
  • Constraint solving
  • Logic programming

I want to emphasize the book is early access, meaning it's not close to done yet. I'm releasing it now so I can get feedback from readers and use that to decide what I write next. But I'm only willing to release it because I think, even in its current state, it's worth reading. I'm just promising it'll be even better in a few months.

What's next

I think exercise solutions are broken on epub, so step one is getting that fixed.

After that, I'm taking a short break. I put basically all of June into working on this and need to get back to blogging and client acquisition. Plus I need a bit of time to gather reader feedback.

I'm thinking of new major releases being either every two weeks or every month or so. I'll upload new versions of the book between releases and then send a Leanpub publication announcement when we hit the next milestone.

PS

There's a free book coupon on the Patreon. Enough people have joined that I feel obligated to start posting more original content there why does this keep happening to me

Solving a math problem with planner programming

2 July 2024 at 15:46

The deadline for the logic book is coming up! I'm hoping to have it ready for early access by either the end of this week or early next week. During a break on Monday I saw this interesting problem on Math Stack Exchange:

Suppose that at the beginning there is a blank document, and a letter "a" is written in it. In the following steps, only the three functions of "select all", "copy" and "paste" can be used.

Find the minimum number of steps to reach at least 100,000 a's (each of the three operations of "select all", "copy" and "paste" is counted as one step). If the target number is not specified, and I want to get the exact amount of a, is there a general formula?

The first two answers look for analytic solutions. The last answer shares a C++ program that finds it via breadth-first search. I'll reproduce it here:

#include <iostream>
#include <queue>

enum Mode
{
    SELECT,
    COPY,
    PASTE
};

struct Node
{
    int noOfAs;
    int steps;
    int noOfAsCopied;
    Mode mode;
};

int main()
{
    std::queue<Node> q;

    q.push({1, 0, 0, SELECT});

    while (!q.empty())
    {
        Node n = q.front();
        q.pop();

        if (n.noOfAs >= 100000)
        {
            std::cout << n.steps << std::endl;
            break;
        }

        switch (n.mode)
        {
        case SELECT:
            q.push({n.noOfAs, n.steps + 1, n.noOfAsCopied, COPY});
            break;
        case COPY:
            q.push({n.noOfAs, n.steps + 1, n.noOfAs, PASTE});
            break;
        case PASTE:
            q.push({n.noOfAs, n.steps, n.noOfAsCopied, SELECT});
            q.push({n.noOfAs + n.noOfAsCopied, n.steps + 1, n.noOfAsCopied, PASTE});
            break;
        }
    }

    return 0;
}

This is guaranteed to find a shortest possible solution due to a fun property of BFS: the distance of nodes to the origin never decreases. If you evaluate Node Y after Node X, then Y.dist >= X.dist, meaning that the first valid solution will be a shortest possible solution. I should make this into a logic book example!

This approach has a drawback, though: it prevents us from exploiting an insight. We should be able to fuse the select and copy steps together, meaning instead of having three actions (select, copy, paste) we only need two (selectcopy, paste), where selectcopy takes twice as many steps as pasting.

But we can't make that optimization because it breaks monotonicity. We're now pushing a mix of n+1 and n+2 steps onto the queue, and there's no way to order things to guarantee all of the n+1 steps are searched before any n+2 step.
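
(If you wanted to stay in a general-purpose language, the usual fix is uniform-cost search, i.e. Dijkstra with a priority queue, which handles mixed step costs. Here's a rough Python sketch, with the fused selectcopy costing 2 and paste costing 1; the state encoding is mine.)

import heapq

def min_steps(target=100_000):
    start = (1, 0)                               # (number of a's, clipboard)
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, (a, clip) = heapq.heappop(queue)
        if a >= target:
            return cost
        if cost > dist[(a, clip)]:
            continue                             # stale queue entry
        successors = [((a, a), 2)]               # fused select+copy
        if clip > 0:                             # pasting nothing is useless
            successors.append(((a + clip, clip), 1))
        for state, step in successors:
            if cost + step < dist.get(state, float("inf")):
                dist[state] = cost + step
                heapq.heappush(queue, (cost + step, state))

print(min_steps())  # should match the cost of the Picat plan below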

I thought I'd try to solve it with planning language instead, so we can get both the elegant solution and the optimization.

Planning

The rough idea of planning is that you provide an initial state, a set of actions, and a target, and the tool finds the shortest sequence of actions that reaches the target. I've written about it in-depth here and also a comparison of planning to model checking here. I like how some tough problems in imperative and functional paradigms become easy problems with planning.

This is all in Picat, by the way, which I've talked about more here and in the planning piece. I'll just be explaining the planning stuff specific to this problem.

import planner.
import util.

main =>
  Init = $state(1, 0) % one a, nothing copied
  , best_plan(Init, Plan, Cost)
  , nl
  , printf("Cost=%d%n", Cost)
  , printf("Plan=%s%n", join([P[1]: P in Plan], " "))
  .

We're storing the state of the system as two integers: the number of characters printed and the number of characters on our clipboard. Since we'll be fusing selects and copies, we don't need to also track the number of characters selected (unlike the C++).

final(state(A, _)) => A >= 100000.

action(state(A, Clipboard), To, Action, Cost) ?=>
  NewA = A + Clipboard
   , To = $state(NewA, Clipboard)
   , Action = {"P", To}
   , Cost = 1
   .

The paste action just adds the clipboard to the character count. Because Picat is a research language it's a little weird with putting expressions inside structures. If we did $state(1 + 1) it would store it as literally $state(1 + 1), not state(2).

Also you have to use dollar signs for definitions but no dollar signs for pattern matching inside a function definition. I have no idea why.

action(state(A, Clipboard), To, Action, Cost) ?=>
  To = $state(A, A)
   , Action = {"SC", To}
   , Cost = 2
   .

And that's it! That's the whole program. Running this gives us:

Cost=42
Plan=SC P P SC P P SC P P SC P P SC P P SC 
     P P SC P P SC P P SC P P P SC P P P

To find if there's a sequence that gets us exactly 100,000, we just need to make one change:

- final(state(A, _)) => A >= 100000.
+ final(state(A, _)) => A = 100000.

This returns a cost of 43.

On the other hand, I can't get it to find a path that makes exactly 100,001 characters, even with some optimizations. This is because the shortest path is over 9000 steps long! I haven't checked if the C++ BFS can find it.

Metaplanning

One reason planning fascinates me so much is that, if a problem is now easy, you can play around with it. Like if I wanted to add "delete a character" as a move, that's easy:

action(state(A, Clipboard), To, Action, Cost) ?=>
  A > 0
  , NewA = A - 1
  , To = $state(NewA, Clipboard)
  , Action = {"D", To}
  , Cost = 1
  .

This doesn't make exceeding or reaching 100,000 easier, but it makes reaching 100,001 take 47 steps instead of 9000.

With some tweaks, I can also ask questions like "which numbers does adding deletion help the most?" or "Do some numbers have more than one shortest path? Which number has the most?"

Planning is really cool.

A brief introduction to interval arithmetic

25 June 2024 at 16:08

You've got a wall in your apartment and a couch. You measure the wall with a ruler and get 7 feet, then you measure the couch and get 7 feet. Can you fit the couch against that wall?

Maybe. If the two measurements are exactly 7 feet then sure, 7 ≤ 7. But you probably didn't line your measuring stick up perfectly with the couch, and also the stick itself might be a tiny bit long. There's some uncertainty around the edges, say 1/10th of a foot.

Instead of treating each as exactly 7 feet, we can instead say that each is somewhere between a minimum of 6.9 feet and a maximum of 7.1. We can write this as an interval (6.9, 7.1). From there, we can say interval (a, b) is "definitely less than" another interval (c, d) if b < c, ie the first interval ends before the second starts. Since 7.1 !< 6.9, we can't say for certain that the couch will fit against the wall. Your couch could actually be 7.1 feet and the wall 6.9 feet, or 7.05 and 7.01 feet, anything like that.

More arithmetic

When adding/multiplying/subtracting/dividing an interval by a scalar (single number), we just apply the op to both ends. I'll be using Frink because it has a built-in interval type.

  i[a, b] := new interval [a, b]
  i[2, 4] + 2
[4, 6]

  i[2, 4] * 2
[4, 8]

Operating on two intervals is more complicated. The general principle is "pick one value from each interval so that the operation gives the smallest possible result, then do the same for the largest possible result". If you measure two lengths as 3 ± 1 foot, then the length of both of them together must be 6 ± 2 feet. We can get this by just adding the minimums and maximums of the intervals together.

  i[1, 3] + i[2, 4]
[3, 7]

If we measure a rectangle's length as (5, 7) and its width as (3, 4), what's the area? Multiplication seems like it'd be the same as addition, just multiply the two minimums and the two maximums together:

  i[5, 7] * i[3, 4]
[15, 28]

But this breaks down when intervals cover negative numbers. If we just multiplied minimums and maximums, (-3, 3) * (-3, 3) would give us (9, 9). But we instead get the true minimum by picking -3 from the first interval and 3 from the second, which gives us -9. So the real interval is (-9, 9).

What about division? It works kinda like multiplication, except if the dividing interval spans 0, the result is undefined. Overall, interval arithmetic is like regular arithmetic except with more headaches.
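
Here's a rough Python sketch of those rules (mine, not Frink's built-in type): addition adds the endpoints, multiplication takes the min and max over all four endpoint products, and division refuses a divisor that spans zero. definitely_less_than is the couch-versus-wall test from the start.

from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def definitely_less_than(self, other):        # does the couch surely fit?
        return self.hi < other.lo

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

    def __truediv__(self, other):
        if other.lo <= 0 <= other.hi:
            raise ZeroDivisionError("divisor interval spans 0")
        return self * Interval(1 / other.hi, 1 / other.lo)

print(Interval(1, 3) + Interval(2, 4))    # Interval(lo=3, hi=7)
print(Interval(-3, 3) * Interval(-3, 3))  # Interval(lo=-9, hi=9)

Note that this sketch always treats the two operands as independent, so squaring by writing x * x gives (-9, 9). That's exactly the trouble coming up next.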

Trouble in paradise

Now, how about (-3, 3)²? On first thought we say (-3, 3)² = (-3, 3)*(-3, 3), which gives us the same interval (-9, 9). But again, that's wrong. In (-3, 3)*(-3, 3), we pick one number from each interval and multiply them together, while in (-3, 3)^2 we pick one number from the interval and multiply it by itself. So the actual interval is (0, 9).

  i[-3, 3]^2
[0, 9]

But wait, doesn't that mean that x*x != x^2? Yes, it does. Err no, it doesn't. Maybe. It depends on what we mean by "x".

The two interpretations

  1. x is some value in the interval (-3, 3), but we don't know what it is. Then x*x is the same value both times, so it should be (0, 9).
  2. x is "smeared out" over the entire interval. Then x*x can be treated as two separate intervals, giving us (-9, 9).

Now usually people use intervals to mean [1], but most interval arithmetic systems I've seen do [2], or at least a mix of both with a bias towards [2].

There's two reasons why. First of all, you can reason about [2] locally. Consider the equation y = sin(x)*cos(x) where x is the interval (0, 2π).1 Figuring out y under [2] is easy: it's just (-1*1, 1*1) = (-1, 1). But this is impossible under [1], because sin is only ±1 when cos is 0! The actual interval under [1] is (-0.5, 0.5).

Notice that (-0.5, 0.5) is a subinterval of (-1, 1). That's the other reason to assume [2] in calculations: the result will always cover [1].

Nonetheless [2] leads to lots of weird behavior, like x*x != x^2. This is called overdetermination and leads to overestimation, where your interval bounds are too wide to be that useful to you. Frink tries to be smart about this, and will in some cases rewrite interval arithmetic to avoid overdetermination, even when you want the intervals to be different.

  i[-3, 3]*i[-3, 3]
[0, 9] 

This is all moot if we assign the same interval (-3, 3) to different variables, though, because then interpretations [1] and [2] don't differ at all.

  x = i[-3, 3]
  y = i[-3, 3]
  x*y
[-9, 9]

How this affects you

# Python time!
>>> x = 0.3
>>> x + 0.00000000000000001
0.3
>>> x + 0.00000000000000002
0.3
>>> x + 0.00000000000000003
0.30000000000000004

If a (double precision) floating point calculation lands in the interval (0.3 - 3e-17, 0.3 + 3e-17), it's collapsed to the value 0.3. This has led to some proposals to use interval arithmetic instead of floating point. An interval may be less precise than a single FP number, but it will be more accurate: you won't have an exact number but you'll know that the correct number is somewhere in the bounds. A chain of FP computations, on the other hand, will always give you an exact number, but it could be very far from the actual answer.

These proposals have been around as long as floating point. The father of the IEEE-754 standard, William Kahan, wrote an essay about why interval arithmetic can't replace FP. His argument comes back to overestimation: in most computations, interval boundaries grow too big to be useful.

He wrote that back in 2006. Since then, there's been one big new interval arithmetic proposal: John Gustafson's universal numbers. I think his posit version isn't intervals but I'm not sure? This is well outside of my area of expertise. I can tell, though, that Gustafson and Kahan hate each other.

That's the main connection between intervals and computer science. Of course, there are also particular domains where intervals are really important, such as in manufacturing tolerances.


Logic for Programmers update

14k words! Writing a bunch of exercises and examples this week.


  1. Or (0, τ) for you tau lovers out there 

Logic for Programmers Update

21 June 2024 at 18:02

I spent the early week recovering and the later week working on Logic for Programmers ([init] [update]) because I have a self-imposed deadline of mid-July, backed up by a $1000 toxx clause. Here's where I currently am:

1: The book is now 14k words. About 4k are "basics", covering propositional logic, predicate logic, and set theory. Another 1500 words or so are things like the introduction, appendixes, and preliminary exercises. The rest is "techniques". I'm hoping for 20k words for the pre-alpha first draft.

(I shared an early version of the "basics" on my patreon.)

2: Based on some feedback from Saul Pwanson, I've organized each "technique" as a three parter: first a common programming situation, then how logic applies, then something cool that knowing logic enables. For example, testing → specifications of functions → property testing. The other techniques are

  • Functional correctness → contracts
  • Data invariants → stronger database constraints
  • Case coverage → decision tables
  • Conditionals → simplifying conditionals

The last "technique" is an overview of tools that you need some knowledge of logic to start using in the first place: constraint solving, logic programming, formal verification, formal specification. I'm trying to decide if-and-how to cover Z3 and SAT solving, or if there's anything else I'm missing.

2.5: Two techniques I haven't written about but think could be good material: state invariants, modeling requirements. Not sure about the best approaches to either. I'm also looking for other possible applications of logic in general. The key is that it can't just be something that logic helps you appreciate, logic has to help you do it better.

3: Beyond adding those new sections, I want to have more details of technique, more examples, and some exercises. I don't know the right proportion of them yet.

I wrote a special exercise directive: for html, it puts the solution in a dropdown, while for pdfs it puts it in the back of the book. I really like how Sphinx supports these kinds of extensions, though I wish the ecosystem was a little more robust (or the docs better).

4: I think that when the alpha version is out in July, the book is about 40% done. This is because I suspect that reader feedback will lead to a lot of revisions and improvements to the content, as well as ideas for new technique chapters. After that's incorporated, it's 80% done, and that's when I'll figure out nice typesetting and a good cover image.

5: There's now a leanpub page. I wanted to figure out the logistics of distribution early. Please do not buy it yet.

Have a great weekend!

Nondeterminism in Formal Specification

11 June 2024 at 14:30

Just an unordered collections of thoughts on this.

In programming languages, nondeterminism tends to come from randomness, concurrency, or external forces (like user input or other systems). In specification languages, we also have nondeterminism as a means of abstraction. Say we're checking if a user submitted the correct password. A software implementation might look something like this:

# pythonish
def attempt_login(user, pw):
    submitted_hash = hash_password(pw)
    actual_hash = db.get_password(uid=user.id)
    if submitted_hash == actual_hash:
        login()
    else:
        reject()

There's lots of implementation details in confirming the correct password: how we hash it, how we determine the right hashing mechanism, where we retrieve the actual password from, what query we use to retrieve it, etc. At a higher level of design, though, we might not care about any of that: we only want to model what happens if the password is correct, not how we determine correctness. So we can use nondeterminism to handwave that all away:

\* TLA+
AttemptLogin ==
    \/ Login \* \/ means `either`
    \/ RejectLogin

When the model checker tries to evaluate AttemptLogin, it will branch and create separate checking executions for picking Login or RejectLogin. This gives us the consequences of "user submits the wrong password" without having to write a bunch of spec figuring out what the wrong password is.

Mastering abstraction-nondeterminism is key to writing good specifications. It's why you can model a complex, detailed system with a much smaller specification and find bugs that map back to actual possibilities.

Two challenges of nondeterminism

  1. Nondeterminism is incredibly capable, which makes it not very tractable (ibid). It's hard to express properties on subsets of the model-space, like "If a user never gets rejected, then property XYZ always holds."
  2. Abstraction-nondeterminism pushes the spec further from the real-world implementation, so it's harder to see that your implementation matches it.

These tend to be bigger problems if the nondeterminism is deep in your design, kinda like functional core, imperative shell. We want to push it closer to the edges:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    \E pwd \in {GOOD, BAD}:
        AttemptLogin(pwd)

It's still nondeterministic, but it happens earlier, so is easier to control. We could for example swap {GOOD, BAD} with {GOOD} to guarantee that login is always successful.

In the extreme case, each component of the machine is locally deterministic, even if the components are concurrent. This is closest to something implementable. The world (ibid) can continue to be nondeterministic.

Initial states and prophecy variables

Specification languages can also have multiple initial states. We can use this to push nondeterminism into the spec setup by adding variables that encode what events will happen. For example, we could initialize the variable attempts_needed as an integer in 0..3, then write AttemptLogin like this:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    IF attempts_needed > 0 THEN
        /\ AttemptLogin(BAD)
        /\ attempts_needed' = attempts_needed - 1
    ELSE 
        /\ AttemptLogin(GOOD)

Now, for each initial state, the spec is wholly deterministic, and thus tractable. We can easily distinguish properties that hold if the user never sends the wrong password (attempts_needed = 0) vs if they get it wrong a few times.1 We can also use this to check only a subset of the specification at a time, which can help with checking very large state spaces.
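
A toy Python rendering of that (mine, not part of any spec): fix attempts_needed up front, and the rest of the behavior is fully determined.

def behavior(attempts_needed: int) -> list[str]:
    events = []
    while attempts_needed > 0:
        events.append("RejectLogin")     # AttemptLogin(BAD)
        attempts_needed -= 1
    events.append("Login")               # AttemptLogin(GOOD)
    return events

for n in range(4):                       # one deterministic run per initial state
    print(n, behavior(n))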

Leslie Lamport (the inventor of TLA+) calls these kinds of helpers prophecy variables, because they "predict" what nondeterministic action the system will take. You can't always use prophecy variables to completely remove nondeterminism from your specification, but it can help!

Refinement and nondeterminism

One of the main use cases of refinement— proving one specification implements another— is to remove abstraction nondeterminism from a system. I really need to add something about this to learntla.

An Exercise

I think this might be a good exercise to practice writing nondeterminism:

In the game Get Fifteen, players take turns picking numbers 1..9. Players can pick each number only once and cannot pick numbers their opponent picked. Winner is the first player to have three picked numbers that add to 15. This is identical to playing TicTacToe, but is easier to formally specify (no grid).2

Model a player playing tictactoe against an AI. Either can start first. The player needs to follow a deterministic strategy you implement. The AI just nondeterministically picks any available number. The property to check is "the player never loses".

(I haven't actually modeled this, but if people give it an honest shot and run into problems I'd be happy to give the specs a look!)


Plan for June: Logic for Programmers

I've decided to make a serious attempt to write an acceptable first draft of Logic for Programmers by July 15. If I don't have something I can readily turn into a Leanpub early-access by then, I will give $1k to a random newsletter subscriber (or donate it to a charity of their choice). I reserve the right to push the deadline if I get a new consulting contract.


  1. one thing we've lost is that the user can no longer fail an infinite number of times; they're always guaranteed to log in eventually. This is fixable in a few ways. 

  2. If you're not seeing it, try overlaying a tictactoe board on top of a magic square

I've been thinking about tradeoffs all wrong

5 June 2024 at 15:14

Sup nerds, I'm back from DDD Europe! Still processing the conference and adjusting to the horrible jetlag but also itching to get writing again, so let's go.

The opening keynote to the conference was called "Modern Tradeoff Analysis for Software Architecture". It was mostly about the factors that went into whether you should split up services into microservices and glue them together. Unsurprisingly "tradeoff" came up a whole lot: between throughput and latency, between simple and distributed transactions, between stability and volatility, etc. At the end he gave the example of comparing a queue-based architecture (SQS) to a pub-sub topic-based architecture (SNS).1 He said that the queues made it harder to add new services, but the topics made it harder to do access-control.2

That's when I realized I've been thinking about tradeoffs all wrong!

See, normally I think of tradeoffs as framing two positive things, like saying "SQS gives you better access-control while SNS makes it easier to add new services". Instead the speaker framed it as two wholly negative things, and the tradeoff is which negative thing you want less of. We can do this with any tradeoff, because any positive quality about X can be turned into a negative quality of Y. Instead of "X is faster, Y uses less space", we can say "X uses more space, Y is slower."

I don't know if this framing was intentional on the speaker's part, but I like it so much more. What's the difference? Even if you pick the least-bothersome drawback, you still have a drawback. Going back to the SQS vs SNS example, let's say you hate how hard it is to add new services to SQS, so you switch to SNS. Then you realize it's hard to do access-control, and this is a worse problem than the SQS problem was. So the right tradeoff is to go back to using SQS.

But now it's hard to add new services again. And that sucks! You don't get to stop thinking about the pros and cons just because you made a decision. There's still a problem you have to deal with.

Two rules of thumb

Rule one: after I make a decision about a tradeoff, I need a plan to address the negative consequences of my choice, in a way that doesn't also add in the negative consequences of the other choice.

Rule two: this explains a lot of the movement (or more cynically, churn) in software tech. Even when we make the right choice overall, in the moment we're still under a lot of pain, and so we look for new tools/technologies/practices that address the pain we currently feel.

Bonus rule: part of the reason we do Software Mimicry is to reduce the pain of a choice we made.


The conference talk was my last big project for spring. My consulting calendar is free so I'll try to devote some time to my personal projects in "progress" (learntla updates, alloy workshop, logic book). It's a little tough without a hard deadline!

No "things I'm curious about" because I'm still v. tired


  1. He used SQS and SNS because they're the names of the AWS services. I'm going to use them as shorthands for the architectures because SQS is easier to write than "queue-based architecture" 

  2. I don't know if this actually true, but I'll go with it because that's how he framed it. 

NoCode Will Not Bring Computing to the Masses

21 May 2024 at 17:54

No Newsletter next week

I'll be giving my conference talk at DDD Europe.

NoCode Will Not Bring Computing to the Masses

I don't have a whole lot of time this week so here's something that's been on my mind a little. I haven't researched any of the factual claims; consider this more some incoherent early-stage theory.

Define a "NoCode" tool as "any tool made to help people without any programming experience do things that normally require programming." Most software engineers have a very dim view of NoCode, mostly because the primary way SEs interact with it is to clean up the messes left by overly ambitious nocoders. I don't think it'd be controversial to say there's a wide variety of problems nonprogrammers should not try to solve with NoCode tooling.

On the other side of the spectrum, there are a lot of problems that are trivial to do if you can program, and impossible without it. The canonical example I use is "upcasing a string"; others are "find strings of digits in a document", "figure out each student's average score across several tests", etc. If you don't know how to program, you either have to find someone else's program that does this thing or do it manually. This is where I think NoCode has real potential to help the average computer user: it can let them do simple programming tasks without having to learn all of programming first.

The goal, then, is to figure out what defines the space of "trivial for programmers, impossible for regulars". I think a good first-metric would be "equivalent lines of python". If something can be done in L=10 lines of not-very-clever Python then it should hypothetically be nocodable. If it takes H=50 lines, it definitely should not be. In between is a grey area. These numbers are just off the top of my head; finding the real values of L and H would require a lot of research and user examples.1
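
For a sense of scale, the student-averages task is about three lines of not-very-clever Python (with made-up data):

scores = {"ana": [80, 92, 75], "bo": [60, 88, 91]}
averages = {student: sum(tests) / len(tests) for student, tests in scores.items()}
print(averages)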

NoCode adapted to this space probably wouldn't be that profitable, but it would do the most good for the most number of people.

"The most good" will do the least good

I wanted to talk about some of the consequences for NoCode design, but couldn't figure out where to start, so did the wise thing and took a walk. On the walk I realized that no matter how much a tool empowers the user, almost nobody will use it.

Proof: how many programmers have iPhones? How many of them use Shortcuts? I mean, I have Tasker on Android, I know how to use it and have used it in the past, and still I regularly don't use it.

It's not enough for a tool to solve your problem for you to use that tool. You have to also A) recognize the problem you have is solvable, B) recognize the tool can solve it, and C) be reasonably confident that you personally can solve the problem with the tool. You need knowledge, skill, and the right mentality. Even programmers, who already have that mentality, don't automatically generalize it to all applicable domains. Otherwise we'd be doing a lot more with our phones.

It's unreasonable to expect the average person will use any NoCode to solve their problems, regardless of how easy the tools are. This leaves the people for whom it provides economic value, which is why all modern NoCode tools are oriented towards business.

Do people want grey markets?

There's two exceptions to this rule I can think of.2 The first is Excel/GSheets, which we do see regular people pick up. I'd venture this is because "very simple data organization" and "semi-complicated math stuff" are needs that enough people actually have. Maybe also they have gentle on-ramps: you can solve basic problems while knowing very little. Finally, there could be a social aspect; everybody already knows about spreadsheets, so it's not as big a jump to start using them.

The other major exception is that I've seen a lot of people who do use tools like Shortcuts and Tasker… as consumers of other people's published automations. These tend to look more like "scripts" than "applications", suggesting that our official marketplaces are designed to deliver applications, but people also need scripts.3 The NoCode gives them a "grey market" they can use to install scripts without using the official heavyweight channels.

One negative consequence of this is that any widely-used NoCode also becomes a potent vector for malware, because the people who are running scripts are ill-equipped to audit them.

Ok, I checked the code and it said (and please bare with me here):

-If flibbity is frabjous, zibble wobble at 42% and blippity blop playlist; otherwise, schnoofle Grobber Mode.

SOMEBODY HELP — source


Trying to pull all of my thoughts into something coherent:

  1. People need scripts.
  2. NoCode can make scripting much easier.
  3. Even if NoCode makes scripting easier, people won't actually write scripts.
  4. What people want is to be easily able to find and run someone else's script like they do an application.
  5. NoCode could make finding and running other people's scripts easier, too.
  6. Doing so is potentially very dangerous, as people aren't able to evaluate scripts for safety.

  1. Have you ever noticed that High > Low, but H in vim moves you backwards and L moves you forward? Deep, man. 

  2. IFTTT/Zapier might be a third exception: people disconnecting disparate services with a NoCode tool. But I don't have enough firsthand knowledge to be able to say anything about them, and I don't know how popular they actually are. 

  3. The difference between "scripts" and "applications" seems very important to all this. See Why bother with Scripting? for a good discussion on this. 

What I look for in empirical software papers

16 May 2024 at 16:58

Behind on the talk and still reading a lot of research papers on empirical software engineering (ESE). Evaluating a paper studying software engineers is a slightly different skill than evaluating other kinds of CS papers, which feel a little closer to hard sciences or mathematics. So even people used to grinding through type theory or AI papers can find ESE studies offputting.

So here are some of my thoughts on how I approach it. This is mostly unedited and unorganized, just getting my thoughts down.

The high-level view

The goal of ESE is to understand how we develop software better in the real world. Compared to "regular" computer science, ESE has two unique problems:

  1. As with all kinds of knowledge work, software development is highly varied and lots of factors affect the final outcome.
  2. Software engineers develop new tools/processes/paradigms incredibly quickly and there's maybe a 1000:1 ratio of developers to researchers.

This makes it hard to say anything for certain with a high level of confidence. Instead of thinking of whether papers are correct or incorrect, I look for things that make me trust it more or less. I'll use this interchangeably with "like/dislike": "I like this study" is the same as "I trust it more".

Quantitative Metrics

I like metrics that are clear, unambiguous, and sensible to practitioners. "Quality" and "defects" are too ambiguous, "coupling" and "cohesion" aren't sensible. "Number of support tickets filed" and "distribution of model code lengths" are better.

In practice, clear and unambiguous metrics end up being more narrowly-scoped. I trust research on narrow metrics more than research on broad metrics, even if the former are less applicable to software at large.

Clinical Trials

I don't like "clinical trials", where developers are split into control and experiment groups and asked to complete a task. The problem is sample size: actual clinical trials in medicine are run on thousands of people at a time and cost millions of dollars. Most software clinical trials study less than 50 engineers at a time. With such a small sample size, there's just too much room for confounding factors.

It doesn't help that most studies are done on university students, which may not generalize to more experienced professionals.

There's a couple of exceptions, though. I'm more inclined to trust clinical trials on how to teach CS, as students are the target population anyway. And I pay more attention to clinical trials that find enormous effects. For example, Need for Sleep finds that a night of sleep deprivation reduced the trial group's coding quality by fifty percent.1 That's not the kind of effect you can write off as a confounding factor!

(I'd still like to see it replicated, though. Also, does mild sleep deprivation affect coding skill?)

Other Quantitative Research

Scraping GitHub for data is fine, as long as the researchers are careful to audit the data. Don't publish a groundbreaking paper claiming a correlation between language and program quality but accidentally count each bitcoin bug as 17 C++ bugs.

Developer surveys are good for measuring things like developer perceptions, emotional state, how widespread a technology is, etc. I don't think they're good at measuring the effectiveness of practices or determining if those opinions match reality. IE if a survey showed that 90% of developers think that their Java code is buggier than their Kotlin code, this tells us that developers perceive Kotlin as higher quality, not that it actually is!

Qualitative Studies

I like qualitative studies, things that aren't gathering hard metrics. If a quantitative study of correctness imperative vs logic programming would be "which has fewer bugs" or "what kinds of bugs does each type have", qualitative would be more like "how do imperative and logic programmers debug stuff differently" or "let's dissect this one unusual bug and see if we learn anything interesting."

IME science fields undervalue the impact of qualitative research. I find a lot less of it than quantitative papers, but I tend to trust the qualitative papers more. Maybe they're just overall higher quality; more likely it's because they're more narrowly scoped. And I believe (without evidence) that a field needs a strong foundation of qualitative knowledge to do stellar quantitative work.

Some papers do both qualitative and quantitative work. The researchers in Abbreviated vs. Full-word Identifier Names started with a clinical trial, and then shadowed individual developers to see how they navigated a code file. I trust the quantitative sections of hybrid papers a little more than just-quantitative papers, but it's not an enormous factor.

"Quantitative" and "qualitative" aren't words anymore.

Reading the Paper

I use the same basic approach with all CS papers: read the abstract and intro, read the conclusion and post-results discussion, then decide if I care enough to look further. If I like the results, I read the methodology more carefully. I skim background info and "related work" mostly to see if they make any claims I'd want to read up on.

Sometimes I'll use an AI to generate a paper summary. I only do this if I'm going to read the full paper; the summary is just to help me digest it.

I trust papers that share their data.

Threats to validity

Most software papers have a "threats to validity" section, where they talk about possible confounding factors. In practice it's mostly a way for the writers to show they covered their bases, by explaining why said confounding factors aren't a problem. I like papers which include TtVs they can't dismiss, or discussing unusual-but-important TtVs. I don't like papers which have bad explanations for why TtVs aren't a problem, or that miss really obvious TtVs.

Example of doing it wrong: Quantifying Detectable Bugs in JavaScript claims that typescript catches 15% of "public bugs for public projects on GitHub". Their threats to validity were "we only looked at public bugs" and "we may have been too inexperienced in typescript to catch even more bugs." One TtV they do not list is that they uniformly sampled all of GitHub for bugs and didn't filter the projects at all. When I checked some of their projects, a bunch were simple lesson repos by novices in code bootcamps.2 Why didn't they mention this TtV?

Example of doing it right: oh God it's already Thursday and I haven't gotten this newsletter out yet

References

For many papers, the most useful part is the reference section, because odds are I'll find a more interesting paper there. Even if none of the references look interesting, I'll still check a few to make sure the original paper is using them correctly. I don't like it when a paper says reference P makes claim C about topic T, and then I read P and it never even mentions T.

I wish journals let papers have two reference sections, one for the really essential references and one for the incidental ones, like background information or related work.

...Okay, that's all I got this week. Back to the talk mines.


Things I'm curious about

  • DORA Report: What's the state of research about the DORA report? Has anybody been able to replicate their findings? Has anyone evaluated them critically?
  • Puzzle Games: Puzzle games like flow free have thousands of levels. Are these programmatically generated? If so, how is it done, and how do they control the difficulty?

  1. Measured as "number of tests passed", from a researcher-provided test suite for the task. 

  2. I checked this back in 2019 or so, but it seems like the data is now offline. Props to them for making the data available in the first place! 

Paradigms succeed when you can strip them for parts

7 May 2024 at 16:48

I'm speaking at DDD Europe about Empirical Software Engineering!1 I have complicated thoughts about ESE and foolishly decided to update my talk to cover studies on DDD, so I'm going to be spending a lot of time doing research. Newsletters for the next few weeks may be light.

The other day I was catching myself up on the recent ABC conjecture drama (if you know you know) and got reminded of this old comment:

The methods in the papers in question can be used relatively quickly to obtain new non-trivial results of interest (or even a new proof of an existing non-trivial result) in an existing field. In the case of Perelman’s work, already by the fifth page of the first paper Perelman had a novel interpretation of Ricci flow as a gradient flow which looked very promising, and by the seventh page he had used this interpretation to establish a “no breathers” theorem for the Ricci flow that, while being far short of what was needed to finish off the Poincare conjecture, was already a new and interesting result, and I think was one of the reasons why experts in the field were immediately convinced that there was lots of good stuff in these papers. — Terence Tao

"Perelman's work" was proving the Poincaré conjecture. His sketch was 68 pages of extremely dense math. Tao's heuristic told him it was worth working through is that it had interesting ideas that were useful to other mathematicians, even if the final paper was potentially wrong.

I use this idea a lot in thinking about software: the most interesting paradigms are ones you can "scavenge" ideas from. Even if the whole paradigm isn't useful to you, you can pull out some isolated ideas, throw away the rest, and still benefit from it. These paradigms are the ones that are most likely to spread, as opposed to paradigms that require total buyin.

Let's take as an example functional programming (FP). ML/Haskell-style FP has a lot of interconnected ideas: higher-order functions, pattern matching, algebraic data types, immutable data structures, etc. But if you throw all of that away and stick to writing everything in idiomatic Java/Python/C, and the only change you make is writing more pure functions, then you've already won.

This has several positive effects on FP adoption. It immediately demonstrates that FP has good ideas and is not just some weird cult practice. It also means people can "dip their toes in" and benefit without having to go all in. Then they can judge if they're satisfied or want to go further into learning FP. And if they do go further, they can learn new ideas gradually in stages.

By contrast, compare array programming languages (APLs). There's a lot of cool ideas in APL, but they're so interconnected it's hard to scavenge anything out. If I want to use multidimensional arrays in Python I can't just whip it up myself, I have to import an external library like numpy and learn a whole bunch of new ideas at once. So most of the ideas in APL stay in APLs.3

A related idea is that techniques are only adopted if you can "half-ass" them. TDD is useful even if you only use it sometimes and phone in the "green-refactor" steps, which helped adoption. Cleanroom depends on everybody doing it properly all the time, which hindered adoption.

Scavenging from formal methods

Part of the reason I'm thinking about this is that business has been real slow lately and I'm thinking about how I can make formal methods (FM) more popular. What ideas can we scavenge from writing specifications?

Obvious ones are predicate logic and the implication operator, but they're both poorly supported by existing programming languages, and I'm not sure how to best pitch them.2 Also, they're "too big", and it takes a lot of practice to wrap your mind around using quantifiers and implication.

Sets, definitely. Sets are unordered, unique collections of objects. Lots of things we store as lists could be sets instead. I use sets way more after learning some FM.

There's also a lot of high-level ideas in FM that usefully apply to all other code. System invariants are things that should always hold in every possible state. More broadly, we can categorize properties: "safety" is roughly "bad things don't happen", "reachability" is "good things always could happen", and "liveness" is "good things always do happen". Fairness, maybe. Refinement is definitely too in-the-weeds to be scavengable.

I'd also call out decision tables as a particular technique that's extractable. You can teach decision tables in five minutes to a completely nontechnical person.

As I said, short newsletters for the rest of this month. Have a great week!


Things I'm curious about

  • Logic Programming: What ideas are scavengable from LP? I know that pattern matching originated in LP and found its way to FP from there, but that's associated with FP more now.
  • Nontechnical People: What's a good word for them? "Nondev" seems like it leaves out data scientists, DBAs, etc. "Normie" and "Muggle" are too offensive. "Layfolk" just sounds weird.
  • TLA+ workshops: Would people be interested in an "intermediate" TLA+ workshop, for people who've already written production specs but want to get better? I'm thinking half a day, 6-8 people, with people bringing up what they want to learn about in advance. If there's enough people who'd want this I can plan something out.

  1. Domain-Driven Design. 

  2. This is one big reason my attempt to write a book on predicate logic has been in a long stall. 

  3. One idea that mathematicians scavenged from APLs is the Iverson bracket: [P] = P ? 1 : 0

"Integration tests" are just vibes

1 May 2024 at 15:52

New blog post! Software Friction is about how all the small issues we run into developing software cause us to miss deadlines, and how we can address some of them. Patreon here.

"Integration tests" are just vibes

You should write more unit tests than integration tests. You should write more integration tests than unit tests. "Integrated tests" are a scam.

But what exactly is an integration test? Is it tests that cross module boundaries? Is it tests that depend on several pieces of behavior, even in one module? Is it anything that tests "emergence"1? Anything with side effects? If get_active_users() runs a database query, are all tests against it integration tests?

No wonder both Twitter and Google have (at one point) claimed that "integration test" wasn't a useful term. There are just too many different things people mean by "integration".2

Even if we can't define integration tests, we can still recognize them. There are tests that are bigger than our "unit tests" and smaller than an end-to-end test, and people have strong opinions about writing more or fewer of them. We can have a sense of something being an "integration test" even if we can't strictly define it.

I think it's more useful to think of integration tests in terms of family resemblances. There are certain measurable properties of tests, and integration tests tend to measure differently than unit tests. One property is test speed: unit tests are "fast", integration tests are "slow". This does not mean every integration test is slower than every unit test (unit property tests can be very slow). But slow tests are more likely to be integration tests than unit tests.

"Integration tests" are a vibe we get based on how we perceive a test's properties. Some of these properties are:

  1. The number of lines of production code executed as part of the test.

  2. Number of boundaries crossed, module/object or otherwise.

  3. Number of unit tests you can decompose the test into.

  4. Number of external services required to run the test.

  5. Amount of engineering effort to write/run the tests. Does it need separate files, a separate language, setting up lots of docker instances?

  6. Nonlocality of discovered bugs: if the test fails, how many lines you have to look at to find the problem.

  7. Likelihood a test failure is due to "emergence": things are correct in isolation but interact in a problematic way.

The more of these have high values, the more likely a test is to be an integration test.

Miscellaneous Thoughts

A test that hits the database (external service) can still be a unit test if it runs quickly and covers a small amount of code. Tests that make HTTP calls are less likely to be fast, so more likely to be integration tests.

Mocks and pact-style testing push some of these values down, by reducing boundaries crossed and lines of code executed at the cost of increased test infrastructure. Contracts and assertions increase bug locality, since the test fails the moment a pre/postcondition is broken instead of when the test completes.
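
As a tiny illustration of that locality point (a hypothetical function, not from a real codebase), an inline assertion fails at the exact line where the invariant breaks instead of at the end of whatever test happened to notice:

def apply_discount(price: float, discount: float) -> float:
    assert 0 <= discount <= 1, "precondition: discount is a fraction"
    result = price * (1 - discount)
    # If this fires, the bug is in the two lines above,
    # not somewhere downstream in a larger test.
    assert result <= price, "postcondition: discounts never raise the price"
    return result

assert apply_discount(100.0, 0.2) == 80.0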

If you push the values high enough, do you get end-to-end tests instead of integration tests? Or are E2E tests fundamentally different from integration tests in some way?

Is it possible for a unit test to uncover emergence, or is emergence a sufficient condition to say that something is definitely an integration test?


Things I'm curious about

Gonna make this a weekly thing maybe (or whenever I have enough new stuff to ask about).

  • Sokoban. Sokoban is "easier" than Chess or GO (PSPACE vs EXPTIME), but Sokoban AIs regularly fail at simple puzzles that humans can do (ref ex1 ex2). What makes Sokoban "so much harder" for AIs? I don't think it's just a lack of interest because I've found Google and Deepmind papers on solving sokoban (and not making much progress).
  • NFCs: I impulse-bought a bunch of NFC chips. I'm already using them to open train trackers, what else can I do with them? I'm not a mobile dev but can whip up Taskers and server code pretty quickly.

If you know something about either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Geeze I can't believe I tried to make "automanual" a thing 

  2. "Unit test" is also fuzzy but we at least agree that add(2, 3) == 5 is a unit test. 

"Testing can show the presence of bugs but not the absence"

23 April 2024 at 20:19

Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger Dijkstra, Notes on Structured Programming

Dijkstra was famous for his spicy quips; he'd feel right at home on tech social media. He said things he knew weren't absolutely true but would get people listening to him. In the same essay he discusses how testing could, in specific cases, show the absence of bugs:

  • If a function has a finite set of inputs, like floor(float), you could prove correctness by simply testing all possible inputs.
  • Even if a function has infinite inputs, tests prove the absence of bugs for tested inputs. You could say "date_of_easter works for any date within 10,000 years of today" and have "proven the absence of bugs" for the subset of dates people care about.1
  • You can use testing to complete a proof. For example, you can divide inputs into "equivalence classes", prove that if the function is correct for one value it will also be correct for all values in that class, and then test one value from each class (see the sketch after this list).
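
A rough sketch of that last idea (the function and classes are made up for illustration): if you've argued on paper that the function behaves uniformly within each sign class, one test per class completes the argument.

def my_abs(x: int) -> int:
    return -x if x < 0 else x

# Equivalence classes: negative, zero, positive.
# The "each class is uniform" step is the proof; the tests
# just pick one representative per class.
assert my_abs(-7) == 7
assert my_abs(0) == 0
assert my_abs(7) == 7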

In the majority case, however, testing cannot show the absence of bugs. Testing f(16) does not give you any guarantees about all x. f(x).

There are two possible takeaways. The reasonable one is that we shouldn't be blindly confident in testing, should use a variety of approaches as part of defense in depth, and always be careful about what we claim. The unreasonable takeaway is that testing is an inferior verification method, and we should use more "rigorous" approaches instead of testing.

It's unreasonable because methods that show the existence of bugs are more robust than methods that show the absence of them, and this makes them more useful more of the time.

Comparing tests and proof

Let's say you have a pseudocode function like this:

fun elem_before_x(s: int[], x: int) returns (r: int | NULL) {
    let i: int = index(s, x);
    if (i == NULL) {
        return NULL;
    } else return s[i-1];
}

Property:
 if r == NULL then x notin s
 else some i in 0..<len(s): s[i] == r && s[i+1] == x

The bug here is that we're out-of-bounds if x is the first element of the array. Testing will find this bug if you remember to test that case. Otherwise, we'd have a bug and a passing test suite. Tests can't verify the absence of bugs.
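
For concreteness, here's a rough Python translation (not the Dafny version discussed below) with a test suite that happens to pass. Python's negative indexing turns the out-of-bounds access into a silently wrong answer, which is arguably worse:

def elem_before_x(s: list[int], x: int) -> int | None:
    try:
        i = s.index(x)
    except ValueError:
        return None
    return s[i - 1]  # bug: when i == 0 this wraps around to s[-1]

# Both tests pass; neither puts x at index 0, so the bug survives.
assert elem_before_x([1, 2, 3], 3) == 2
assert elem_before_x([1, 2, 3], 9) is None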

Now imagine we do find it and fix it:

else if (i == 0) {
        return s.last;
    } 

Modifying the property is left as an exercise to the reader. Once again, the passing test suite means we haven't verified the absence of the bug. Now, how does formal verification do on the same problem? I wrote a Dafny version of the same problem, here, both with and without the fix. Looking at v1, we see it finds the s[0] error:

>   var v := index.Extract();
>   return Some(s[v-1]);

(19,16): Error: index out of range    
   |
19 |     return Some(s[v-1]);

This error also goes away in v2: the bug is proved to be absent. But v2 adds a new error:

> ensures if r.None? then x !in s
>           else if s[|s|-1] == r.value then s[0] == x
>           else s[IndexOf(s, r.value)+1] == x

(7,20): Error: index out of range
  |
7 |             else if s[|s|-1] == r.value then s[0] == x
  |                      ^^^^^^^

When would s[|s|-1] be an index out of range? When the list is empty. But we know that's not possible: if we return a non-null value then the list must not be empty, because we found x in it! But the prover doesn't know that, it can't infer the relationship between the returned value and what we know about the list.2 We have to annotate our code with proof hints for it to recognize that if we return a number, the list is nonempty.

In short: the prover definitely finds the out-of-bounds bug, but also raises a spurious "empty list" bug. The test might not find the former but also won't raise the latter. Tests can have false negatives but not false positives, while provers can have false positives but not false negatives.

(I'd say "formal verification can prove the absence of a bug but never the presence", but that's not true. Dafny can sometimes give you a concrete counterexample to a property, and type systems detect true positives like "you passed an array into an Int -> Int function" all the time.)

Knowing there "might" be a bug is not as useful as knowing there is one. Empirically speaking people mostly ignore compiler warnings and plenty of retrospectives find people ignore "might possibly be a bug" in even mission-critical software. Formal verification often becomes an "all or nothing" deal, either you totally prove the absence of bugs in programs or you get no benefits at all.

And that sucks! It takes a lot of effort to prove something correct, magnitudes more than running some tests. And that's why I consider testing "more robust": even a little bit of testing may find a true bug, while a little bit of verification doesn't get you much.

Correctness is not a thing

I'll end this by getting on my soapbox. Here's something that's really important to understand about "proving the absence of bugs":

You do not prove something's correct. You prove conformance to specification.

It's possible for code to conform to the specification and still have bugs. Two common cases are:

  1. The spec is wrong or incomplete, like v2.dfy is in the above example. Spoiler in the footnote.3
  2. The spec is based on an assumption that's not true in practice, like "no other program is writing to the same file."

This is also important because it puts the specification front-and-center. You can write tests without having a specification in mind, but you need a spec to do any kind of formal method. Doing FM trains you to think about specs, which in turn helps you write better tests.


Things I am interested in

  1. Crossovers from fields besides classical engineering, who worked professionally as something else before switching to software. What ideas from your old job as a journalist/librarian/social worker/acrobat could be useful in software, and vice versa?
  2. Wargaming in software. Have you designed or participated in a devops/security/system design/whatever simulated game? Was it helpful?

If you have insights into either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Though you could argue this isn't sufficient for security. If easter(-1e8) causes a memory overflow then that's a bug even if a normal user won't ever call it. 

  2. We're lucky that the Dafny standard library defines a postcondition for index functions saying the output is less than the length of the list. Otherwise the prover would raise a potential out-of-bounds error on s[IndexOf(s, x)]

  3. s[|s|-1] == r.value then s[0] == x isn't true for the input [a, r, x, r]. The good news is that Dafny refuses to verify the code because it didn't match the spec. I only realized this when I was error-checking this piece. (This doesn't refute the point I made above about false positives: the false positive was that s[|s|-1] was out of bounds!) 

What makes concurrency so hard?

16 April 2024 at 20:28

A lot of my formal specification projects involve concurrent or distributed systems. That's in the sweet spot of "difficult to get right" and "severe costs to getting it wrong" that leads to people spending time and money on writing specifications. Given its relevance to my job, I spend an awful lot of time thinking about the nature of concurrency.

As the old joke goes, concurrency is one of the two hardest things in computer science. There are lots of "accidental" reasons why: it's hard to test, it's not composable, bugs can stay latent for a long time, etc. Is there anything that makes it essentially hard? Something that makes concurrent software, by its very nature, more difficult to write than synchronous software?

The reason I hear most often is that humans think linearly, not concurrently, so are ill-equipped to reason about race conditions. I disagree: in my experience, humans are very good at concurrent reasoning. We do concurrent reasoning every time we drive a car!

More generally, some studies find that if you frame concurrent systems in human terms ("meatspace modeling"), people get quite good at finding the race conditions. So while concurrency might be difficult to reason about, I don't think it's because of a fault in our brains.

In my opinion, a better basis is state space explosion. Concurrency is hard because concurrent systems can be in a lot of different possible states, and the number of states grows much faster than anyone is prepared for.

Behaviors, Interleavings, and Nondeterminism

Take agents1 {1, 2, … n}, each of which executes a linear sequence of steps. Different agents may have different algorithms. Think writing or reading from a queue, incrementing a counter, anything like that. The first agent takes p1 atomic steps to complete, the second p2, etc. Agents complete their programs strictly linearly, but another agent can interleave after every atomic step. If we have algorithms A1A2 and B1B2, they could execute as A1B1A2B2 or A1B1B2A2, but not A1B2B1A2.

Under those conditions, here's an equation for how many possible orderings of execution (behaviors) can happen:

(p1+p2+...)!
------------
p1!*p2!*...

If we have three agents all executing the same 2-step algorithm, that's 6!/8 = 90 distinct behaviors. Each step in the sequence can potentially lead to a distinct state, so there's 6*90=540 maximum distinct states (MDS). Any one of those can potentially be the "buggy" one.

In most cases, the actual state space will be significantly smaller than the MDS, as different behaviors will cross the same states. Counterbalancing this is just how fast this formula grows. Three 3-step agents give us 1680 possible behaviors (~15K MDS), while four 2-step agents have 2520 (~20K MDS). And this is all without any kind of nondeterminism! If one step in one of the agents can nondeterministically do one of three things (send message M₁, send message M₂, crash), that triples the number of behaviors and the MDS.
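
A quick sanity check of those numbers, using the formula above:

from math import factorial, prod

def behaviors(*steps: int) -> int:
    """(p1 + p2 + ...)! / (p1! * p2! * ...)"""
    return factorial(sum(steps)) // prod(factorial(p) for p in steps)

print(behaviors(2, 2, 2))                   # 90
print(behaviors(3, 3, 3))                   # 1680
print(behaviors(2, 2, 2, 2))                # 2520
print(sum((2, 2, 2)) * behaviors(2, 2, 2))  # 540 maximum distinct states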

It's pretty common for complex concurrent systems to have millions or tens of millions of states. Even the toy model I use in this article on model optimization has about 8 million distinct states (though with some work you can get it much lower). I mostly think about state spaces in terms of performance because large state spaces take a lot longer to model-check. But it's also why concurrency is essentially hard to reason about. If my theoretical MDS is two million states, my practical state space is just 1% that size, and my human brain can reason through 99.9% of the remaining states… that still leaves 20 edge cases I missed.

Shrinking the State Space

Here's a heuristic I use a lot:

All means of making concurrency 'easier' are concerned first and foremost with managing the state space.

It's not 100% true (no heuristic is) but it's like 60% true, and that's good enough. State spaces grow quickly and bigger state spaces cause problems. If we want to make maintainable concurrent systems, we need to start by shrinking the space.

Like look at threads. Threads share memory, and the thread scheduler has a lot of freedom to suspend threads. So you have lots of steps (potentially one per line of code)2 and any interleaving can lead to a distinct state. I can use programming constructs like mutexes and barriers to "prune" the state space and give me the behaviors I want, but given how big the state space can be, I have to do a lot of pruning to get the right behaviors. I can make mistakes in implementation, "misshape" the space (like by adding a deadlock), or not notice a buggy state I need to remove. Threads are very error prone.

I could instead switch to memory-isolated processes. The scheduler can still schedule whatever interleavings it wants but the processes can't muck with each other's internal memory. Internally a process is still executing A1A2A3A4, but if the only steps that involve external resources (or interprocess communication) are A2 and A4, then we only need to think about how A2A4 affects the state space. Three four-step agents have 35k interleavings, three two-step agents have only 90. That's a big improvement!

What else can we do? Low-level atomic instructions do more in a single step, so there's no room for interleaving. Database transactions take a lot of physical time but represent only one logical time step. Data mutations create new steps, which immutable data structures avoid by definition.

Languages have constructs to better prune the resulting state space: go's channels, promises/futures/async-await, nurseries, etc. I think you can also treat promises as a way of forcing "noninterleaving": once a future starts executing, it runs in full (or up to the next yield point) before anything else interleaves. Please don't quote me on this.

I think CRDTs reduce state space by permitting interleavings, but arranging things so that external changes are commutative: A1B1 and B1A1 lead to the same final result, so they don't count as distinct states.
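
A toy illustration of that commutativity (a stand-in, not a real CRDT): if each agent's update only touches its own slot, applying two updates in either order converges to the same state.

def increment(state: dict[str, int], agent: str) -> dict[str, int]:
    new = dict(state)
    new[agent] = new.get(agent, 0) + 1
    return new

s0: dict[str, int] = {}
# A then B, versus B then A: same final state, so no extra states to reason about.
assert increment(increment(s0, "A"), "B") == increment(increment(s0, "B"), "A")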

Again, this is all a very rough heuristic.

Limits

To a first-order approximation, smaller state space == good. But this doesn't account for the topology of the space: some spaces are gnarlier than others. One that has lots of different cycles will be harder to work with than one that's acyclic.3 Different forms of nondeterminism also matter: "the agent continues or restarts" leads to more complex behavior than "the writer sends message M₁ or M₂." These are both places where "humans are bad at reasoning about concurrency" appears again. We often work through concurrency by reducing groups of states to "state equivalence classes" that all have the same properties. Complicated state spaces have more equivalence classes to work through.

(Another way "humans are bad at reasoning about concurrency" can be a real thing: we might not notice that something is nonatomic.)

Some high-level paradigms can lead to particular state space topologies that have fewer interleavings or ones that have more equivalent states. I've heard people claim that fork-join and pipe-filter are especially easy for humans to reason about, which I take to mean "doesn't lead to a gnarly state space". Maybe also event-loops? Where does the actor model fit into all of this?

Another limit: there's a difference between buggy states and buggy behaviors. Some behaviors can go entirely through safe states but still cause a bug like "never reaches consistency". This is called a "liveness bug", and I talk more about them here. Liveness bugs are much harder to reason about.

Okay, that's enough philosophical rambling about concurrency. Concurrency is hard, don't feel bad if you struggle with it, it's not you, it's combinatorics.


Video Appearance

I was on David Giard's Technology and Friends talking about TLA+. Check it out here!


  1. An agent is any concurrent actor, like a thread, a process, a future, a human, etc. Concurrency literature uses "processes" a lot but that clashes with OS processes. 

  2. Or more! If writes are nonatomic, you can get data races, where two threads write to different bits in a byte at the same time. Then a variable assignment can be multiple steps in one line. 

  3. Is there any research correlating unusual subgraphs in state spaces with probable bugs? 

Some notes on for loops

10 April 2024 at 17:08

New Blogpost

Don't let Alloy facts make your specs a fiction, about formal methods practices (and a lot of Alloy). Patreon link here.

Some notes on for loops

I sometimes like to sharpen the axe by looking at a basic programming concept and seeing what I can pull out. In this case, for loops.

There are two basic forms of for loops:

for(i = 0; i < 10; i++)

for val in ["a", "b", "c"]:

The second is usually called a foreach loop, and I'll call the former "forstep", after the ALGOL-60 construct. Almost all languages have one of the two, some have both. While forstep loops show up early, foreach loops only appear in PLs ~20 years later.

Looking at this, here's some things I see:

Construct reduction

Given a foreach loop, I can syntactically translate it into a forstep loop:

for val in arr:
    # ...

-->

for(i = 0; i < arr.len; i++) {
    val = arr[i];
    // ...
}

There's a lot of terms we use to express this property. We can say that foreach is homomorphic or reducible to forstep, or that foreach can be constructed out of forstep, or others I can't think of rn.

If we could also construct forstep from foreach, we could say the two are equivalent or isomorphic. In theory we could do this if we had a range operator:

for(i = 0; i < 10; i += 2) 

--> 

for i in range(0, 10, step=2);

Most forstep implementations, though, can do things like modify the iterator in the middle of the loop:

for(i = 0; i < 10; i++) {
    // stuff
    if(cond) i = 0;
    // stuff
}

You can't convert this to a foreach loop: it's an "irreducible" feature.

Most people consider modifying the index bad practice, though. 99% of the use cases of forstep loops are constructible with foreach loops, so the two are morally equivalent. I'll treat them as roughly interchangeable going forward.

Primitives and Expressivity

For loops are reducible to while loops:

for(i = 0; i < 10; i += 1) 

-->

i = 0;
while i < 10 {
  // ...
  i++;
}

But while loops are not reducible to for loops:

while(user_cmd != "quit")

-->

???

Again, lots of terms to cover this: while loops are more "primitive", more "powerful", more "low-level", more "expressive", etc. If you have while loops, you don't need for loops, and if you have goto, you don't need either.

But you shouldn't use a goto when a while loop works or a while loop where a for loop works. This is a generalization of the rule of least power (RLP): given a problem, use the tool that is most specific to the problem. If you can iterate through a list with a foreach loop, don't use a while instead!

There's two reasons for this. One is that it's more communicative to other programmers. If they see a foreach loop, they know it's going to be used for some kind of iteration, while that's not clear with the while loop. Two is the capability/tractability tradeoff: the more limited a construct is, the easier it is to understand and work with. A foreach loop can do less than a while loop, but it's much easier to guarantee that it processes each item only once.

Programming languages evolve to replace the all-powerful "low-level" constructs with more limited constructs that reduce to the low-level ones. It's why you get things like interfaces over inheritance.

The corollary of this is that if you use a powerful construct, it should be because the limited one can't do what you want. Don't force an infinite for-loop!

Extensions to for loops

For loops reduce to while loops, while loops reduce to goto, everything reduces to goto. So there are some kinds of looping goto can do that for and while cannot.

They get a little closer to feature-parity by adding flow-control inside the loop: continue for advancing the loop early, break for terminating it. I didn't look into when or how these were added, but I'd guess they come from looking at what goto can do that loops could not, and adding features as needed.

"Imperative programming" is just for loops

I think one of the most basic functions of a programming language is to iterate through an array. So much so that you can distinguish different programming paradigms by how they iterate.

  • Imperative programming: for loops
  • OOP: a foreach method on iterable objects
  • Functional programming: recursion, reduce (we'll talk about this later)
  • Logic programming: I'm less familiar with this, but I think it's recursion + pattern matching (which ML and Haskell adopted)
  • Array programming: all operations take arrays by default.

I think this explains why arrays are universal while dictionaries are relatively new: there's no obvious basic operation that covers what you do with dicts to the degree that for covers what arrays do.

(Question: do list comprehensions lead to a new paradigm? How do you describe "iteration" in spreadsheet languages?)

Functional replacements

When people talk about replacements for foreach loops in FP, they usually bring up map and filter. But these are not equivalent. Rather map and filter are "preferred" to loops because they are more limited. It's another instance of RLP.

As far as I can tell, reduce is equivalent to foreach. Morally, anyway. This is interesting because reduce is considered part of the "big three" (along with map and filter), but it's also the one that's the "odd one out". It's the hardest for people to understand, and Haskell even has three versions of it. Is it confusing because it's so powerful?

Map and filter are composable in a way that foreach loops are not, but they're also weirdly less "composable" in one very specific way? Consider something of the form

map f (filter p arr)

-->

out = []
for x in arr:
  if p(x):
    out.append(f(x))

The map/filter iterates over the list twice, while the foreach loop iterates only once. I know Haskell can fuse maps together as an optimization but I don't think you can safely fuse arbitrary map/filters? I dunno.

You can fuse map and filter with a reduce:

reduce ((out, x) => 
        if p(x) 
        then out ++ [f(x)]
        else out
       ), [], arr

This gets really awkward though.
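
The same fusion in runnable Python, for concreteness (f and p are placeholder functions):

from functools import reduce

f = lambda x: x * 2          # the "map" step
p = lambda x: x % 2 == 0     # the "filter" step

arr = [1, 2, 3, 4]
fused = reduce(lambda out, x: out + [f(x)] if p(x) else out, arr, [])
assert fused == [f(x) for x in arr if p(x)]  # one pass, same result: [4, 8]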

I wonder if "reduce is equivalent to for loops" plus "reduce is awkward for fusion" somehow inspired clojure transducers.

I'll be writing this forever

Things I could have looked at in more detail but ran out of time and energy:

  • "Iterators" as a general concept.
  • Loop invariants and variants. This is really relevant to formal methods in particular.
  • Specific implementations of for loops in languages. ALGOL's loop expression-lists, Python's for else, Raku's phasers, javascript's for in/for of distinction, Ada and Chapel loop parallelization constructs... If I keep going I'll get something as long as this.

Following any of those leads would get us to different insights, as opposed to the main thing that ended up catching my interest with these notes, which is how the rule of least power applies to language constructs. I should probably write a blog post about it!

Anyway, I think this kind of exercise, looking at something everybody takes for granted and trying to learn things from it, is really valuable for understanding software better. I don't know how interesting it is to read about, though. Lemme know if you think you'd like to see this for other stuff, or if I should just share the essays that are downstream of the practice (like writing about RLP).


Podcast appearance: are we engineers?

I was on a podcast, talking about the crossover project and what software engineers can learn from traditional engineering. Check it out here!

Why do regexes use `$` and `^` as line anchors?

25 March 2024 at 16:40

Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you've got a blog come join us! You don't need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.

But I am writing a 3000-word hell essay, so I'll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:

b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?

I'll sideline ^ for now and focus on $. The original QED editor didn't have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:

Two minor devices offer additional convenience. The character "." refers to the current line and the character "$" to the last line in the buffer.

So $ already meant "the end of the buffer", and Ken adapted it to mean "the end of the line" in regexes.

Okay, but then why did Deutsch and Lampson use $ for "end of buffer"?

Things get tenuous

The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references2) that the SDS-930 used a Teletype Model 35 as input devices. The only information I can find about the model 35 is this sales brochure, which has a blurry picture of the keyboard:

Blurry picture of a keyboard

I squinted at it really hard and saw that it's missing the []{}\|^_@~ symbols. Of the remaining symbols, $ is by far the most "useless": up until programming it exclusively meant "dollars", whereas even something like # meant three different things. But also, $ is so important in business that every typewriter has one. So it's a natural pick as the "spare symbol".

Yes this is really tenuous and I'm not happy with it, but it's the best answer I got.

If we're willing to stick with it, we can also use it to explain why Ken chose ^ to mean "beginning of line". ^ isn't used in American English, and the only reason QED wasn't using it was because it wasn't on the Teletype Model 35. But Ken's keyboard did have ^, even when it wasn't standardized at the time, so he was able to use it.

(Why did it have ^? My best guess is that it's because ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says "it follows ASCII" but the keyboard didn't include many of the symbols; it just used the encoding format.)

So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!


  1. I had this question as part of a larger rabbit hole on the history of regular expressions, which might become a full blog post at some point in May. 

  2. I have a pathological hatred of secondary sources. 

What if the spec doesn't match the code?

19 March 2024 at 16:23

Whenever I talk about formal methods, I get the same question:

Can I use the spec to generate my code?

People are worried about two things. One, that they'll make a mistake implementing the specification and have bugs. Two, that over time the implementation will "drift" and people won't update the spec, leading to unexpected design issues.

These are valid fears, and both are things that happen when writing specifications! Which is why it extra sucks that my answer is usually "you can't". This newsletter will be a quick explanation as to why, plus why it's still a net win to write a spec, and an alternate trick you can do instead to keep things in sync.

Synchronization is too expensive

There's two ways we can keep code in sync with a spec. The first way is program extraction, where you take a spec and generate a complex program from it. Lots of theorem provers, like Coq and Agda, have this functionality. The second way is refinement, where you write code independently and then prove it somehow matches the spec. This is what the seL4 people did with their C code.

Extractable/refinable specs are much, much harder to write than an "unsynced" spec because you lose a level of abstraction. Specifications can exist at a higher level of abstraction than a program can, which is why they can be so much more concise and focused. For example, here's a quick PlusCal snippet1 for a reader pulling a value off a queue and adding it to that reader's total:

await Nonempty(queue);
with r \in Readers do
     total[r] := total[r] + Head(queue);
     queue := Tail(queue);
end with;

What would that look like in a program? You'd probably have code for setting up the readers, code for reading from the queue, code for persisting the value of total somewhere. If it's in another system like Postgres or redis, you'd have a client library for that, too. And around all of that you'd have logging, heartbeats, event hooks, a whole bunch of orthogonal concerns.
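
A hypothetical sketch of just the reader loop, where queue_client, store, and logger are stand-ins for whatever infrastructure you'd actually use:

def reader_loop(reader_id: str, queue_client, store, logger):
    while True:
        msg = queue_client.receive()       # network call: timeouts, retries, backoff...
        if msg is None:
            continue
        logger.info("reader %s got %s", reader_id, msg.body)
        store.increment_total(reader_id, int(msg.body))  # persistence, transactions...
        queue_client.ack(msg)              # plus heartbeats, metrics, event hooks...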

You might not even want to implement the whole spec! This is the machine vs world dichotomy: the spec can encode both the system you're designing and the environment around that system. Let's say you're using AWS SQS for the queue, so there's a chance you can't connect to it, and a chance it duplicates a message. You can't implement either of those things in your system, only the code that detects and reacts to them. But adding those is par for the course in specification. It might look like this:

await Nonempty(queue);
with r \in Readers do
    either
        \* unreachable!
        status[r] := "error";
    or
        \* duplicate!
        total[r] := total[r] + Head(queue);
    or
        \* happy path
        total[r] := total[r] + Head(queue);
        queue := Tail(queue);
    end either;
end with;

To synchronize spec and code, you need to bring the spec closer to the level of code, which means losing a lot of abstraction power. It's definitely possible, but considerably more expensive. Synchronization is not affordable for most companies.

But specs are still worth it

Even if you can't keep the spec in sync, it's still a net benefit. Consider what happens if you have a bug in production:

  • With a spec: either you didn't implement the design correctly, or an assumption in the design ("the queue is perfectly reliable") was invalid
  • Without a spec: you didn't implement the design correctly, an assumption in the design was invalid, or the design itself has a flaw.

That last one is the most dangerous and the most expensive to fix — this talk estimated that one design flaw cost their team a year of rework. They're also really hard to correctly diagnose without a spec. I've often seen people think a design bug was actually an implementation bug. They'd "fix" the bug and it'd go away for just long enough for everybody to forget about it before suddenly resurfacing.

And then you gotta make sure that the new design isn't buggy, either.

That's why formal methods are a net win, even without syncing with code. With a verified design you can build systems faster and you can fix bugs faster, because you're confident that the bugs you got are "only" implementation errors.2

We're working on it

A confession: the "this is impossible" section was glazing over a lot of interesting work.

First of all, cheaply generating code from specs might not be viable in general, but it's possible for specific subsets of specification. This is mostly the realm of research projects like PGo, which converts (a variant of) PlusCal to Go code. You can see some examples here. I don't think it's going to be production ready for a long time, but at least it's being worked on.

A more exciting approach is generating tests from specs. I know a few companies who've done this, with the most public example being MongoDB. They discuss their approach in eXtreme modeling in practice:

We wrote a new TLA+ specification for this purpose and used the state space explored by the TLC model checker to generate C++ and Golang test cases for aspects of the OT algorithm. These test cases can be automatically generated and run after any change to the model, and ensure that both the C++ and Golang implementations make appropriate changes to maintain equivalence.

More specifically, they took a TLA+ spec and dumped the state graph, then decomposed it into a set of linear behaviors and converted those into about 5,000 tests. This works better than code generation because tests are, in a strange way, "more abstract" than code. Lots of implementations pass the same test suite. So it's less of a jump.

The TLA+ model checker has a bunch of features to make test generation easier: simulation mode lets you grab random traces and ALIAS can fine-tune the trace outputs. There's also an IO extension for calling subprocesses from model checking, which can be used to drive a test harness.

Two important caveats. One, the MongoDB team wrote a bunch of custom code to make this work; there's no off-the-shelf solution for turning specs into tests. Second, they were specifying a low-level, deterministic program, so were able to keep the TLA+ relatively close to the code. Writing a test harness for a distributed system is a lot harder. It's still a promising avenue!


Comments Section

I got a lot of responses to my graph types essay. I compiled some of the most interesting ones into a "Comments Section", here. There's some cool stuff about datalog and Rubik's Cubes there!


  1. PlusCal is a DSL that compiles to pure TLA+. I'm using it here because the equivalent TLA+ constructs are a little harder for outsiders to read. 

  2. One weak point in this is the "An assumption in the design was wrong" problem, but generally it's much easier to adapt a working design that fails under one broken assumption than an ad hoc design that might not have worked in the first place. 

Be Suspicious of Success

16 October 2024 at 15:08

From Leslie Lamport's Specifying Systems:

You should be suspicious if [the model checker] does not find a violation of a liveness property... you should also be suspicious if [it] finds no errors when checking safety properties.

This is specifically in the context of model-checking a formal specification, but it's a widely applicable software principle. It's not enough for a program to work, it has to work for the right reasons. Code working for the wrong reasons is code that's going to break when you least expect it. And since "correct for right reasons" is a much narrower target than "correct for any possible reason", we can't assume our first success is actually our intended success.

Hence, BSOS: Be Suspicious of Success.

Some useful BSOS practices

The standard way of dealing with BSOS is verification. Tests, static checks, model checking, etc. We get more confident in our code if our verifications succeed. But then we also have to be suspicious of that success, too! How do I know whether my tests are passing because they're properly testing correct code or because they're failing to test incorrect code?

This is why test-driven development gurus tell people to write a failing test first. Then at least we know the tests are doing something (even if they still might not be testing what they want).

The other limit of verification is that it can't tell us why something succeeds. Mainstream verification methods are good at explaining why things fail— expected vs actual test output, type mismatches, specification error traces. Success isn't as "information-rich" as failure. How do you distinguish a faithful implementation of is_collatz_counterexample from return false?

A broader technique I follow is make it work, make it break. If code is working for the right reasons, I should be able to predict how to break it. This can be either a change in the runtime (this will livelock if we 10x the number of connections), or a change to the code itself (commenting out this line will cause property X to fail). 1 If the code still works even after the change, my model of the code is wrong and it was succeeding for the wrong reasons.
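
A minimal sketch of the second kind of check (the function and test are hypothetical): predict a specific code change that should break a specific test, then verify it actually does.

def dedupe(items: list[int]) -> list[int]:
    return list(dict.fromkeys(items))

def test_dedupe_removes_duplicates():
    assert dedupe([1, 1, 2]) == [1, 2]

# Prediction: replacing dedupe's body with `return items` must make this test fail.
# If it still passes, the test was succeeding for the wrong reasons.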

Happy and Sad Paths

A related topic (possibly subset?) is "happy and sad paths". The happy path of your code is the behavior when everything's going right: correct inputs, preconditions are satisfied, the data sources are present, etc. The sad path is all of the code that handles things going wrong. Retry mechanisms, insufficient user authority, database constraint violation, etc. In most software, the code supporting the sad paths dwarfs the code in the happy path.

BSOS says that I can't just show code works in the happy path, I also need to check it works in the sad path.

BSOS also says that I have to be suspicious when the sad path works properly, too.

Say I add a retry mechanism to my code to handle the failure mode of timeouts. I test the code and it works. Did the retry code actually run? Did it run regardless of the original response? Is it really doing exponential backoff? Will it stop after the maximum retry limit? Is the sad path code after the maximum retry limit working properly?
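
Here's a sketch of testing one of those sad paths directly. fetch_with_retry is a stand-in for the retry mechanism under test (the real one's backoff timing and behavior on other errors would get their own tests):

import pytest

def fetch_with_retry(get, max_retries: int):
    """Stand-in retry mechanism: retries `get` on TimeoutError, then gives up."""
    for attempt in range(max_retries + 1):
        try:
            return get()
        except TimeoutError:
            if attempt == max_retries:
                raise

class AlwaysTimesOut:
    def __init__(self):
        self.calls = 0
    def __call__(self):
        self.calls += 1
        raise TimeoutError

def test_gives_up_after_max_retries():
    fake = AlwaysTimesOut()
    with pytest.raises(TimeoutError):
        fetch_with_retry(fake, max_retries=3)
    # The retry code actually ran, and it stopped at the limit:
    assert fake.calls == 4  # one initial attempt + three retries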

One paper found that 35% of catastrophic distributed system failures were caused by "trivial mistakes in error handlers" (pg 9). These were in mature, battle-hardened programs. Be suspicious of success. Be more suspicious of sad path success.


Blog Rec

This week's blog rec is Red Blob Games!2 While primarily about computer game programming, the meat of the content is beautiful, interactive guides to general CS algorithms. Some highlights:

(I don't think his rss feed covers new interactive articles, only the blog specifically.)


  1. Jimmy Koppel once proposed that just as code has properties, code variations have superproperties. For example, "no modification to the codebase causes us to use a greater number of deprecated APIs." 

  2. Okay, it's more an article site, because there's also a Red Blob blog (which covers a lot of neat stuff, too). Maybe I should just rename this section to "site rec". 

How to convince engineers that formal methods is cool

8 October 2024 at 16:18

Sorry there was no newsletter last week! I got COVID. Still got it, which is why this one's also short.

Logic for Programmers v0.4

Now available! This version adds a chapter on TLA+, significantly expands the constraint solver chapter, and adds a "planner programming" section to the Logic Programming chapter. You can see the full release notes on the book page.

How to convince engineers that formal methods is cool

I have an open email for answering questions about formal methods,1 and one of the most common questions I get is "how do I convince my coworkers that this is worth doing?" Usually the context is that the reader is really into the idea of FM but their coworkers don't know it exists. The goal of the asker is to both introduce FM and persuade their coworkers that it's useful.

In my experience as a consultant and advocate, I've found that there's only two consistently-effective ways to successfully pitch FM:

  1. Use FM to find an existing bug in a work system
  2. Show how FM finds a historical bug that's already been fixed.

Why this works

There's two main objections to FM that we need to address. The first is that FM is too academic and doesn't provide a tangible, practical benefit. The second is that FM is too hard; only PhDs and rocket scientists can economically use it. (Showing use cases from AWS et al aren't broadly persuasive because skeptics don't have any insight into how AWS functions.) Finding an existing bug hits both: it helped the team with a real problem, and it was done by a mere mortal.

Demonstrating FM on a historical bug isn't as effective: it only shows that formal methods could have helped, not that it actually does help. But people will usually remember the misery of debugging that problem. Bug war stories are popular for a reason!

Making historical bugs persuasive

So "live bug" is a stronger rec, but "historical bug" tends to be easier to show. This is because you know what you're looking for. It's easier to write a high-level spec on a system you already know, and show it finds a bug you already know about.

The trick to make it look convincing is to make the spec and bug as "natural" as possible. You can't make it seem like FM only found the bug because you had foreknowledge of what it was— then the whole exercise is too contrived. People will already know you had foreknowledge, of course, and are factoring that into their observations. You want to make the case that the spec you're writing is clear and obvious enough that an "ignorant" person could have written it. That means nothing contrived or suspicious.

This is a bit of a fuzzy definition, more a vibe than anything. Ask yourself "does this spec look like something that was tailor-made around this bug, or does it find the bug as a byproduct of being a regular spec?"

A good example of a "natural" spec is the bounded queue problem. It's a straight translation of some Java code with no properties besides deadlock checking. Usually you'll be at a higher level of abstraction, though.


Blog rec: arg min

This is a new section I want to try for a bit: recommending tech(/-adjacent) blogs that I like. This first one is going to be a bit niche: arg min is writing up lecture notes on "convex optimization". It's a cool look into the theory behind constraint solving. I don't understand most of the math but the prose is pretty approachable. Couple of highlights:

The blogger also talks about some other topics but I haven't read those posts much.


  1. As always, talking primarily about formal specification of systems (TLA+/Alloy/Spin), not formal verification of code (Dafny/SPARK/Agda). I talk about the differences a bit here (but I really need to write a more focused piece). 

Refactoring Invariants

24 September 2024 at 20:06

(Feeling a little sick so this one will be short.)

I'm often asked by clients to review their (usually TLA+) formal specifications. These specs are generally slower and more convoluted than an expert would write. I want to fix them up without changing the overall behavior of the spec or introducing subtle bugs.

To do this, I use a rather lovely feature of TLA+. Say I see a 100-line Foo action that I think I can refactor down to 20 lines. I'll first write a refactored version as a separate action NewFoo, then I run the model checker with the property

RefactorProp ==
    [][Foo <=> NewFoo]_vars

That's an intimidating nest of symbols but all it's saying is that every Foo step must also be a NewFoo step, and vice versa. If the refactor ever does something different from the original action, the model-checker will report the exact behavior and transition it fails for. Conversely, if the model checker passes, I can safely assume they have identical behaviors.

This is a refactoring invariant:1 the old and new versions of functions have identical behavior. Refactoring invariants are superbly useful in formal specification. Software devs spend enough time refactoring that they'd be useful for coding, too.

Alas, refactoring invariants are a little harder to express in code. In TLA+ we're working with bounded state spaces, so the model checker can check the invariant for every possible state. Even a simple program can have an unbounded state space via an infinite number of possible function inputs.

(Also formal specifications are "pure" simulations while programs have side effects.)

The "normal" way to verify a program refactoring is to start out with a huge suite of oracle tests. This should catch a bad refactor via failing tests. The downside is that you might not have the test suite in the first place, or not one that covers your particular refactoring. Second, even if the test suite does, it only indirectly tests the invariant. It catches the refactoring error as a consequence of testing other stuff. What if we want to directly test the refactoring invariant?

Two ways of doing this

One: by pulling in formal methods. Ray Myers has a neat video on formally proving a refactoring is correct. That one's in the niche language ACL2, but he's also got one on refactoring C. You might not even need to prove the refactoring correct; you could probably get away with using an SMT solver to find counterexamples.

Two: by using property-based testing. Generate random inputs, pass them to both functions, and check that the outputs are identical. Using the python Hypothesis library:

from hypothesis import given
import hypothesis.strategies as st

# from the `gilded rose kata`
def update_quality(items: list[Item]):
    ...

def update_quality_new(items: list[Item]):
    ...

@given(st.lists(st.builds(Item)))
def test_refactoring(l):
    assert update_quality(l) == update_quality_new(l)

One tricky bit is if the function is part of a long call chain A -> B -> C, and you want to test that refactoring C into C' doesn't change the behavior of A. You'd have to add a B' that uses C' and then an A' that uses B'. Maybe you could instead create a branch, commit the change to C in that branch, and then run a cross-branch test against each branch's A.

Impure functions are harder. The test now makes some side effect twice, which could spuriously break the refactoring invariant. You could instead test that the changes are the same, or try to get the functions to affect different entities and then compare the updates of each entity. There's no general solution here though, and there might be No Good Way for a particular effectful refactoring.

Behavior-changing rewrites

We can apply similar ideas for rewrites that change behavior. Say we have an API, and v1 returns a list of user names while v2 returns a {version, userids} dict. Then we can find some transformation of v2 into v1, and run the refactoring invariant on that:

def v2_to_v1(v2_resp):
    return [User(user_id).name for user_id in v2_resp["userids"]]

@given(some_query_generator)
def test_refactoring(q):
    assert v1(q) == v2_to_v1(v2(q))

Fun fact: v2_to_v1 is a software homomorphism!


  1. Well technically it's an action property since it's on the transitions of states, not the states, but "refactor invariant" gets the idea across better. 

Goodhart's Law in Software Engineering

17 September 2024 at 16:33

Blog Hiatus

You might have noticed I haven't been updating my website. I haven't even looked at any of my drafts for the past three months. All that time is instead going into Logic for Programmers. I'll get back to the site when that's done or in 2025, whichever comes first. Newsletter and Patreon will still get regular updates.

(As a comparison, the book is now 22k words. That's like 11 blog posts!)

Goodhart's Law in Software Engineering

I recently got into an argument with some people about whether small functions were mostly a good idea or always 100% a good idea, and it reminded me a lot about Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

The weak version of this is that people have perverse incentives to game the metrics. If your metric is "number of bugs in the bug tracker", people will start spuriously closing bugs just to get the number down.

The strong version of the law is that even 100% honest pursuit of a metric, taken far enough, is harmful to your goals, and this is an inescapable consequence of the difference between metrics and values. We have metrics in the first place because what we actually care about is nonquantifiable. There's some thing we want more of, but we have no way of directly measuring that thing. We can measure something that looks like a rough approximation for our goal. But it's not our goal, and if we treat the metric as if it were the goal, we start taking actions that favor the metric over the goal.

Say we want more reliable software. How do you measure "reliability"? You can't. But you can measure the number of bugs in the bug tracker, because fewer open bugs roughly means more reliability. This is not the same thing. I've seen bugs fixed in ways that made the system less reliable, but not in ways that translated into tracked bugs.

I am a firm believer in the strong version of Goodhart's law. Mostly because of this:

A peacock with its feathers out. The peacock is scremming

What does a peahen look for in a mate? A male with maximum fitness. What's a metric that approximates fitness? How nice the plumage is, because nicer plumage = more spare calories to waste on plumage.1 But that only approximates fitness, and over generations the plumage itself becomes the point, at the cost of overall bird fitness. Sexual selection is Goodhart's law in action.

If the blind watchmaker can fall for Goodhart, people can too.

Examples in Engineering

Goodhart's law is a warning for pointy-haired bosses who come up with terrible metrics: lines added, feature points done, etc. I'm more interested in how it affects the metrics we set for ourselves that our bosses might never know about.

  • "Test coverage" is a proxy for how thoroughly we've tested our software. It diverges when we need to test lots of properties of the same lines of code, or when our worst bugs are emergent at the integration level.
  • "Cyclomatic complexity" and "function size" are proxies for code legibility. They diverges when we think about global module legibility, not local function legibility. Then too many functions can obscure the code and data flow.
  • Benchmarks are proxies for performant programs, and diverge when improving benchmarks slows down unbenchmarked operations.
  • Amount of time spent pairing/code reviewing/debugging/whatever proxies "being productive".
  • The DORA report is an interesting case, because it claims four metrics2 are proxies to ineffable goals like "elite performance" and employee satisfaction. It also argues that you should minimize commit size to improve the DORA metrics. A proxy of a proxy of a goal!

What can we do about this?

No, I do not know how to avoid a law that can hijack the process of evolution.

The 2023 DORA report suggests readers should avoid Goodhart's law and "assess a team's strength across a wide range of people, processes, and technical capabilities" (pg 10), which is kind of like saying the fix to production bugs is "don't write bugs". It's a guiding principle but not actionable advice that gets to that principle.

They also say "to use a combination of metrics to drive deeper understanding" (ibid), which makes more sense at first. If you have metrics X and Y to approximate goal G, then overoptimizing X might hurt Y, indicating you're getting further from G. In practice I've seen it turn into "we can't improve X because it'll hurt Y and we can't improve Y because it'll hurt X." This could mean we're at the best possible spot for G, but more often it means we're trapped very far from our goal. You could come up with a weighted combination of X and Y, like 0.7X + 0.3Y, but that too is a metric subject to Goodhart.

I guess the best I can do is say "use your best engineering judgement"? Evolution is mindless, people aren't. Again, not an actionable or scalable bit of advice, but as I grow older I keep finding "use your best judgement" is all we can do. Knowledge work is ineffable and irreducible.


  1. This sent me down a rabbit hole; turns out scientists are still debating what exactly the peacock's tail is used for! Is it sexual selection? Adverse signalling? Something else??? 

  2. How soon commits get to production, deployment frequency, percent of deployments that cause errors in production, and mean time to recovery. 

Why Not Comments

10 September 2024 at 19:40

Logic For Programmers v0.3

Now available! It's a light release as I learn more about formatting a nice-looking book. You can see some of the differences between v2 and v3 here.

Why Not Comments

Code is written in a structured machine language, comments are written in an expressive human language. The "human language" bit makes comments more expressive and communicative than code. Code has a limited amount of something like human language contained in identifiers. "Comment the why, not the what" means to push as much information as possible into identifiers. Not all "what" can be embedded like this, but a lot can.

In recent years I see more people arguing that whys do not belong in comments either, that they can be embedded into LongFunctionNames or the names of test cases. Virtually all "self-documenting" codebases add documentation through the addition of identifiers.1

So what's something in the range of human expression that cannot be represented with more code?

Negative information, drawing attention to what's not there. The "why nots" of the system.

A Recent Example

This one comes from Logic for Programmers. For convoluted technical reasons the epub build wasn't translating math notation (\forall) into symbols (∀). I wrote a script to manually go through and replace tokens in math strings with unicode equivalents. The easiest way to do this is to call string = string.replace(old, new) for each one of the 16 math symbols I need to replace (some math strings have multiple symbols).

This is incredibly inefficient and I could instead do all 16 replacements in a single pass. But that would be a more complicated solution. So I did the simple way with a comment:

Does 16 passes over each string
BUT there are only 25 math strings in the book so far and most are <5 characters.
So it's still fast enough.
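
For concreteness, here's roughly what that simple approach looks like (a minimal sketch; the symbol table and function name are made up, and the real script does a bit more):

# Sketch of the naive approach: one .replace() pass per symbol.
# SYMBOLS and fix_math_string are illustrative names, not the real code.
SYMBOLS = {
    r"\forall": "∀",
    r"\exists": "∃",
    r"\in": "∈",
    # ...the other 13 symbols
}

def fix_math_string(string: str) -> str:
    # Does 16 passes over each string
    # BUT there are only 25 math strings in the book so far and most are <5 characters.
    # So it's still fast enough.
    for old, new in SYMBOLS.items():
        string = string.replace(old, new)
    return string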

You can think of this as a "why I'm using slow code", but you can also think of it as "why not fast code". It's calling attention to something that's not there.

Why the comment

If the slow code isn't causing any problems, why have a comment at all?

Well first of all the code might be a problem later. If a future version of LfP has hundreds of math strings instead of a couple dozen then this build step will bottleneck the whole build. Good to lay a signpost now so I know exactly what to fix later.

But even if the code is fine forever, the comment still does something important: it shows I'm aware of the tradeoff. Say I come back to my project two years from now, open epub_math_fixer.py and see my terrible slow code. I ask "why did I write something so terrible?" Was it inexperience, time crunch, or just a random mistake?

The negative comment tells me that I knew this was slow code, looked into the alternatives, and decided against optimizing. I don't have to spend a bunch of time reinvestigating only to come to the same conclusion.

Why this can't be self-documented

When I was first playing with this idea, someone told me that my negative comment isn't necessary, just name the function RunFewerTimesSlowerAndSimplerAlgorithmAfterConsideringTradeOffs. Aside from the issues of being long, not explaining the tradeoffs, and that I'd have to change it everywhere if I ever optimize the code... This would make the code less self-documenting. It doesn't tell you what the function actually does.

The core problem is that function and variable identifiers can only contain one clause of information. I can't store "what the function does" and "what tradeoffs it makes" in the same identifier.

What about replacing the comment with a test? I guess you could make a test that greps for math blocks in the book and fails if there's more than 80? But that's not testing EpubMathFixer directly. There's nothing in the function itself you can hook into.

That's the fundamental problem with self-documenting negative information. "Self-documentation" rides along with written code, and so describes what the code is doing. Negative information is about what the code is not doing.

End of newsletter speculation

I wonder if you can think of "why not" comments as a case of counterfactuals. If so, are "abstractions of human communication" impossible to self-document in general? Can you self-document an analogy? Uncertainty? An ethical claim?


  1. One interesting exception someone told me: they make code "more self-documenting" by turning comments into logging. I encouraged them to write it up as a blog post but so far they haven't. If they ever do I will link it here. 

Thoughts on "The Future of TLA+"

4 September 2024 at 17:00

Last week Leslie Lamport posted The Future of TLA+, saying "the future of TLA+ is in the hands of the TLA+ foundation". Lamport released TLA+ in 1999 and shepherded its development for the past 25 years. Transferring ownership of TLA+ to the official foundation has been in the works for a while now, and now it's time.

In the document, Lamport also talks about some of his hopes for how the foundation will think about evolving TLA+. He gives three possible costs of any change:

  1. Breaking existing specifications
  2. Complicating the language. "Simplicity is a major goal of TLA+".
  3. Complicating the tooling. "There are few people contributing to the maintenance of the tools."

With this, Lamport proposes removing several features, like octal numbers, submodules and tuple quantification.1

I would be sad if we lost tuple quantification, less so if we lost octal numbers. But what could be added to make TLA+ better? Features that don't just "save a few keystrokes" but significantly improve the expressiveness of our specifications.

Some disclaimers. First, where I'm coming from: most of my contracts are either "get beginners started with TLA+" or "help a company write an exotic spec at the limit of what TLA+ can do." So I'm always thinking about how to teach the language better and how to abuse its most obscure features. Did you know about labeled subexpressions? You can do magic with labeled subexpressions.

So I both want the skill floor lowered and the skill ceiling raised. This is often a contradictory goal.

Second, these aren't official proposals or anything. I'm just sharing what I'd like to see, given infinite resources and no concerns about tradeoffs.

Finally, most of these aren't changes to TLA+. They're for TLC.

TLA+ vs TLC

TLA+ is the specification language. TLC is the model checker used to check that specs written in that language satisfy certain properties. It's not the only tool; you could use TLAPS or Apalache instead. But TLC is the most mature tool and the only one bundled with the standard distribution.

This is a critical distinction for two reasons. One, TLC can only model-check a subset of TLA+. There is a set of four "forbidden operators" that cannot be used in any capacity (though the lead TLC developer, Markus Kuppe, is working on adding one). Lamport lists them in section 3.3 of his essay.

Two, TLC makes some decisions about how to resolve some unspecified TLA+ operations. For example, CHOOSE x \in set: P(x) is undefined if no value in set satisfies P (Specifying Systems pg 73). In TLC, it's an error:

Attempted to compute the value of an expression of form
CHOOSE x in S: P, but no element of S satisfied P.

This catches a lot of design bugs where you expected something to exist but it doesn't. But it also makes specifying defaults awkward. What if I want to get the record with a certain id, but if there isn't one, return a default value NoRecord? Then I have to write this:

RetrieveRecord(id) ==
    IF \E r \in records: r.id = id
    THEN CHOOSE r \in records: r.id = id
    ELSE NoRecord

If the condition is longer, this gets more cumbersome, and there's always a risk of them getting out of sync.

The upside of TLC being separate from TLA+ is that we can add features to TLC without changing the overall semantics of TLA+. That's what Markus is doing in the TLC module: things like TLCGet and Assert add new features to model checking TLA+ without adding new syntax to TLA+. It should be possible to make a new operator Try(Op, Fail) that evaluates Op and, if it fails, returns Fail instead. Then we can just do

RetrieveRecord(id) ==
    Try(CHOOSE r \in records: r.id = id, NoRecord)

With that in mind, most of the things I want in TLA+ are actually things I want in TLC, with one exception.

Things I wished were in TLA+/TLC

Try(Op, Fail)

It'd also be nice for equality checking. Right now "2" = 2 raises an error, even if I want it to be false.

(Could nested Trys lead to ambiguous behavior? Something to think about.)

A more obvious set filtering

This is the only change here that would affect TLA+ itself and not just TLC. This is map and filter in TLA+:

\* Map
{M(x): x \in set}
\* Filter
{x \in set: F(x)}

Beginners can never remember which is which. Even experienced people often forget which is which! I'd like a different keyword for each, so we could instead write

\* Filter
{x \in set WHERE F(x)}
\* Mapfilter
{M(x): x \in set WHERE F(x)}

Actual keyword may vary.

A linter

In the past few years TLA+ tooling has gotten a lot better. Among other things, there's now a debugger, a treesitter plugin, and an interactive spec explorer. I hope we soon see a linter, something that could detect TLA+ snippets with valid-but-unexpected behavior. In particular, something that could detect this:

/\ A
/\ B
 => C

That extra space before => means it's parsed as A /\ (B => C), not as the expected (A /\ B) => C. I once lost a day to this. I know at least three other people who also lost a day to this. It is 100% valid TLA+ and we desperately need a tool to tell us when it happens because my gosh is it a morale-killer.

Support for the forbidden operators

There are four operators that are unsupported by TLC and virtually unused in any specs: \cdot, \EE, \AA, and -+->. I have no idea how to use \cdot but Markus is steadily adding support for it. \AA exists for completeness and even Lamport can't think of a good use for it. -+-> could be useful (It's the LTL weak release), but nobody's paid it any attention.

\EE is the interesting one, in that we know exactly how useful it would be. \EE x: P(x) is the temporal quantifier and is defined as "there exists a sequence of xs that make the temporal formula P(x) true for every step of the behavior." Critically, x can be a different value in each step, whereas \E x: P(x) would require the same value in each step.

Why is this so important? One word: refinement.

Say we have an abstract server model and we want to make a more detailed version of just the server. We can refine it by instantiating the abstract spec like this:

Abstract == INSTANCE AbstractServer 
    WITH abstract_var1 <- formula1,
         abstract_var2 <- formula2,
         \* ...

Refinement == Abstract!Spec

See the above link for more details. The trouble is that we need to apply substitutions for every variable. If the abstract spec models both the server and the client, our refinement also needs to refine the client state, too! And more frustratingly, if the abstract spec has some kind of bookkeeping variable, say a state-sweeper or auxiliary variable, you have to refine that, too.

But TLA+ provides an out here. Using \EE, we can write the refinement like this:

Abstract(aux_var, client_var) == INSTANCE AbstractServer 
    WITH abstract_var1 <- formula1,
         abstract_var2 <- formula2,
         \* ...

Refinement == \EE a, c: Abstract(a, c)!Spec

Lamport calls this variable hiding (SS pg 41).

Now I don't know much about the internals of TLC, but my best guess is that \EE would be extraordinarily expensive to check. I think a regular refinement you can check as you go, since you can generate the full abstract transition at the same time you generate the main spec's transition. But with variable hiding, you only have information to create part of the abstract transition. I think the only way to do it would be to generate the abstract state graph first and then look for a transition that matches your current transition.

That is at the very least an NP-Complete problem, one you have to compute for every edge in your refinement graph. But maybe there are some special cases (like aux variables that do not affect behavior) which would be cheaper to check?

\E-stealing

This one's both the most niche and probably the hardest to implement, so I don't expect to see it anytime soon. But this is my newsletter and it bothers me specifically.

In Composing TLA+ Specifications with State Machines I used this trick to compose two specs:

Action1 ==
    /\ state = "start"
    /\ DoThing
    /\ state' = "wait"

Sync ==
    LET i == << state, state' >> IN
    CASE i = << "start", "wait" >> ->
        SomeSyncRules
      [] << "wait, start" >> ->
        DifferentSyncRules

So far, so good. But what if I take a nondeterministic action?

Action2 ==
    /\ state = "wait"
    /\ \E x \in set:
        DoThing(x)
    /\ state' = "start"


Sync ==
    LET i == << state, state' >> IN
    CASE i = << "start", "wait" >> ->
        SomeSyncRules
      [] << "wait, start" >> ->
        ???

There's no way for the body of Sync to know which value Action2 picked for x. I've found Sync actions are a great way to compose specifications, and this "opaqueness" limits how strong it can be. It'd be great to have some way of side-channeling out what x was picked, without changing the essential semantics of Action2. Preferably without even needing to modify it.

This might also be useful for test case generation, since you can augment the output traces with more information.

I have no idea what this would look like or how it would behave. It's just a pipe dream of mine.

Other pipe dreams

A better function update syntax. Some way to partially refine one spec into multiple, seamlessly upgrading their individual variables into function variables (without too much boilerplate). Being able to inline a config file in a spec so you don't need two separate files (this is now provisionally possible! You can see an example here; run it with tlc -config Github866.tla Github866.tla). Being able to call tlc without a config file and passing in flags. Support for -+->, just for giggles. Clock variables. Non-latex ASCII syntax. UNCHANGED (all vars except <>). Parallelized liveness checking. A tree that grants all my wishes.

Comment Section

Buttondown now has comments! I'm experimentally enabling it for now, so you should be able to comment on the email's archive page. Again, this is experimental. If I end up having to spend all my time community moderating, comments are going down again. Be nice to each other.


  1. CHOOSE << x, y >> \in set \X set: x < y

State and time are the same thing

27 August 2024 at 18:13

Time is state

Imagine I put an ordinary ticking quartz clock in an empty room. I walk in, and ten minutes later I walk out with two photograph prints.1 In the 1st one, the second hand is pointing at the top of the clock, in the 2nd it's pointing at the bottom. Are these two copies of the same photo, or are they two different pictures?

A quartz clock with minute and second hands

There's no trick here, the answer is "different photos". Since the clock looks different, time must have passed between the two. More formally, we can represent the clock as a state vector of (hours, minutes, seconds). Since the two pictures have different state vectors, they must represent different photos.

Now I repeat the process and come out with two prints, both showing the same time. Are these two copies of the same photo or two different photos?

It's possible for the pictures to be different but there's no way to be certain. I could have taken the two photos half a second apart so that time passed but the clock didn't tick yet. There's no observational difference between "time didn't pass" and "time passed but the state vector is the same". We can model time only passing in one second increments, as any shorter passage of time is not reflected in the state vector.

Things would be different if you had access to the clock internals. The clock is powered by a quartz crystal that oscillates at approximately 2^15 hz, and a digital circuit inside the clock is counting that number as "one second". If you could read the memory inside the clock, then you could distinguish "00:00:15" and "00:00:15 + 2^14 oscillations".

But in our current system that state is internal to the watch. Until the circuits turn the internal value into an observable value, we cannot recognize the passage of time for the clock.

The only way we can see the passage of time is by measuring changes in observable state.

State is time

Pseudocode snippet:

void f(int &c) {
    c += 2;
}

I'm trying to say that we pass a reference of a variable into f, so int x = 1; f(x); print(x) will output 3. Calling f(x) permanently splits our program into two eras. Before f(x), all calculations with x will get one thing. Anno Effexi, all calculations with x will get another thing. The update to x advanced time.

Now a more complicated case:

void f(int &c) {
    c++;
    c++;
}

Does this advance time one step or two? Depends on whether the program is single-threaded or concurrent. If it's single threaded, when f executes there's nothing else that can read x in between the two updates, so the first mutation is "internal". Externally it looks like there was only one observable mutation x += 2. On the other hand, if the program is concurrent, it's possible for some other thread or w/e to read x in between the two statements of f. That makes the internal mutation observable, so time is advanced twice. There are now three eras of our program with different possible behaviors.

Changes in state matter in that they create new times.

The consequences

Some functional programmers say "shared mutable state is the enemy". I think it's more like "time is the enemy", and time represents itself as mutable state. If a state update is purely internal and cannot affect the observable state (such as a statement in an uncommitted transaction), then it does not advance time.

I find this a good model to reason about abstract systems. It also motivates formal methods concepts like bisimulation and refinement, which I really need to do an updated post on.


  1. I've got a GedankenWorks instant digital camera that takes digital pictures but lets me print them too. I'm sure someone sells something like this. 

An idea for teaching formal methods better

21 August 2024 at 16:05

I was recently commissioned by a company to make a bespoke TLA+ workshop with a strong emphasis on reading specifications. I normally emphasize writing specs, so this one will need a different approach.

While working on it, I had an idea that might make teaching TLA+— and other formal methods— a little easier.

Pseudospecs

There are two problems to reading a spec:

  1. Learning the mental model of what TLA+ is actually doing
  2. Learning how to actually read TLA+.

The second problem is way, way bigger than it seems, because TLA+ has immense expressive power packed in a scary-looking syntax. Like take this spec:

---- MODULE foo ----
CONSTANT Servers
VARIABLE counters

Init ==
    counters = [s \in Servers |-> 0]

Zero(s) ==
    /\ counters[s] > 0
    /\ counters' = [counters EXCEPT ![s] = 0]

Inc(s) ==
    counters' = [counters EXCEPT ![s] = @ + 1]

Next ==
  \/ \E s \in Servers:
      \/ Zero(s)
      \/ Inc(s)

Spec == Init /\ [][Next]_counters
====

What's /\? What's ![s]? What's [][Next]_counters? To understand this spec, I have to understand how to read TLA+. This isn't an insurmountable obstacle, because otherwise nobody would know TLA+, but it could be the difference between "10 people learn TLA+" and "7 people learn TLA+".

My idea is to provide the spec along with a pseudospec, which could look like this:

Params {servers}

Initialize with {
   counters: Map(s in Servers: 0)
}

action Zero(s) {
    requires {
      counters[s] > 0
    }
    counters[s] := 0
}

action Inc(s) {
    counters[s] +:= 1
}

action Next {
  pick some s in Servers and { 
    either do Zero(s)
    or     do Inc(s)
  }
}

(Still working on what to do about the Spec operator.)

Couple things to notice. One: it's not a consistent translation between two languages, but between language and meaning. There's no formal grammar for the pseudospec. I need inconsistency for reasons I'll talk about later.

Two: the translation is not purely local. Things get moved around a little bit. Zero being an action affects how I translate \E.

Three: the translation is kind of metaphrastic. It's close to, but not exactly, a line-by-line translation. A person who has both files open in a split can correlate the spec and pseudospec.1

Most directly, this will help people understand what the spec is doing, which gives me a basis to teach model-checking, safety properties, etc. But I think this will also make learning the TLA+ syntax easier, by acting like an answer sheet. The student can read the TLA+ and, if they get stuck, look at the pseudospec for meaning.

My moonshot hope is that this also helps with writing specs, by giving people a clearer sense of the standard TLA+ idioms. I can translate the same syntax slightly differently depending on the way it's used, and whether it's a general construct or one for just the spec. That might be doing too much with this concept, though.

Problems I see

First is that there's lots of nuances that would be lost in the translation. In writing

Init ==
   /\ x \in 1..5

-->

Initialize with {
  x: pick some in 1..5
}

I lose the idea that Init isn't special, it's just a boolean operator like anything else. The student will have to unlearn that misconception if they ever need to do tricks like FastInit.

This is an essential tradeoff in pedagogy: is it better to teach the full picture now or the easy thing first and fix the misconception later? The teacher side of me knows that the white lie is better in the long run, but the expert side of me hates this.

Second, not every TLA+ symbol has a 1-1 mapping with English. I did the translation

\E s \in Servers:
  \/ Zero(s)
  \/ Inc(s)

-->

pick some s in Servers and { 
    either do Zero(s)
    or     do Inc(s)
}

That's because Zero and Inc are actions— things that update state. If they were regular operators, say P and Q, I'd translate it like this:

exists an s in S {
   P(s) || Q(s)
}

Depending on the context, I'll translate \E in two different ways.

This ties back to problem #1. In TLA+, these are the same thing. Performing an action is literally saying that it's a true description of the next-state relation. This has the consequence of looking like it updates the state relation, but that's only in certain (exceptionally common) circumstances. This is all incredibly elegant and intuitive to a TLA+ expert, but not for a beginner. So I have to introduce some inconsistencies.

Problem #3 is just how much syntax I'll need to translate. What do I do for [A -> B], WF_vars, or \AA? I can try to avoid them for the early specs but can't do that forever.

Overall, though, I think the potential benefit of easier learning will outweigh the drawbacks.

Will this work?

We'll find out!


  1. Experienced TLA+ users might notice that this is inspired by both PlusCal and Quint. I can't just teach those because the goal is to get people to use TLA+ proper. (Also we really need a word for "experienced TLA+ user", like "Pythonista" or "Rustacean") 

Texttools dot py

14 August 2024 at 17:19

I make a lot of personal software tools. One of these is "texttools.py", which is easiest to explain with an image:

A GUI with the first three lines of this newsletter in the top, and the bottom has the three lines with markdown quote prefixes

Paste text in the top box, choose a transform, output appears in the bottom box. I can already do most of these transformations in vim, or with one of the many online tools out there, but I prefer my script for two reasons:

  1. There's no context switching. I don't have to leave my current tab or vim buffer and worry about cleanup later. One hotkey opens the GUI, <esc> closes it, with no break in my workflow.
  2. It loads in <10 ms.

It took me like an hour to make and I use it all the time. And it's small enough that I can just share the whole script here.

How it works

Texttools is a python script running a tkinter GUI. I used tkinter because it's a builtin; I would generally not recommend it if you have any better options. On the plus side, being a builtin means you don't need to install a package to use this yourself.

import tkinter as tk
from tkinter import N, S, E, W, ttk


# Complex transforms go here,
# Simple transforms are just lambdas
def _wordcount(s: str):
    "Returns a tuple of linecount, wordcount, charcount"
    return (len(s.splitlines()), len(s.split()), len(s))

transforms = [
        # Transforms go here, for example
        {"name": "One line", "transform": lambda x: " ".join(x.splitlines())}
        ,{"name": "Line/Word/Char", "transform": _wordcount}
        ]

class GUI():
    def __init__(self) -> None:
        self.root = tk.Tk()
        self.active_transform = lambda x: x # start with no transform
        self.layout_gui()


    def layout_gui(self) -> None:
        self.mainframe = ttk.Frame(self.root, padding="3 3 12 12")

        self.root.title("Text Tools")
        self.mainframe.grid(column=0, row=0, sticky=(N, W, E, S))
        self.root.columnconfigure(0, weight=1)
        self.root.rowconfigure(0, weight=1)
        self.content_box = tk.Text(self.mainframe, undo=True)
        self.content_box.grid(column=2, row=2)

        self.output_box = tk.Text(self.mainframe, undo=True)
        self.output_box.grid(column=2, row=3)

        self.transform_box = tk.Listbox(self.mainframe)
        self.transform_box.grid(column=1, row=2, rowspan=2)
        for t in transforms:
            self.transform_box.insert(tk.END, t["name"])

        # Keyboard bindings
        self.root.bind("<Escape>", lambda _: self.root.quit())
        self.content_box.bind("<Tab>", lambda _: self.transform_box.focus())

        # vvv makes clicking or pressing enter change the transform
        self.transform_box.bind("<Button-1>", self.select_transform)
        self.transform_box.bind("<Return>", self.select_transform)

        # vvv makes anything typed in update the output
        self.content_box.bind("<Key>", 
                              lambda _: self.root.after(1, self.update))

        self.content_box.focus()

    def update(self):
        content = self.content_box.get('1.0', tk.END)
        content = self.active_transform(content) 
        self.output_box.delete('1.0', tk.END)
        self.output_box.insert('1.0', content)

    def select_transform(self, _):
        try:
            selection = self.transform_box.curselection()[0]
            self.active_transform = transforms[selection]["transform"]
            self.update()
        except (ValueError, IndexError):
            pass

def main(): # Entry point for pyproject.toml
    gui = GUI()
    gui.root.mainloop()

if __name__ == "__main__":
    main()

Man I forget how much I dislike tkinter until I have to look at it again. If you want to add your own text tools, just put a new item in the global transforms array.

To make it easier to run the script, I put it in a "toolkit" repo with this pyproject.toml:

[project]
name="toolkit"
version="0.0.1"
requires-python=">=3.8"

[project.gui-scripts]
toolkit-texttools = "toolkit.texttools:main"

Then running pip install -e . creates a toolkit-texttools binary (or in my case a .exe).

Finally, I wrote an AutoHotKey script so I could load it with a keyboard shortcut:

; NumpadPgDn and t both pressed at the same time
NumpadPgDn & t:: toggle_app("Text Tools", "toolkit-texttools.exe")

I like mapping stuff to the numpad because it's guaranteed to not interfere with any OS or program-specific hotkeys.

Short newsletter this week because I'm still recovering from jetlag. See you all next week!

Why I prefer rST to markdown

31 July 2024 at 15:34

I just published a new version of Logic for Programmers! v0.2 has epub support, content on constraint solving and formal specification, and more! Get it here.

This is my second book written with Sphinx, after the new Learn TLA+. Sphinx uses a peculiar markup called reStructured Text (rST), which has a steeper learning curve than markdown. I only switched to it after writing a couple of books in markdown and deciding I needed something better. So I want to talk about why rst was that something.1

Why rst is better

The most important difference between rst and markdown is that markdown is a lightweight representation of html, while rst is a midweight representation of an abstract documentation tree.

It's easiest to see this with a comparison. Here's how to make an image in markdown:

![alttext](example.jpg)

Technically, you don't even need a parser for this. You just need a regex to transform it into <img alt="alttext" src="example.jpg"/>. Most modern markdown engines do parse this into an intermediate representation, but the essence of markdown is that it's a lightweight html notation.
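
For illustration, here's the kind of regex that would do it (a toy sketch; real markdown engines are more careful than this):

import re

md = "![alttext](example.jpg)"
# Capture the alt text and the image path, then rewrite as an <img> tag
html = re.sub(r"!\[([^\]]*)\]\(([^)]+)\)", r'<img alt="\1" src="\2"/>', md)
print(html)  # <img alt="alttext" src="example.jpg"/>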

Now here's how to make an image in rst:

.. image:: example.jpg
  :alt: alttext

.. image:: defines the image "directive". When Sphinx reads it, it looks up the registered handler for the directive, finds ImageDirective, invokes ImageDirective.run, which returns an image_node, which is an object with an alt field containing "alttext". Once Sphinx has processed all nodes, it passes the whole doctree to the HTML Writer, which looks up the rendering function for image_node, which tells it to output an <img> tag.

Whew that's a mouthful. And for all that implementation complexity, we get… an interface that has 3x the boilerplate as markdown.

On the other hand, the markdown image is hardcoded as a special case in the parser, while the rst image is not. It was added in the exact same way as every other directive in rst: register a handler for the directive, have the handler output a specific kind of node, and then register a renderer for that node for each builder you want.

This means you can extend Sphinx with new text objects! Say that instead of an <img>, you want a <figure> with a <figcaption>. In basic markdown you have to manually insert the html, with Sphinx you can just register a new figure directive. You can even make your FigureDirective subclass ImageDirective and have it do most of the heavy lifting.
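
Here's a rough sketch of what that registration shape looks like in a Sphinx extension. Everything below (captioned_image_node, the directive name, the HTML it emits) is made up for illustration; it shows the handler/node/renderer pattern, not how the real figure directive is implemented.

# A sketch of registering a custom directive in a Sphinx extension.
from docutils import nodes
from docutils.parsers.rst import Directive

class captioned_image_node(nodes.General, nodes.Element):
    pass

class CaptionedImageDirective(Directive):
    required_arguments = 1  # the image path
    has_content = True      # the caption text

    def run(self):
        node = captioned_image_node()
        node["uri"] = self.arguments[0]
        node["caption"] = " ".join(self.content)
        return [node]

def visit_captioned_image_html(self, node):
    # Renderer for the HTML builder: wrap the image in <figure>/<figcaption>
    self.body.append(
        '<figure><img src="%s"/><figcaption>%s</figcaption>'
        % (node["uri"], node["caption"]))

def depart_captioned_image_html(self, node):
    self.body.append("</figure>")

def setup(app):
    app.add_node(captioned_image_node,
                 html=(visit_captioned_image_html, depart_captioned_image_html))
    app.add_directive("captionedimage", CaptionedImageDirective)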

The second benefit is more subtle: you can transform the doctree before rendering it. This is how Sphinx handles cross-referencing: if I put a foo anchor in one document and :ref:`image <foo>` in another, Sphinx will insert the right URL during postprocessing. The transformation code is also first-class with the rest of the build process: I can configure a transform to only apply when I'm outputting html, have it trigger in a certain stage of building, or even remove a builtin transform I don't want to run.

Now, most people may not need this kind of power! Markdown is ubiquitous because it's lightweight and portable, and rst is anything but. But I need that power.

One use case

Logic for Programmers is a math-adjacent book, and all good math books need exercises for the reader. It's easier to write an exercise if I can put it and the solution right next to each other in the document. But for readers, I want the solutions to show up in the back of the book. I also want to link the two together, and since I might want to eventually print the book, the pdfs should also include page references. Plus they need to be rendered in different ways for latex (pdf) output and epub output. Overall lots of moving parts.

To handle this I wrote my own exercise extension.

.. in chapter.rst
.. exercise:: Fizzbuzz
  :name: ex-fizzbuzz

  An exercise

.. solution:: ex-fizzbuzz

  A solution

.. in answers.rst

.. solutionlist::

How these nodes are processed depends on my compilation target. I like to debug in HTML, so for HTML it just renders the exercise and solution inline.

When generating epub and latex, though, things work a little differently. After generating the whole doctree, I run a transform that moves every solution node from its original location to under solutionlist. Then it attaches a reference node to every exercise, linking it to the new solution location, and vice versa. So it starts like this (using Sphinx's "pseudoxml" format):

-- chapter.rst
<exercise_node ids="ex-fizzbuzz">
  <title>
    Fizzbuzz
  <paragraph>
    An exercise
<solution_node ids="ex-fizzbuzz-sol">
  <paragraph>
    A solution

-- answers.rst
<solutionlist_node>

And it becomes this:

-- chapter.rst
<exercise_node ids="ex-fizzbuzz">
  <title>
    Fizzbuzz
  <paragraph>
    An exercise
    <exsol_ref_node refuri="/path/to/answers#ex-fizzbuzz-sol">
      Solution

-- answers.rst
<solutionlist_node>
  <solution_node ids="ex-fizzbuzz-sol">
    <paragraph>
      A solution
      <exsol_ref_node refuri="/path/to/chapter#ex-fizzbuzz">
        (back)

The Latex builder renders this by wrapping each exercise and solution in an answers environment, while the epub builder renders the solution as a popup footnote.2 Making this work:

An example of solution popups on an epub reader

It's a complex dance of operations, but it works enormously well. It even helps with creating a "free sample" subset of the book: the back of the free sample only includes the solutions from the included subset, not the whole book!
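
For a rough idea of what that transform machinery looks like, here's a sketch modeled loosely on how sphinx.ext.todo collects nodes across documents. The node names match the pseudoxml above, but everything else is illustrative; the real extension also attaches the exsol_ref_node back-references and only does the move for latex/epub builds.

from docutils import nodes

class solution_node(nodes.General, nodes.Element):
    pass

class solutionlist_node(nodes.General, nodes.Element):
    pass

def collect_solutions(app, doctree):
    # Runs on "doctree-read": stash a copy of every solution,
    # then pull the original out of its chapter.
    env = app.env
    if not hasattr(env, "all_solutions"):
        env.all_solutions = []
    for node in list(doctree.findall(solution_node)):
        env.all_solutions.append(node.deepcopy())
        node.parent.remove(node)

def populate_solutionlist(app, doctree, docname):
    # Runs on "doctree-resolved": fill in the back-of-book list.
    for node in doctree.findall(solutionlist_node):
        node.extend(app.env.all_solutions)

def setup(app):
    app.connect("doctree-read", collect_solutions)
    app.connect("doctree-resolved", populate_solutionlist)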

"But I hate the syntax"

When I gush about rST to other programmers, this is the objection I hear the most: it's ugly.

To which I say, are you really going to avoid using a good tool just because it makes you puke? Because looking at it makes your stomach churn? Because it offends every fiber of your being?

...Okay yeah that's actually a pretty good reason not to use it. I can't get into lisps for the same reason. I'm not going to begrudge anybody who avoids a tool because it's ugly.

Maybe you'd find asciidoc more aesthetically pleasing? Or MyST? Or Typst? Or Pollen? Or even pandoc-extended markdown? There are lots of solid document builders out there! My point isn't that sphinx/rst is exceptionally good for largescale documentation, it's that simple markdown is exceptionally bad. It doesn't have a uniform extension syntax or native support for pre-render transforms.

This is why a lot of markdown-based documentation generators kind of hack on their own preprocessing step to support new use-cases, which works for the most part (unless you're trying to do something really crazy). But they have to work around the markdown, not in it, which limits how powerful they can be. It also means that most programmer tooling can't understand it well. There's LSP and treesitter for markdown and rst but not for gitbook-markdown or md-markdown or leanpub-markdown.3

But if you find a builder that uses markdown and satisfies your needs, more power to you! I just want to expose people to the idea that doc builders can be a lot more powerful than they might otherwise expect.


No newsletter next week

I'll be in Hong Kong.

Update 2024-07-31

Okay since this is blowing up online I'm going to throw in a quick explanation of Logic for Programmers for all of the non-regulars here. I'm working on a book about how formal logic is useful in day-to-day software engineering. It starts with a basic rundown of the math and then goes into eight different applications, such as property testing, database constraints, and decision tables. It's still in the alpha stages but already 20k words and has a lot of useful content. You can find it here. Reader feedback highly appreciated!


  1. rst is actually independent of Sphinx, but everybody I know who writes rst writes it because they're using Sphinx, so I'll use the two interchangeably. Also typing rST is annoying so I'm typing rst. 

  2. This is why I attach exsol_ref_nodes and not default reference_nodes. Sphinx's epub translator uses an attribute passlist I need to work around in post-rendering. 

  3. This is also one place where rst's ugly syntax works in its favor. I've got a treesitter query that changes the body of todo directives and only todo directives, which is only possible because the rst syntax tree is much richer than the markdown syntax tree. 

Logic for Programmers Update

21 June 2024 at 18:02

I spent the early week recovering and the later week working on Logic for Programmers ([init] [update]) because I have a self-imposed deadline of mid-July, backed up by a $1000 toxx clause. Here's where I currently am:

1: The book is now 14k words. About 4k are "basics", covering propositional logic, predicate logic, and set theory. Another 1500 words or so are things like the introduction, appendixes, and preliminary exercises. The rest is "techniques". I'm hoping for 20k words for the pre-alpha first draft.

(I shared an early version of the "basics" on my patreon.)

2: Based on some feedback from Saul Pwanson, I've organized each "technique" as a three parter: first a common programming situation, then how logic applies, then something cool that knowing logic enables. For example, testing → specifications of functions → property testing. The other techniques are

  • Functional correctness → contracts
  • Data invariants → stronger database constraints
  • Case coverage → decision tables
  • Conditionals → simplifying conditionals

The last "technique" is an overview of tools that you need some knowledge of logic to start using in the first place: constraint solving, logic programming, formal verification, formal specification. I'm trying to decide if-and-how to cover Z3 and SAT solving, or if there's anything else I'm missing.

2.5: Two techniques I haven't written about but think could be good material: state invariants, modeling requirements. Not sure about the best approaches to either. I'm also looking for other possible applications of logic in general. The key is that it can't just be something that logic helps you appreciate, logic has to help you do it better.

3: Beyond adding those new sections, I want to have more details of technique, more examples, and some exercises. I don't know the right proportion of them yet.

I wrote a special exercise directive: for html, it puts the solution in a dropdown, while for pdfs it puts it in the back of the book. I really like how Sphinx supports these kinds of extensions, though I wish the ecosystem was a little more robust (or the docs better).

4: I think that when the alpha version is out in July, the book is about 40% done. This is because I suspect that reader feedback will lead to a lot of revisions and improvements to the content, as well as ideas for new technique chapters. After that's incorporated, it's 80% done, and that's when I'll figure out nice typesetting and a good cover image.

5: There's now a leanpub page. I wanted to figure out the logistics of distribution early. Please do not buy it yet.

Have a great weekend!

Nondeterminism in Formal Specification

11 June 2024 at 14:30

Just an unordered collections of thoughts on this.

In programming languages, nondeterminism tends to come from randomness, concurrency, or external forces (like user input or other systems). In specification languages, we also have nondeterminism as a means of abstraction. Say we're checking if a user submitted the correct password. A software implementation might look something like this:

# pythonish
def attempt_login(user, pw):
    submitted_hash = hash_password(pw)
    actual_hash = db.get_password(uid=user.id)
    if submitted_hash == actual_hash:
        login()
    else:
        reject()

There's lots of implementation details in confirming the correct password: how we hash it, how we determine the right hashing mechanism, where we retrieve the actual password from, what query we use to retrieve it, etc. At a higher level of design, though, we might not care about any of that: we only want to model what happens if the password is correct, not how we determine correctness. So we can use nondeterminism to handwave that all away:

\* TLA+
AttemptLogin ==
    \/ Login \* \/ means `either`
    \/ RejectLogin

When the model checker tries to evaluate AttemptLogin, it will branch and create separate checking executions for picking Login or RejectLogin. This gives us the consequences of "user submits the wrong password" without having to write a bunch of spec figuring out what the wrong password is.

Mastering abstraction-nondeterminism is key to writing good specifications. It's why you can model a complex, detailed system with a much smaller specification and find bugs that map back to actual possibilities.

Two challenges of nondeterminism

  1. Nondeterminism is incredibly capable, which makes it not very tractable (ibid). It's hard to express properties on subsets of the model-space, like "If a user never gets rejected, then property XYZ always holds."
  2. Abstraction-nondeterminism pushes the spec further from the real-world implementation, so it's harder to see that your implementation matches it.

These tend to be bigger problems if the nondeterminism is deep in your design, kinda like functional core, imperative shell. We want to push it closer to the edges:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    \E pwd \in {GOOD, BAD}:
        AttemptLogin(pwd)

It's still nondeterministic, but it happens earlier, so is easier to control. We could for example swap {GOOD, BAD} with {GOOD} to guarantee that login is always successful.

In the extreme case, each component of the machine is locally deterministic, even if the components are concurrent. This is closest to something implementable. The world (ibid) can continue to be nondeterministic.

Initial states and prophecy variables

Specification languages can also have multiple initial states. We can use this to push nondeterminism into the spec setup by adding variables that encode what events will happen. For example, we could initialize the variable attempts_needed as an integer in 0..3, then write AttemptLogin like this:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    IF attempts_needed > 0 THEN
        /\ AttemptLogin(BAD)
        /\ attempts_needed' = attempts_needed - 1
    ELSE
        /\ AttemptLogin(GOOD)
        /\ UNCHANGED attempts_needed

Now, for each initial state, the spec is wholly deterministic, and thus tractable. We can easily distinguish properties that hold if the user never sends the wrong password (attempts_needed = 0) vs if they get it wrong a few times.1 We can also use this to check only a subset of the specification at a time, which can help with checking very large state spaces.

Leslie Lamport (the inventor of TLA+) calls these kinds of helpers prophecy variables, because they "predict" what nondeterministic action the system will take. You can't always use prophecy variables to completely remove nondeterminism from your specification, but it can help!

Refinement and nondeterminism

One of the main use cases of refinement— proving one specification implements another— is to remove abstraction nondeterminism from a system. I really need to add something about this to learntla.

An Exercise

I think this might be a good exercise to practice writing nondeterminism:

In the game Get Fifteen, players take turns picking numbers 1..9. Players can pick each number only once and cannot pick numbers their opponent picked. Winner is the first player to have three picked numbers that add to 15. This is identical to playing TicTacToe, but is easier to formally specify (no grid).2

Model a player playing tictactoe against an AI. Either can start first. The player needs to follow a deterministic strategy you implement. The AI just nondeterministically picks any available number. The property to check is "the player never loses".

(I haven't actually modeled this, but if people give it an honest shot and run into problems I'd be happy to give the specs a look!)


Plan for June: Logic for Programmers

I've decided to make a serious attempt to write an acceptable first draft of Logic for Programmers by July 15. If I don't have something I can readily turn into a Leanpub early-access by then, I will give $1k to a random newsletter subscriber (or donate it to a charity of their choice). I reserve the right to push the deadline if I get a new consulting contract.


  1. one thing we've lost is that the user can no longer fail infinitely many times; they're always guaranteed to log in eventually. This is fixable in a few ways. 

  2. If you're not seeing it, try overlaying a tictactoe board on top of a magic square

I've been thinking about tradeoffs all wrong

5 June 2024 at 15:14

Sup nerds, I'm back from DDD Europe! Still processing the conference and adjusting to the horrible jetlag but also itching to get writing again, so let's go.

The opening keynote to the conference was called "Modern Tradeoff Analysis for Software Architecture". It was mostly about the factors that went into whether you should split up services into microservices and glue them together. Unsurprisingly "tradeoff" came up a whole lot: between throughput and latency, between simple and distributed transactions, between stability and volatility, etc. At the end he gave the example of comparing a queue-based architecture (SQS) to a pub-sub topic-based architecture (SNS).1 He said that the queues made it harder to add new services, but the topics made it harder to do access-control.2

That's when I realized I've been thinking about tradeoffs all wrong!

See, normally I think of tradeoffs as framing two positive things, like saying "SQS has better control while SNS is easier to add new services". Instead the speaker framed it as two wholly negative things, and the tradeoff is which negative thing you want less of. We can do this with any tradeoff, because any positive quality about X can be turned into a negative quality of Y. Instead of "X is faster, Y uses less space", we can say "X uses more space, Y is slower."

I don't know if this framing was intentional on the speaker's part, but I like it so much more. What's the difference? Even if you pick the least-bothersome drawback, you still have a drawback. Going back to the SQS vs SNS example, let's say you hate how hard adding new services to SQS is, so you switch to SNS. Then you realize it's hard to do access-control, and this is a worse problem than the SQS problem was. So the right tradeoff is to go back to using SQS.

But now it's hard to add new services again. And that sucks! You don't get to stop thinking about the pros and cons just because you made a decision. There's still a problem you have to deal with.

Two rules of thumb

Rule one: after I make a decision about a tradeoff, I need a plan to address the negative consequences of my choice, in a way that doesn't also add in the negative consequences of the other choice.

Rule two: this explains a lot of the movement (or more cynically, churn) in software tech. Even when we make the right choice overall, in the moment we're still under a lot of pain, and so we look for new tools/technologies/practices that address the pain we currently feel.

Bonus rule: part of the reason we do Software Mimicry is to reduce the pain of a choice we made.


The conference talk was my last big project for spring. My consulting calendar is free so I'll try to devote some time to my personal projects in "progress" (learntla updates, alloy workshop, logic book). It's a little tough without a hard deadline!

No "things I'm curious about" because I'm still v. tired


  1. He used SQS and SNS because they're the names of the AWS services. I'm going to use them as shorthands for the architectures because SQS is easier to write than "queue-based architecture" 

  2. I don't know if this actually true, but I'll go with it because that's how he framed it. 

NoCode Will Not Bring Computing to the Masses

21 May 2024 at 17:54

No Newsletter next week

I'll be giving my conference talk at DDD Europe.

NoCode Will Not Bring Computing to the Masses

I don't have a whole lot of time this week so here's something that's been on my mind a little. I haven't researched any of the factual claims; consider this more some incoherent early-stage theory.

Define a "NoCode" tool as "any tool made to help people without any programming experience do things that normally require programming." Most software engineers have a very dim view of NoCode, mostly because the primary way SEs interact with it is to clean up the messes left by overly ambitious nocoders. I don't think it'd be controversial to say there's a wide variety of problems nonprogrammers should not try to solve with NoCode tooling.

On the other side of the spectrum, there are a lot of problems that are trivial to do if you can program, and impossible without it. The canonical example I use is "upcasing a string"; others are "find strings of digits in a document", "figure out each student's average score across several tests", etc. If you don't know how to program, you either have to find someone else's program that does this thing or do it manually. This is where I think NoCode has real potential to help the average computer user: it can let them do simple programming tasks without having to learn all of programming first.

The goal, then, is to figure out what defines the space of "trivial for programmers, impossible for regulars". I think a good first metric would be "equivalent lines of python". If something can be done in L=10 lines of not-very-clever Python then it should hypothetically be nocodable. If it takes H=50 lines, it definitely should not be. In between is a grey area. These numbers are just off the top of my head; finding the real values of L and H would require a lot of research and user examples.1
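
For a sense of scale, here's roughly what those canonical examples look like in not-very-clever Python (the document string and score data are made up):

import re
import statistics

# Upcase a string
print("hello world".upper())

# Find strings of digits in a document
document = "Bought 3 apples and 12 oranges for 5 dollars."
print(re.findall(r"\d+", document))  # ['3', '12', '5']

# Each student's average score across several tests
scores = {"ada": [90, 85, 99], "bob": [70, 80, 75]}
for student, tests in scores.items():
    print(student, statistics.mean(tests))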

NoCode adapted to this space probably wouldn't be that profitable, but it would do the most good for the most number of people.

"The most good" will do the least good

I wanted to talk about some of the consequences for NoCode design, but couldn't figure out where to start, so did the wise thing and took a walk. On the walk I realized that no matter how much a tool empowers the user, almost nobody will use it.

Proof: how many programmers have iPhones? How many of them use Shortcuts? I mean, I have Tasker on Android, I know how to use it and have used it in the past, and still I regularly don't use it.

It's not enough for a tool to solve your problem for you to use that tool. You have to also A) recognize the problem you have is solvable, B) recognize the tool can solve it, and C) be reasonably confident that you personally can solve the problem with the tool. You need knowledge, skill, and the right mentality. Even programmers, who already have that mentality, don't automatically generalize it to all applicable domains. Otherwise we'd be doing a lot more with our phones.

It's unreasonable to expect the average person will use any NoCode to solve their problems, regardless of how easy the tools are. This leaves the people for whom it provides economic value, which is why all modern NoCode tools are oriented towards business.

Do people want grey markets?

There's two exceptions to this rule I can think of.2 The first is Excel/GSheets, which we do see regular people pick up. I'd venture this is because "very simple data organization" and "semi-complicated math stuff" are common enough things that enough people need. Maybe also they have gentle on-ramps: you can solve basic problems while knowing very little. Finally, there could be a social aspect; everybody already knows about spreadsheets, so it's not as big a jump to start using them.

The other major exception is that I've seen a lot of people who do use tools like Shortcuts and Tasker… as consumers of other people's published automations. These tend to look more like "scripts" than "applications", suggesting that our official marketplaces are designed to deliver applications, but people also need scripts.3 The NoCode gives them a "grey market" they can use to install scripts without using the official heavyweight channels.

One negative consequence of this is that any widely-used NoCode also becomes a potent vector for malware, because the people who are running scripts are ill-equipped to audit them.

Ok, I checked the code and it said (and please bare with me here):

-If flibbity is frabjous, zibble wobble at 42% and blippity blop playlist; otherwise, schnoofle Grobber Mode.

SOMEBODY HELP — source


Trying to pull all of my thoughts into something coherent:

  1. People need scripts.
  2. NoCode can make scripting much easier.
  3. Even if NoCode makes scripting easier, people won't actually write scripts.
  4. What people want is to be easily able to find and run someone else's script like they do an application.
  5. NoCode could make finding and running other people's scripts easier, too.
  6. Doing so is potentially very dangerous, as people aren't able to evaluate scripts for safety.

  1. Have you ever noticed that High > Low, but H in vim moves you backwards and L moves you forward? Deep, man. 

  2. IFTTT/Zapier might be a third exception: people disconnecting disparate services with a NoCode tool. But I don't have enough firsthand knowledge to be able to say anything about them, and I don't know how popular they actually are. 

  3. The difference between "scripts" and "applications" seems very important to all this. See Why bother with Scripting? for a good discussion on this. 

What I look for in empirical software papers

16 May 2024 at 16:58

Behind on the talk and still reading a lot of research papers on empirical software engineering (ESE). Evaluating a paper studying software engineers is a slightly different skill than evaluating other kinds of CS papers, which feel a little closer to hard sciences or mathematics. So even people used to grinding through type theory or AI papers can find ESE studies offputting.

So here are some of my thoughts on how I approach it. This is mostly unedited and unorganized, just getting my thoughts down.

The high-level view

The goal of ESE is to understand how we develop software better in the real world. Compared to "regular" computer science, ESE has two unique problems:

  1. As with all kinds of knowledge work, software development is highly varied and lots of factors affect the final outcome.
  2. Software engineers develop new tools/processes/paradigms incredibly quickly and there's maybe a 1000:1 ratio of developers to researchers.

This makes it hard to say anything for certain with a high level of confidence. Instead of thinking of whether papers are correct or incorrect, I look for things that make me trust it more or less. I'll use this interchangeably with "like/dislike": "I like this study" is the same as "I trust it more".

Quantitative Metrics

I like metrics that are clear, unambiguous, and sensible to practitioners. "Quality" and "defects" are too ambiguous, "coupling" and "cohesion" aren't sensible. "Number of support tickets filed" and "distribution of model code lengths" are better.

In practice, clear and unambiguous metrics end up being more narrowly-scoped. I trust research on narrow metrics more than research on broad metrics, even if the former are less applicable to software at large.

Clinical Trials

I don't like "clinical trials", where developers are split into control and experiment groups and asked to complete a task. The problem is sample size: actual clinical trials in medicine are run on thousands of people at a time and cost millions of dollars. Most software clinical trials study less than 50 engineers at a time. With such a small sample size, there's just too much room for confounding factors.

It doesn't help that most studies are done on university students, which may not generalize to more experienced professionals.

There's a couple of exceptions, though. I'm more inclined to trust clinical trials on how to teach CS, as students are the target population anyway. And I pay more attention to clinical trials that find enormous effects. For example, Need for Sleep finds that a night of sleep deprivation reduced the trial group's coding quality by fifty percent.1 That's not the kind of effect you can write off as a confounding factor!

(I'd still like to see it replicated, though. Also, does mild sleep deprivation affect coding skill?)

Other Quantitative Research

Scraping GitHub for data is fine, as long as the researchers are careful to audit the data. Don't publish a groundbreaking paper claiming a correlation between language and program quality but accidentally count each bitcoin bug as 17 C++ bugs.

Developer surveys are good for measuring things like developer perceptions, emotional state, how widespread a technology is, etc. I don't think they're good at measuring the effectiveness of practices or determining if those opinions match reality. I.e., if a survey showed that 90% of developers think that their Java code is buggier than their Kotlin code, this tells us that developers perceive Kotlin as higher quality, not that it actually is!

Qualitative Studies

I like qualitative studies, things that aren't gathering hard metrics. If a quantitative study of correctness in imperative vs logic programming would be "which has fewer bugs" or "what kinds of bugs does each type have", qualitative would be more like "how do imperative and logic programmers debug stuff differently" or "let's dissect this one unusual bug and see if we learn anything interesting."

IME science fields undervalue the impact of qualitative research. I find a lot less of it than quantitative papers, but I tend to trust the qualitative papers more. Maybe they're just overall higher quality; more likely it's because they're more narrowly scoped. And I believe (without evidence) that a field needs a strong foundation of qualitative knowledge to do stellar quantitative work.

Some papers do both qualitative and quantitative work. The researchers in Abbreviated vs. Full-word Identifier Names started with a clinical trial, and then shadowed individual developers to see how they navigated a code file. I trust the quantitative sections of hybrid papers a little more than just-quantitative papers, but it's not an enormous factor.

"Quantitative" and "qualitative" aren't words anymore.

Reading the Paper

I use the same basic approach with all CS papers: read the abstract and intro, read the conclusion and post-results discussion, then decide if I care enough to look further. If I like the results, I read the methodology more carefully. I skim background info and "related work" mostly to see if they make any claims I'd want to read up on.

Sometimes I'll use an AI to generate a paper summary. I only do this if I'm going to read the full paper; the summary is just to help me digest it.

I trust papers that share their data.

Threats to validity

Most software papers have a "threats to validity" section, where they talk about possible confounding factors. In practice it's mostly a way for the writers to show they covered their bases, by explaining why said confounding factors aren't a problem. I like papers which include TtVs they can't dismiss, or discuss unusual-but-important TtVs. I don't like papers which have bad explanations for why TtVs aren't a problem, or that miss really obvious TtVs.

Example of doing it wrong: Quantifying Detectable Bugs in JavaScript claims that typescript catches 15% of "public bugs for public projects on GitHub". Their threats to validity were "we only looked at public bugs" and "we may have been too inexperienced in typescript to catch even more bugs." One TtV they do not list is that they uniformly sampled all of GitHub for bugs and didn't filter the projects at all. When I checked some of their projects, a bunch were simple lesson repos by novices in code bootcamps.2 Why didn't they mention this TtV?

Example of doing it right: oh God it's already Thursday and I haven't gotten this newsletter out yet

References

For many papers, the most useful part is the reference section, because odds are I'll find a more interesting paper there. Even if none of the references look interesting, I'll still check a few to make sure the original paper is using them correctly. I don't like it when a paper says reference R makes claim C about topic T, and then I read R and it never even mentions T.

I wish journals let papers have two reference sections, one for the really essential references and one for the incidental ones, like background information or related work.

...Okay, that's all I got this week. Back to the talk mines.


Things I'm curious about

  • DORA Report: What's the state of research about the DORA report? Has anybody been able to replicate their findings? Has anyone evaluated them critically?
  • Puzzle Games: Puzzle games like flow free have thousands of levels. Are these programmatically generated? If so, how is it done, and how do they control the difficulty?

  1. Measured as "number of tests passed", from a researcher-provided test suite for the task. 

  2. I checked this back in 2019 or so, but it seems like the data is now offline. Props for making the data available! 

Paradigms succeed when you can strip them for parts

7 May 2024 at 16:48

I'm speaking at DDD Europe about Empirical Software Engineering!1 I have complicated thoughts about ESE and foolishly decided to update my talk to cover studies on DDD, so I'm going to be spending a lot of time doing research. Newsletters for the next few weeks may be light.

The other day I was catching myself up on the recent ABC conjecture drama (if you know you know) and got reminded of this old comment:

The methods in the papers in question can be used relatively quickly to obtain new non-trivial results of interest (or even a new proof of an existing non-trivial result) in an existing field. In the case of Perelman’s work, already by the fifth page of the first paper Perelman had a novel interpretation of Ricci flow as a gradient flow which looked very promising, and by the seventh page he had used this interpretation to establish a “no breathers” theorem for the Ricci flow that, while being far short of what was needed to finish off the Poincare conjecture, was already a new and interesting result, and I think was one of the reasons why experts in the field were immediately convinced that there was lots of good stuff in these papers. — Terence Tao

"Perelman's work" was proving the Poincaré conjecture. His sketch was 68 pages of extremely dense math. Tao's heuristic told him it was worth working through is that it had interesting ideas that were useful to other mathematicians, even if the final paper was potentially wrong.

I use this idea a lot in thinking about software: the most interesting paradigms are ones you can "scavenge" ideas from. Even if the whole paradigm isn't useful to you, you can pull out some isolated ideas, throw away the rest, and still benefit from it. These paradigms are the ones that are most likely to spread, as opposed to paradigms that require total buyin.

Let's take as an example functional programming (FP). ML/Haskell-style FP has a lot of interconnected ideas: higher-order functions, pattern matching, algebraic data types, immutable data structures, etc. But if you throw all of that away and stick to writing everything in idiomatic Java/Python/C, and the only change you make is writing more pure functions, then you've already won.
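
For example (a hypothetical sketch with made-up names, not from any particular codebase), here's what that one scavenged idea looks like in ordinary Python:

# Impure: reads and mutates shared state, so it's hard to test in isolation.
cart = []

def add_to_cart_impure(item, tax_rate):
    cart.append(item["price"] * (1 + tax_rate))

# Pure: same inputs always give the same output, nothing is mutated.
def price_with_tax(item, tax_rate):
    return item["price"] * (1 + tax_rate)

def add_to_cart(cart, item, tax_rate):
    # Returns a new cart instead of modifying the old one.
    return cart + [price_with_tax(item, tax_rate)]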

This has several positive effects on FP adoption. It immediately demonstrates that FP has good ideas and is not just some weird cult practice. It also means people can "dip their toes in" and benefit without having to go all in. Then they can judge if they're satisfied or want to go further into learning FP. And if they do go further, they can learn new ideas gradually in stages.

By contrast, compare array programming languages (APLs). There's a lot of cool ideas in APL, but they're so interconnected it's hard to scavenge anything out. If I want to use multidimensional arrays in Python I can't just whip them up myself, I have to import an external library like numpy and learn a whole bunch of new ideas at once. So most of the ideas in APL stay in APLs.3

A related idea is that techniques are only adopted if you can "half-ass" them. TDD is useful even if you only use it sometimes and phone in the "green-refactor" steps, which helped adoption. Cleanroom depends on everybody doing it properly all the time, which hindered adoption.

Scavenging from formal methods

Part of the reason I'm thinking about this is that business has been real slow lately and I'm thinking about how I can make formal methods (FM) more popular. What ideas can we scavenge from writing specifications?

Obvious ones are predicate logic and the implication operator, but they're both poorly supported by existing programming languages, and I'm not sure how to best pitch them.2 Also, they're "too big", and it takes a lot of practice to wrap your mind around using quantifiers and implication.

Sets, definitely. Sets are unordered, unique collections of objects. Lots of things we store as lists could be sets instead. I use sets way more after learning some FM.
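
A tiny hypothetical example: tracking which users have dismissed a banner. The list version needs a manual uniqueness guard; the set version says what you mean:

dismissed_list = []     # list: duplicates possible, membership check is O(n)
dismissed_set = set()   # set: unique by definition, membership check is O(1)

def dismiss(user_id):
    if user_id not in dismissed_list:   # easy guard to forget
        dismissed_list.append(user_id)
    dismissed_set.add(user_id)          # idempotent, no guard needed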

There's also a lot of high-level ideas in FM that usefully apply to all other code. System invariants are things that should always hold in every possible state. More broadly, we can categorize properties: "safety" is roughly "bad things don't happen", "reachability" is "good things always could happen", and "liveness" is "good things always do happen". Fairness, maybe. Refinement is definitely too in-the-weeds to be scavengable.

I'd also call out decision tables as a particular technique that's extractable. You can teach decision tables in five minutes to a completely nontechnical person.
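
Here's a sketch of what I mean, with a made-up discount policy: the table enumerates every combination of conditions, and each row doubles as a test case.

# (is_member, order_over_100) -> discount rate
DISCOUNTS = {
    (True,  True):  0.15,
    (True,  False): 0.05,
    (False, True):  0.10,
    (False, False): 0.00,
}

def discount(is_member, order_over_100):
    return DISCOUNTS[(is_member, order_over_100)]

# Every row of the decision table becomes one assertion.
for (member, big_order), expected in DISCOUNTS.items():
    assert discount(member, big_order) == expected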

As I said, short newsletters for the rest of this month. Have a great week!


Things I'm curious about

  • Logic Programming: What ideas are scavengable from LP? I know that pattern matching originated in LP and found its way to FP from there, but that's associated with FP more now.
  • Nontechnical People: What's a good word for them? "Nondev" seems like it leaves out data scientists, DBAs, etc. "Normie" and "Muggle" are too offensive. "Layfolk" just sounds weird.
  • TLA+ workshops: Would people be interested in an "intermediate" TLA+ workshop, for people who've already written production specs but want to get better? I'm thinking half a day, 6-8 people, with people bringing up what they want to learn about in advance. If there's enough people who'd want this I can plan something out.

  1. Domain-Driven Design. 

  2. This is one big reason my attempt to write a book on predicate logic has been in a long stall. 

  3. One idea that mathematicians scavenged from APLs is the Iverson bracket: [P] = P ? 1 : 0

"Integration tests" are just vibes

1 May 2024 at 15:52

New blog post! Software Friction is about how all the small issues we run into developing software cause us to miss deadlines, and how we can address some of them. Patreon here.

"Integration tests" are just vibes

You should write more unit tests than integration tests. You should write more integration tests than unit tests. "Integrated tests" are a scam.

But what exactly is an integration test? Is it tests that cross module boundaries? Is it tests that depend on several pieces of behavior, even in one module? Is it anything that tests "emergence"1? Anything with side effects? If get_active_users() runs a database query, are all tests against it integration tests?

No wonder both Twitter and Google have (at one point) claimed that "integration test" wasn't a useful term. There are just too many different things people mean by "integration".2

Even if we can't define integration tests, we can still recognize them. There are tests that are bigger than our "unit tests" and smaller than an end-to-end test, and people have strong opinions about writing more or less of them. We can have a sense of something being an "integration test" even if we can't strictly define it.

I think it's more useful to think of integration tests in terms of family resemblances. There are certain measurable properties of tests, and integration tests measure differently than unit tests on those properties. One quality is test speed: unit tests are "fast", integration tests are "slow". This does not mean all integration tests are slower than every unit test (unit property tests can be very slow). But slow tests are more likely to be integration tests than unit tests.

"Integration tests" are a vibe we get based on how we perceive a test's properties. Some of these properties are:

  1. The number of lines of production code executed as part of the test.

  2. Number of boundaries crossed, module/object or otherwise.

  3. Number of unit tests you can decompose the test into.

  4. Number of external services required to run the test.

  5. Amount of engineering effort to write/run the tests. Does it need separate files, a separate language, setting up lots of docker instances?

  6. Nonlocality of discovered bugs: if the test fails, how many lines you have to look at to find the problem.

  7. Likelihood a test failure is due to "emergence": things are correct in isolation but interact in a problematic way.

The more of these have high values, the more likely a test is to be an integration test.

Miscellaneous Thoughts

A test that hits the database (external service) can still be a unit test if it runs quickly and covers a small amount of code. Tests that make HTTP calls are less likely to be fast, so more likely to be integration tests.

Mocks and pact-style testing push some of these values down, by reducing boundaries crossed and lines of code executed at the cost of increased test infrastructure. Contracts and assertions increase bug locality, since the test fails the moment a pre/postcondition is broken instead of when the test completes.

If you push the values high enough, do you get end-to-end tests instead of integration tests? Or are E2E tests fundamentally different from integration tests in some way?

Is it possible for a unit test to uncover emergence, or is emergence a sufficient condition to say that something is definitely an integration test?


Things I'm curious about

Gonna make this a weekly thing maybe (or whenever I have enough new stuff to ask about).

  • Sokoban. Sokoban is "easier" than Chess or Go (PSPACE vs EXPTIME), but Sokoban AIs regularly fail at simple puzzles that humans can do (ref ex1 ex2). What makes Sokoban "so much harder" for AIs? I don't think it's just a lack of interest because I've found Google and Deepmind papers on solving sokoban (and not making much progress).
  • NFCs: I impulse-bought a bunch of NFC chips. I'm already using them to open train trackers, what else can I do with them? I'm not a mobile dev but can whip up Taskers and server code pretty quickly.

If you know something about either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Geeze I can't believe I tried to make "automanual" a thing 

  2. "Unit test" is also fuzzy but we at least agree that add(2, 3) == 5 is a unit test. 

"Testing can show the presence of bugs but not the absence"

23 April 2024 at 20:19

Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger Dijkstra, Notes on Structured Programming

Dijkstra was famous for his spicy quips; he'd feel right at home on tech social media. He said things he knew weren't absolutely true but would get people listening to him. In the same essay he discusses how testing could, in specific cases, show the absence of bugs:

  • If a function has a finite set of inputs, like floor(float), you could prove correctness by simply testing all possible inputs (see the sketch after this list).
  • Even if a function has infinite inputs, tests prove the absence of bugs for tested inputs. You could say "date_of_easter works for any date within 10,000 years of today" and have "proven the absence of bugs" for the subset of dates people care about.1
  • You can use testing to complete a proof. For example, you can divide inputs into "equivalence classes", prove that if the function is correct for one value it will also be correct for all values in that class, and then test one value from each class.
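
Here's a minimal sketch of that first case, using a made-up parity function over single bytes instead of floor(float) so it runs in a blink:

def parity(byte):
    # 1 if the byte has an odd number of set bits, else 0
    return bin(byte).count("1") % 2

# The input domain is finite (256 values), so the test suite really does
# prove the absence of bugs, relative to the reference implementation.
for b in range(256):
    expected = sum((b >> i) & 1 for i in range(8)) % 2
    assert parity(b) == expected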

In the majority case, however, testing cannot show the absence of bugs. Testing f(16) does not give you any guarantees about all x. f(x).

There are two possible takeaways. The reasonable one is that we shouldn't be blindly confident in testing, should use a variety of approaches as part of defense in depth, and always be careful about what we claim. The unreasonable takeaway is that testing is an inferior verification method, and we should use more "rigorous" approaches instead of testing.

It's unreasonable because methods that show the existence of bugs are more robust than methods that show the absence of them, and this makes them more useful more of the time.

Comparing tests and proof

Let's say you have a pseudocode function like this:

fun elem_before_x(s: int[], x: int) returns (r: int | NULL) {
    let i: int = index(s, x);
    if (i == NULL) {
        return NULL;
    } else return s[i-1];
}

Property:
 if r == NULL then x notin s
 else some i in 0..<len(s): s[i] == r && s[i+1] == x

The bug here is that we're out-of-bounds if x is the first element of the array. Testing will find this bug if you remember to test that case. Otherwise, we'd have a bug and a passing test suite. Tests can't verify the absence of bugs.
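
Here's a rough Python sketch of the same thing, plus a test suite that misses the bad case. (One wrinkle: Python's negative indexing turns the out-of-bounds crash into a silently wrong answer, but the tests pass either way.)

def elem_before_x(s, x):
    if x not in s:
        return None
    return s[s.index(x) - 1]   # bug: when x is s[0] this "wraps" to the last element

# All green, and yet the bug is still there.
assert elem_before_x([1, 2, 3], 3) == 2
assert elem_before_x([1, 2, 3], 2) == 1
assert elem_before_x([5, 6], 99) is None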

Now imagine we do find it and fix it:

else if (i == 0) {
        return s.last;
    } 

Modifying the property is left as an exercise to the reader. Once again, the passing test suite means we haven't verified the absence of the bug. Now, how does formal verification do on the same problem? I wrote a Dafny version of the same problem, here, both with and without the fix. Looking at v1, we see it finds the s[0] error:

>   var v := index.Extract();
>   return Some(s[v-1]);

(19,16): Error: index out of range    
   |
19 |     return Some(s[v-1]);

This error also goes away in v2: the bug is proved to be absent. But v2 adds a new error:

> ensures if r.None? then x !in s
>           else if s[|s|-1] == r.value then s[0] == x
>           else s[IndexOf(s, r.value)+1] == x

(7,20): Error: index out of range
  |
7 |             else if s[|s|-1] == r.value then s[0] == x
  |                      ^^^^^^^

When would s[|s|-1] be an index out of range? When the list is empty. But we know that's not possible: if we return a non-null value then the list must not be empty, because we found x in it! But the prover doesn't know that, it can't infer the relationship between the returned value and what we know about the list.2 We have to annotate our code with proof hints for it to recognize that if we return a number, the list is nonempty.

In short: the prover definitely finds the out-of-bounds bug, but also raises a spurious "empty list" bug. The test might not find the former but also won't raise the latter. Tests can have false negatives but not false positives, while provers can have false positives but not false negatives.

(I'd say "formal verification can prove the absence of a bug but never the presence", but that's not true. Dafny can sometimes give you a concrete counterexample to a property, and type systems detect true positives like "you passed an array into an Int -> Int function" all the time.)

Knowing there "might" be a bug is not as useful as knowing there is one. Empirically speaking people mostly ignore compiler warnings and plenty of retrospectives find people ignore "might possibly be a bug" in even mission-critical software. Formal verification often becomes an "all or nothing" deal, either you totally prove the absence of bugs in programs or you get no benefits at all.

And that sucks! It takes a lot of effort to prove something correct, magnitudes more than running some tests. And that's why I consider testing "more robust": even a little bit of testing may find a true bug, while a little bit of verification doesn't get you much.

Correctness is not a thing

I'll end this by getting on my soapbox. Here's something that's really important to understand about "proving the absence of bugs":

You do not prove something's correct. You prove conformance to specification.

It's possible for code to conform to the specification and still have bugs. Two common cases are:

  1. The spec is wrong or incomplete, like v2.dfy is in the above example. Spoiler in the footnote.3
  2. The spec is based on an assumption that's not true in practice, like "no other program is writing to the same file."

This is also important because it puts the specification front-and-center. You can write tests without having a specification in mind, but you need a spec to do any kind of formal method. Doing FM trains you to think about specs, which in turn helps you write better tests.


Things I am interested in

  1. Crossovers from fields besides classical engineering, who worked professionally as something else before switching to software. What ideas from your old job as a journalist/librarian/social worker/acrobat could be useful in software, and vice versa?
  2. Wargaming in software. Have you designed or participated in a devops/security/system design/whatever simulated game? Was it helpful?

If you have insights into either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Though you could argue this isn't sufficient for security. If easter(-1e8) causes a memory overflow then that's a bug even if a normal user won't ever call it. 

  2. We're lucky that the Dafny standard library defines a postcondition for index functions saying the output is less than the length of the list. Otherwise the prover would raise a potential out-of-bounds error on s[IndexOf(s, x)]

  3. s[|s|-1] == r.value then s[0] == x isn't true for the input [a, r, x, r]. The good news is that Dafny refuses to verify the code because it didn't match the spec. I only realized this when I was error-checking this piece. (This doesn't refute the point I made above about false positives: the false positive was that s[|s|-1] was out of bounds!) 

What makes concurrency so hard?

16 April 2024 at 20:28

A lot of my formal specification projects involve concurrent or distributed system. That's in the sweet spot of "difficult to get right" and "severe costs to getting it wrong" that leads to people spending time and money on writing specifications. Given its relevance to my job, I spend an awful lot of time thinking about the nature of concurrency.

As the old joke goes, concurrency is one of the two hardest things in computer science. There are lots of "accidental" reasons why: it's hard to test, it's not composable, bugs can stay latent for a long time, etc. Is there anything that makes it essentially hard? Something that makes concurrent software, by its very nature, more difficult to write than synchronous software?

The reason I hear most often is that humans think linearly, not concurrently, so are ill-equipped to reason about race conditions. I disagree: in my experience, humans are very good at concurrent reasoning. We do concurrent reasoning every time we drive a car!

More generally, some studies find that if you frame concurrent systems in human terms ("meatspace modeling"), people get quite good at finding the race conditions. So while concurrency might be difficult to reason about, I don't think it's because of a fault in our brains.

In my opinion, a better basis is state space explosion. Concurrency is hard because concurrent systems can be in a lot of different possible states, and the number of states grows much faster than anyone is prepared for.

Behaviors, Interleavings, and Nondeterminism

Take agents1 {1, 2, … n}, each of which executes a linear sequence of steps. Different agents may have different algorithms. Think writing or reading from a queue, incrementing a counter, anything like that. The first agent takes p1 atomic steps to complete, the second p2, etc. Agents complete their programs strictly linearly, but another agent can interleave after every atomic step. If we have algorithms A1A2 and B1B2, they could execute as A1B1A2B2 or A1B1B2A2, but not A1B2B1A2.

Under those conditions, here's an equation for how many possible orderings of execution (behaviors) can happen:

(p1+p2+...)!
------------
p1!*p2!*...

If we have three agents all executing the same 2-step algorithm, that's 6!/8 = 90 distinct behaviors. Each step in the sequence can potentially lead to a distinct state, so there's 6*90=540 maximum distinct states (MDS). Any one of those can potentially be the "buggy" one.

In most cases, the actual state space will be significantly smaller than the MDS, as different behaviors will cross the same states. Counterbalancing this is just how fast this formula grows. Three 3-step agents gives us 1700 possible behaviors (15K MDS), four 2-step agents instead have 2500 (20K MDS). And this is all without any kind of nondeterminism! If one step in one of the agents can nondeterministically do one of three things (send message M₁, send message M₂, crash), that triples the number of behaviors and the MDS.
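
A throwaway Python sketch to sanity-check these numbers:

from math import factorial

def behaviors(steps):
    # number of interleavings of agents with the given atomic step counts
    total = factorial(sum(steps))
    for p in steps:
        total //= factorial(p)
    return total

def max_distinct_states(steps):
    return sum(steps) * behaviors(steps)

print(behaviors([2, 2, 2]), max_distinct_states([2, 2, 2]))        # 90 540
print(behaviors([3, 3, 3]), max_distinct_states([3, 3, 3]))        # 1680 15120
print(behaviors([2, 2, 2, 2]), max_distinct_states([2, 2, 2, 2]))  # 2520 20160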

It's pretty common for complex concurrent systems to have millions or tens of millions of states. Even the toy model I use in this article on model optimization has about 8 million distinct states (though with some work you can get it much lower). I mostly think about state spaces in terms of performance because large state spaces take a lot longer to model-check. But it's also why concurrency is essentially hard to reason about. If my theoretical MDS is two million states, my practical state space is just 1% that size, and my human brain can reason through 99.9% of the remaining states… that still leaves 20 edge cases I missed.

Shrinking the State Space

Here's a heuristic I use a lot:

All means of making concurrency 'easier' are concerned first and foremost with managing the state space.

It's not 100% true (no heuristic is) but it's like 60% true, and that's good enough. State spaces grow quickly and bigger state spaces cause problems. If we want to make maintainable concurrent systems, we need to start by shrinking the space.

Like look at threads. Threads share memory, and the thread scheduler has a lot of freedom to suspend threads. So you have lots of steps (potentially one per line of code)2 and any interleaving can lead to a distinct state. I can use programming constructs like mutexes and barriers to "prune" the state space and give me the behaviors I want, but given how big the state space can be, I have to do a lot of pruning to get the right behaviors. I can make mistakes in implementation, "misshape" the space (like by adding a deadlock), or not notice a buggy state I need to remove. Threads are very error prone.

I could instead switch to memory-isolated processes. The scheduler can still schedule whatever interleavings it wants but the processes can't muck with each other's internal memory. Internally a process is still executing A1A2A3A4, but if the only steps that involve external resources (or interprocess communication) are A2 and A4, then we only need to think about how A2A4 affects the state space. Three four-step agents have 35k interleavings, three two-step agents have only 90. That's a big improvement!

What else can we do? Low-level atomic instructions do more in a single step, so there's no room for interleaving. Database transactions take a lot of physical time but represent only one logical time step. Data mutations create new steps, which immutable data structures avoid by definition.

Languages have constructs to better prune the resulting state space: go's channels, promises/futures/async-await, nurseries, etc. I think you can also treat promises as a way of forcing "noninterleaving": once a future is ready, it runs to completion (or to the next yield point) before anything else gets to interleave. Please don't quote me on this.

I think CRDTs reduce state space by permitting interleavings, but arranging things so that external changes are commutative: A1B1 and B1A1 lead to the same final result, so they're not distinct states.

Again, this is all a very rough heuristic.

Limits

To a first-order approximation, smaller state space == good. But this doesn't account for the topology of the space: some spaces are gnarlier than others. One that has lots of different cycles will be harder to work with than one that's acyclic.3 Different forms of nondeterminism also matter: "the agent continues or restarts" leads to more complex behavior than "the writer sends message M₁ or M₂." These are both places where "humans are bad at reasoning about concurrency" appears again. We often work through concurrency by reducing groups of states to "state equivalence classes" that all have the same properties. Complicated state spaces have more equivalence classes to work through.

(Another way "humans are bad at reasoning about concurrency" can be a real thing: we might not notice that something is nonatomic.)

Some high-level paradigms can lead to particular state space topologies that have fewer interleavings or ones that have more equivalent states. I've heard people claim that fork-join and pipe-filter are especially easy for humans to reason about, which I take to mean "doesn't lead to a gnarly state space". Maybe also event-loops? Where does the actor model fit into all of this?

Another limit: there's a difference between buggy states and buggy behaviors. Some behaviors can go entirely through safe states but still cause a bug like "never reaches consistency". This is called a "liveness bug", and I talk more about them here. Liveness bugs are much harder to reason about.

Okay, that's enough philosophical rambling about concurrency. Concurrency is hard, don't feel bad if you struggle with it, it's not you, it's combinatorics.


Video Appearance

I was on David Giard's Technology and Friends talking about TLA+. Check it out here!


  1. An agent is any concurrent actor, like a thread, a process, a future, a human, etc. Concurrency literature uses "processes" a lot but that clashes with OS processes. 

  2. Or more! If writes are nonatomic, you can get data races, where two threads write to different bits in a byte at the same time. Then a variable assignment can be multiple steps in one line. 

  3. Is there any research correlating unusual subgraphs in state spaces with probable bugs? 

Some notes on for loops

10 April 2024 at 17:08

New Blogpost

Don't let Alloy facts make your specs a fiction, about formal methods practices (and a lot of Alloy). Patreon link here.

Some notes on for loops

I sometimes like to sharpen the axe by looking at a basic programming concept and seeing what I can pull out. In this case, for loops.

There are two basic forms of for loops:

for(i = 0; i < 10; i++)

for val in ["a", "b", "c"]:

The second is usually called a foreach loop, and I'll call the former "forstep", after the ALGOL-60 construct. Almost all languages have one of the two, some have both. While forstep loops show up early, foreach loops only appear in PLs ~20 years later.

Looking at this, here's some things I see:

Construct reduction

Given a foreach loop, I can syntactically translate it into a forstep loop:

for val in arr:
    # ...

-->

for(i = 0; i < arr.len; i++) {
    val = arr[i];
    // ...
}

There's a lot of terms we use to express this property. We can say that foreach is homomorphic or reducible to forstep, or that foreach can be constructed out of forstep, or others I can't think of rn.

If we could also construct forstep from foreach, we could say the two are equivalent or isomorphic. In theory we can do this if we had a range operator:

for(i = 0; i < 10; i += 2) 

--> 

for i in range(0, 10, step=2);

Most forstep implementations, though, can do things like modify the iterator in the middle of the loop:

for(i = 0; i < 10; i++) {
    // stuff
    if(cond) i = 0;
    // stuff
}

You can't convert this to a foreach loop: it's an "irreducible" feature.

Most people consider modifying the index bad practice, though. 99% of the use cases of forstep loops are constructible with foreach loops, so the two are morally equivalent. I'll treat them as roughly interchangeable going forward.

Primitives and Expressivity

For loops are reducible to while loops:

for(i = 0; i < 10; i += 1) 

-->

i = 0;
while i < 10 {
  // ...
  i++;
}

But while loops are not reducible to for loops:

while(user_cmd != "quit")

-->

???

Again, lots of terms to cover this: while loops are more "primitive", more "powerful", more "low-level", more "expressive", etc. If you have while loops, you don't need for loops, and if you have goto, you don't need either.

But you shouldn't use a goto when a while loop works or a while loop where a for loop works. This is a generalization of the rule of least power (RLP): given a problem, use the tool that is most specific to the problem. If you can iterate through a list with a foreach loop, don't use a while instead!

There's two reasons for this. One is that it's more communicative to other programmers. If they see a foreach loop, they know it's going to be used for some kind of iteration, while that's not clear with the while loop. Two is the capability/tractability tradeoff: the more limited a construct is, the easier it is to understand and work with. A foreach loop can do less than a while loop, but it's much easier to guarantee that it processes each item only once.

Programming languages evolve to replace the all-powerful "low-level" constructs with more limited constructs that reduce to the low-level ones. It's why you get things like interfaces over inheritance.

The corollary of this is that if you use a powerful construct, it should be because the limited one can't do what you want. Don't force an infinite for-loop!

Extensions to for loops

For loops reduce to while loops, while loops reduce to goto, everything reduces to goto. So there are some kinds of looping goto can do that for and while cannot.

They get a little closer to feature-parity by adding flow-control inside the loop: continue for advancing the loop early, break for terminating it. I didn't look into when or how these were added, but I'd guess they come from looking at what goto can do that loops could not, and adding features as needed.

"Imperative programming" is just for loops

I think one of the most basic functions of a programming language is to iterate through an array. So much so that you can distinguish different programming paradigms by how they iterate.

  • Imperative programming: for loops
  • OOP: a foreach method on iterable objects
  • Functional programming: recursion, reduce (we'll talk about this later)
  • Logic programming: I'm less familiar with this, but I think it's recursion + pattern matching (which ML and Haskell adopted)
  • Array programming: all operations take arrays by default.

I think this explains why arrays are universal while dictionaries are relatively new: there's no obvious basic operation that covers what you do with dicts to the degree that for covers what arrays do.

(Question: do list comprehensions lead to a new paradigm? How do you describe "iteration" in spreadsheet languages?)

Functional replacements

When people talk about replacements for foreach loops in FP, they usually bring up map and filter. But these are not equivalent. Rather map and filter are "preferred" to loops because they are more limited. It's another instance of RLP.

As far as I can tell, reduce is equivalent to foreach. Morally, anyway. This is interesting because reduce is considered part of the "big three" (along with map and filter), but it's also the one that's the "odd one out". It's the hardest for people to understand, and Haskell even has three versions of it. Is it confusing because it's so powerful?
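
Here's the "morally equivalent" claim as a Python sketch: any foreach loop can be rewritten as a reduce whose accumulator carries whatever state the loop body was building up.

from functools import reduce

arr = [1, 2, 3, 4]

# foreach version: two pieces of mutable loop state
total, evens = 0, []
for x in arr:
    total += x
    if x % 2 == 0:
        evens.append(x)

# reduce version: the accumulator plays the role of that mutable state
def step(acc, x):
    t, es = acc
    return (t + x, (es + [x]) if x % 2 == 0 else es)

assert reduce(step, arr, (0, [])) == (total, evens)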

Map and filter are composable in a way that foreach loops are not, but they're also weirdly less "composable" in one very specific way? Consider something of the form

map f (filter p arr)

-->

out = []
for x in arr:
  if p(x):
    out.append(f(x))

The map/filter iterates over the list twice, while the foreach loop iterates only once. I know Haskell can fuse maps together as an optimization but I don't think you can safely fuse arbitrary map/filters? I dunno.

You can fuse map and filter with a reduce:

reduce ((out, x) => 
        if p(x) 
        then out ++ [f(x)]
        else out
       ), [], arr

This gets really awkward though.
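
For the record, here's that fusion as runnable Python with functools.reduce; it works, but I wouldn't want to maintain it:

from functools import reduce

p = lambda x: x % 2 == 0      # the filter predicate
f = lambda x: x * 10          # the map function
arr = [1, 2, 3, 4]

fused = reduce(lambda out, x: out + [f(x)] if p(x) else out, arr, [])
assert fused == [f(x) for x in arr if p(x)]   # single pass vs. the two-pass map/filter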

I wonder if "reduce is equivalent to for loops" plus "reduce is awkward for fusion" somehow inspired clojure transducers.

I'll be writing this forever

Things I could have looked at in more detail but ran out of time and energy:

  • "Iterators" as a general concept.
  • Loop invariants and variants. This is really relevant to formal methods in particular.
  • Specific implementations of for loops in languages. ALGOL's loop expression-lists, Python's for else, Raku's phasers, javascript's for in/for of distinction, Ada and Chapel loop parallelization constructs... If I keep going I'll get something as long as this.

Following any of those leads would get us to different insights, as opposed to the main thing that ended up catching my interest with these notes, which is how the rule of least power applies to language constructs. I should probably write a blog post about it!

Anyway I think this kind of exercise, looking at something everybody takes for granted and trying to learn things from it, is really valuable for understanding software better. I don't know how interesting it is to read about, though. Lemme know if you think you'd like to see this for other stuff, or just share the essays that are downstream of the practice (like writing about RLP).


Podcast appearance: are we engineers?

I was on a podcast, talking about the crossover project and what software engineers can learn from traditional engineering. Check it out here!

Why do regexes use `$` and `^` as line anchors?

25 March 2024 at 16:40

Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you've got a blog come join us! You don't need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.

But I am writing a 3000-word hell essay, so I'll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:

b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?

I'll sideline ^ for now and focus on $. The original QED editor didn't have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:

Two minor devices offer additional convenience. The character "." refers to the current line and the character "$" to the last line in the buffer.

So $ already meant "the end of the buffer", and Ken adapted it to mean "the end of the line" in regexes.

Okay, but then why did Deutsch and Lampson use $ for "end of buffer"?

Things get tenuous

The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references2) that the SDS-930 used a Teletype Model 35 as input devices. The only information I can find about the model 35 is this sales brochure, which has a blurry picture of the keyboard:

Blurry picture of a keyboard

I squinted at it really hard and saw that it's missing the []{}\|^_@~ symbols. Of the remaining symbols, $ is by far the most "useless": up until programming it exclusively meant "dollars", whereas even something like # meant three different things. But also, $ is so important in business that every typewriter has one. So it's a natural pick as the "spare symbol".

Yes this is really tenuous and I'm not happy with it, but it's the best answer I got.

If we're willing to stick with it, we can also use it to explain why Ken chose ^ to mean "beginning of line". ^ isn't used in American English, and the only reason QED wasn't using it was because it wasn't on the Teletype Model 35. But Ken's keyboard did have ^, even when it wasn't standardized at the time, so he was able to use it.

(Why did it have ^? My best guess is that it's because ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says "it follows ASCII" but doesn't include many of the symbols; it just uses the encoding format.)

So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!


  1. I had this question as part of a larger rabbit hole on the history of regular expressions, which might become a full blog post at some point in May. 

  2. I have a pathological hatred of secondary sources. 

What if the spec doesn't match the code?

19 March 2024 at 16:23

Whenever I talk about formal methods, I get the same question:

Can I use the spec to generate my code?

People are worried about two things. One, that they'll make a mistake implementing the specification and have bugs. Two, that over time the implementation will "drift" and people won't update the spec, leading to unexpected design issues.

These are valid fears, and both are things that happen when writing specifications! Which is why it extra sucks that my answer is usually "you can't". This newsletter will be a quick explanation as to why, plus why it's still a net win to write a spec, and an alternate trick you can do instead to keep things in sync.

Synchronization is too expensive

There's two ways we can keep code in sync with a spec. The first way is program extraction, where you take a spec and generate a complex program from it. Lots of theorem provers, like Coq and Agda, have this functionality. The second way is refinement, where you write code independently and then prove it somehow matches the spec. This is what the seL4 people did with their C code.

Extractable/refineable specs are much, much harder to write than an "unsynced" spec because you lose a level of abstraction. Specifications can exist at a higher level of abstraction than programs, which is why they can be so much more concise and focused. For example, here's a quick PlusCal snippet1 for a reader pulling a value off a queue and adding it to that reader's total:

await Nonempty(queue);
with r \in Readers do
     total[r] := total[r] + Head(queue);
     queue := Tail(queue);
end with;

What would that look like in a program? You'd probably have code for setting up the readers, code for reading from the queue, code for persisting the value of total somewhere. If it's in another system like Postgres or redis, you'd have a client library for that, too. And around all of that you'd have logging, heartbeats, event hooks, a whole bunch of orthogonal concerns.

You might not even want to implement the whole spec! This is the machine vs world dichotomy: the spec can encode both the system you're designing and also the environment around that system. Let's say you're using AWS SQS for the queue, so there's a chance you can't connect to it, and a chance it duplicates a message. You can't implement either of those things in your system, only the code that detects and reacts to them. But adding those are par for the course in specification. It might look like this:

await Nonempty(queue);
with r \in Readers do
    either
        \* unreachable!
        status[r] := "error";
    or
        \* duplicate!
        total[r] := total[r] + Head(queue);
    or
        \* happy path
        total[r] := total[r] + Head(queue);
        queue := Tail(queue);
    end either;
end with;

To synchronize spec and code, you need to bring the spec closer to the level of code, which means losing a lot of abstraction power. It's definitely possible, but considerably more expensive. Synchronization is not affordable for most companies.

But specs are still worth it

Even if you can't keep the spec in sync, it's still a net benefit. Consider what happens if you have a bug in production:

  • With a spec: either you didn't implement the design correctly, or an assumption in the design ("the queue is perfectly reliable") was invalid
  • Without a spec: you didn't implement the design correctly, an assumption in the design was invalid, or the design itself has a flaw.

That last one is the most dangerous and the most expensive to fix — this talk estimated that one design flaw cost their team a year of rework. They're also really hard to correctly diagnose without a spec. I've often seen people think a design bug was actually an implementation bug. They'd "fix" the bug and it'd go away for just long enough for everybody to forget about it before suddenly resurfacing.

And then you gotta make sure that the new design isn't buggy, either.

That's why formal methods are a net win, even without syncing with code. With a verified design you can build systems faster and you can fix bugs faster, because you're confident that the bugs you got are "only" implementation errors.2

We're working on it

A confession: the "this is impossible" section was glazing over a lot of interesting work.

First of all, cheaply generating code from specs might not be viable in general, but it's possible for specific subsets of specification. This is mostly the realm of research projects like PGo, which converts (a variant of) PlusCal to Go code. You can see some examples here. I don't think it's going to be production ready for a long time, but at least it's being worked on.

A more exciting approach is generating tests from specs. I know a few companies who've done this, with the most public example being MongoDB. They discuss their approach in eXtreme modeling in practice:

We wrote a new TLA+ specification for this purpose and used the state space explored by the TLC model checker to generate C++ and Golang test cases for aspects of the OT algorithm. These test cases can be automatically generated and run after any change to the model, and ensure that both the C++ and Golang implementations make appropriate changes to maintain equivalence.

More specifically, they took a TLA+ spec and dumped the state graph, then decomposed it into a set of linear behaviors and converted those into about 5,000 tests. This works better than code generation because tests are, in a strange way, "more abstract" than code. Lots of implementations pass the same test suite. So it's less of a jump.
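
I haven't seen their harness, but the general shape of the trick looks something like this (everything here is made up, including the Queue operations; a real harness maps TLA+ variables onto implementation calls):

# Toy sketch: turn one model-checker trace into the text of a unit test.
# A trace here is just a list of (operation, args, expected result) tuples.

def trace_to_test(name, trace):
    lines = [f"def test_{name}():", "    q = Queue()"]
    for op, args, expected in trace:
        arg_src = ", ".join(repr(a) for a in args)
        lines.append(f"    assert q.{op}({arg_src}) == {expected!r}")
    return "\n".join(lines)

print(trace_to_test("push_then_pop", [("push", (1,), None), ("pop", (), 1)]))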

The TLA+ model checker has a bunch of features to make test generation easier: simulation mode lets you grab random traces and ALIAS can fine-tune the trace outputs. There's also an IO extension for calling subprocesses from model checking, which can be used to drive a test harness.

Two important caveats. One, the MongoDB team wrote a bunch of custom code to make this work; there's no off-the-shelf solution for turning specs into tests. Second, they were specifying a low-level, deterministic program, so were able to keep the TLA+ relatively close to the code. Writing a test harness for a distributed system is a lot harder. It's still a promising avenue!


Comments Section

I got a lot of responses to my graph types essay. I compiled some of the most interesting ones into a "Comments Section", here. There's some cool stuff about datalog and Rubik's Cubes there!


  1. PlusCal is a DSL that compiles to pure TLA+. I'm using it here because the equivalent TLA+ constructs are a little harder for outsiders to read. 

  2. One weak point in this is the "An assumption in the design was wrong" problem, but generally it's much easier to adapt a working design that fails under one broken assumption than an ad hoc design that might not have worked in the first place. 

What Mob Programming is Bad At

14 March 2024 at 14:45

Pairing is two people working together to write code, while mobbing is three or more. Pairing has been part of the programming milieu since at least the 90's (with extreme programming), while mobbing is more of a 10's thing. I'm going to use them interchangeably from here on out even though they're not interchangeable, because I think everything I'll say applies to both.

Proponents of pairing/mobbing attribute all sorts of benefits to it: higher quality code, more shared ownership, no more need for code review, even (in the most extreme claims) eliminating bugs entirely. Naysayers claim it slows team velocity down, is emotionally exhausting, and only works with very particular personalities. To which the advocates respond that if mobbing doesn't work, it's a problem with your company culture.

And things kind of go in circles from there, because you can't go anywhere from "it's a problem with the culture." I've made this same mistake in the past too! "Company culture" is the "wizard did it" of software techniques: you can explain the success or failure of anything with "culture". If you think the only possible problem with your $THING is "bad culture", then you're saying $THING doesn't have any problems, which means it's snake oil.1

Now overall I'm mildly pro-pairing. The last couple of places I worked at paired regularly, and when I consult with clients, I "mob specify" as much of the project as possible. This is because pairing is really, really good at knowledge sharing. It's a great way to get people up to speed with a new language, or domain, or even a section of a codebase. I even did a day of mobbing with Hunter Industries to understand it better.2 So while I think the loudest people online make overblown claims, I still mentally treat pairing/mobbing as "underleveraged techniques" along with stuff like formal methods and metamorphic testing.

So here's one thing I just cannot make work with pairing: optimization.

I am always better off optimizing code on my own than as part of a mob. I noticed this a long time ago, but I think I finally figured out how to explain the reasons why.

Bulldozers and lightning bolts

In practice [mobbing] usually feels like a bulldozer rather than a racecar - unstoppable and thorough. (source)

Bulldozers are "unstoppable" because they have high momentum, which is a function of both how heavy something is and how fast it's going. So if mobbing is like a bulldozer, then it's low-velocity, high-momentum. The mob writes less code than a bunch of independent developers but there's a lot of weight behind the code. It's more likely to be "good" code.

Something with a lot of momentum isn't just hard to stop, though. It's also hard to change direction. And in optimization I don't know the right direction. We can't easily predict how an attempted optimization will affect the overall program performance. If I see twelve ways of making something faster, six might have no effect, two could have negative effects, two could lead to minor improvements, and two could lead to order-of-magnitude improvements as long as you didn't do the minor optimizations. That's why benchmarking impact is so important!

The best way out would be to try all 4096 possible combinations of optimizations and then throw away 4095 of them. Lots of code written, lots of code thrown away. I want high-velocity, low-momentum. Instead of feeling like a bulldozer, my mind needs to feel like this:

Image of a lightning bolt

The path to fast code is a stepped leader, constantly changing direction, with dead ends branching off like streamers. Sometimes the dead ends can be just as long as the final path but still go nowhere. You write a lot of code, you write a lot of deep code, and then you throw it all away.

That start-stop motion doesn't work in group programming. It can if I know exactly what optimizations we need to make and it's just a matter of making them. That'd be something like fixing an N+1 query or replacing one algorithm with another. Then we have a direction and momentum is helpful.

Otherwise I much prefer to figure out optimizations on my own and then take my conclusions back to the pair to explain. Mobbing is good for knowledge sharing!

Generalizing this

Now you might argue "mobs throw away code all of the time." The difference is that it's "short-generation". They might throw away a few minutes of code but they're worse at throwing away a couple hours of code, which is sometimes necessary in optimization. The pattern I see is that mobbing is at its worst when code has

  1. A low chance of "surviving" to prod
  2. A long lifespan, even if never survives

What are some other places like this? Exploratory programming. Some kinds of deep bugfixing and refactoring. Local refactors ("pull this into its own function/class") are easy and common in the mob. Complex refactors ("redraw the separation of concerns for these four modules") are harder. Anything where progress happens in sudden jumps and lurches rather than through a continuous motion.

(You can adapt some pairing practices to handle these better, but IMO that's sensible for people who already prefer working in a group. If you don't pair or only pair part-time, this is something that you should keep doing solo.)

If you disagree come fite me

I did some searching for articles on how to pair/mob on optimizations and couldn't find anything. So if you think that no mobbing is great for writing performant code, don't email me, put it up somewhere online and then email me.

Also, if you tried pairing or mobbing and it didn't work, you should also put that somewhere online. Right now all of the people who are writing publicly about it are evangelists.


  1. Etymology time! The term "snake oil" doesn't come from people selling snake oil as a miracle cure, but from people selling snake oil that was actually mineral oil. "Snake oil doesn't cure anything" comes way later than the idiom. 

  2. Hunter industries is usually wheeled out in these discussions as the model mob programming culture. They're pretty cool people, I'd def manage . 

How to argue for something without any scientific evidence

6 March 2024 at 21:27

Last week I got this interesting question:

I want to write a book about automated testing. Much of the book would be me explaining the best practices I’ve learned. I know these are good practices; I’ve seen them work over and over again. But have no [scientific] data at all to support my conclusions.

My question: lacking such data, how can I talk about something important like testing in an authoritative way? — Jake

This is near and dear to my heart. On one hand, I think empirical software engineering (ESE) is really important and have given several talks on it. On the other, I think ESE's often a trainwreck and find most of the research flawed. And my career is based on formal specification, which hasn't been studied at all.

Thing is, software still needs to get built regardless of how much we know about it. So how do you get people to believe you when there's no science backing you up?

I've written before that you need two things to be considered trustworthy:

  1. Trustworthy expertise: your judgement is correct.
  2. Trustworthy opinions: you aren't deceptive about your judgement.

Empirical evidence is a shortcut to both of these: we tend to think the scientific process is perfectly objective (it's not) and consider it unbiased (we shouldn't). At the very least, it's on average more objective and less biased than most other forms of gathering information. Without it, we need to work harder to earn people's trust.

Trustworthiness isn't the same as authority, but they're closely linked. You can't be an authority without being trustworthy.

ACOUP plug

One of my favorite blogs is A Collection of Unmitigated Pedantry, about the intersection of history and pop culture. I've written more about it here. I want to call out one section of this essay:1

Soldiers going into battle often look to their leaders to perform leadership. After all, the infantryman on the line has little sense if the battleplan is good or not ..., but he’s likely to be reassured if his leaders look like how he imagines good leadership to look. … seeing their king performing that role with excellence could be confidence building – “I may not understand the plan, but we have a good general and thus a good plan.”

Authority is performative. So if you want to talk about something like an authority, you need to perform being an authority. Do things that people expect an authority is able to do. I can't tell you the best way to perform authority, but I can tell you how I try to perform it, and you can judge if that'd work for you too.

How I try to perform authority

  1. Be up-front about my biases.
  2. Demonstrate deep familiarity with the field.
  3. Talk about the benefits and drawbacks.
  4. Talk about alternatives in a deep, honest way. Say what they do worse, but also what they do better, and in what circumstances.
  5. Discuss variants, that are still $THING but not how I do $THING.
  6. Discuss common arguments against, rejecting some as invalid, acknowledging some as valid, why it's good despite those valid arguments.
  7. Argue why, on the whole, I still take the position I do.

Let's try applying these principles to a testing topic like Test-Driven Development. I've written about my complicated feelings before but not written something trying to convince people to use it.2 If I had to, this would be my first-draft outline:

  1. Define test-driven development, both the general principle (write tests before code) and the specific form (strict red-green-refactor cycles).
  2. Give a persuasive example of a case where TDD performs best: adding regression tests to fix a bug in code.
  3. A more involved example of "greenfield development" with TDD. This would probably look like writing two modules, one with TDD and one where I write all of the tests at the end, and compare the two. Show where the two approaches lead to the same tests and where they lead to different tests.
  4. Claimed merits of TDD: leads to better code, low coupling/high cohesion, "design pressure", etc. Mention which ones I believe, which ones I don't believe, which I think require more nuance, etc.
  5. Common arguments against TDD: leads to "TDD code", slows you down, plenty of great software is written without it, etc. Respond to these in ways that show I respect the people making these claims even if I disagree with them.
    1. Go deeper into the class of "types/integration tests/formal methods are better" arguments. Use this to talk about defense-in-depth.
  6. Alternatives: writing tests at the end, having a separate QA team, cleanroom, not testing at all, etc. Show cases where TDD is better than these and cases where it's worse. Argue that the former are more common.
    1. Talk about where I personally don't do TDD.
  7. Common variants still in the TDD umbrella: behavior-driven development, acceptance-test driven development, test && commit || revert
  8. Further reading. I would not recommend the original book because I haven't read it. That might undermine my expertise but it supports my honesty. If I really wanted to come across as an authority, I'd read that book first before writing my spiel.

(This looks like an awful lot of work, but that's mostly because TDD is such a huge and controversial topic, so I have to do a lot of work to perform authority. If I focused on a specific thing, like "mocks or not" or "should you test private methods", it'd all be a lot easier.)

The goal here is to "show my work" and demonstrate that 1) I'm well-versed in TDD, 2) I understand the flaws of TDD, 3) I genuinely value TDD anyway, 4) that my opinions are motivated by deeper engineering principles. Even if someone disagrees with my opinions, they still gain knowledge and don't leave thinking that I'm trying to hoodwink them.

There are other ways to perform authority

One thing I do a lot is showcase advanced knowledge on something that most practitioners will never need, but can appreciate the expertise behind it. Like it's one thing to show I know the basics of TLA+, another thing to show I know about hyperproperties and refinement. I don't know what the equivalent for TDD would be, maybe testing statistical properties?

In writing this I notice that this is specifically about performing trustworthy expertise. What would performing trustworthy opinions look like?

If someone comes in demanding empirical evidence

First ask yourself: are you making the kinds of radical claims that require lots of scientific backing? Not "TDD/Formal methods is good", but "TDD is equivalent to hand-washing" or "if you don't use formal methods you're unprofessional." If so, then you're not performing authority well and people are picking up on that. Dial it down.

Second: try to find out if they genuinely care about empirical software engineering. Are they interrogating their own practices under the same lens? If not, then they're just using ESE as a cheap gotcha and wouldn't change their mind even if you have studies.

Otherwise, talk about the paucity of ESE knowledge in general, raise whatever few papers you do know about, say they're suggestive and limited and more research is needed, etc etc. Show that you've at least thought about the scientific side of things.


Consulting stuff

My current gigs are winding down so I'm looking for new consulting work in formal methods (TLA+ and Alloy). Details here.


  1. He covers this in more detail in Commanding Pre-Modern Armies but that's less quotable 

  2. At least not since 2017, and I like to think my writing and software skills have improved in the past six years. 

Know (of) the right tool for the job

27 February 2024 at 18:26

Last week's Picat essay did pretty well online! One of the common responses was "This looks really interesting but I don't know where I'd use it."

For Picat there was a gap of five years between me hearing about it and actually learning it. The same is true for most of the weird stuff I encounter: I keep it in the back of my head for years until I finally run into the problem which justifies using it. I think this is an important programming skill: being able to recognize when a tool you know of but don't know solves a problem you have.

It's the opposite of Paul Graham's Blub paradox:

Blub falls right in the middle of the abstractness continuum. … when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

When we switch to the point of view of a programmer using any of the languages higher up the power continuum, however, we find that he in turn looks down upon Blub. How can you get anything done in Blub? It doesn't even have y.

I always thought it was funny that Paul ends this with "and so Lisp is the most powerful language", as if he's not himself a Blub programmer. Regardless, this is definitely a real paradox I've experienced. A programmer once told me that Chapel was just a Julia knockoff, when he had never seen Chapel before and only studied Julia for a week. It's just really hard to think of how things you don't already know might help you. But if you can do this, you're more likely to learn the right tool for the job instead of sticking with the wrong one.1

Some Examples

I categorized Picat as "logic + imperative + constraint solving". I figured it'd be useful if I needed to do one of those things with a sprinkle of another. In the end I was doing a constraint problem and needed "a more expressive constraint solver", which means constraints + a sprinkling of logic programming. So I learned Picat and it solved the problem. As part of exploring the language I discovered planning, which is now another tool on the backburner for when I hit the right problem.

Sometimes it doesn't work out. Either the unknown tool doesn't actually solve my specific problem or it works in theory but has a problem in practice. At one point I wanted to analyze a state space graph and remembered that Gephi was a graph analysis software suite. But when I actually learned it, I found it was too limited in capabilities to be useful.

That's one thing that worked and one thing that didn't. Something that I haven't tried yet is lisps, in particular macros. I regularly run into problems which seem solvable with macros:

  1. A case statement that automatically raises an error if nothing matches
  2. A special if statement that treats domain-specific values as true or false, which my language doesn't support
  3. Defining multiple functions that all share a specific set of keyword parameters

Do any of these justify learning a lisp? Nah. They might even be terrible ideas in practice! But at least it keeps me thinking about the uses of tools I don't know.

How to develop the skill

Well the first step is to read about lots of different software topics. That's not enough, because you can go on a grand sneer, so you have to be open to thinking about how those topics could augment what you already know.

Past that I have no idea. I think my experience comes from being a teacher. Whenever I learn of a new topic I think about what motivating examples I'd use to teach it, and whenever I come across something "interesting"2 in real life I pattern-match against a bunch of topics to see if it'd make a good example. I just find explaining things fun.


Bonus: LLMs for querying data

As part of the graph project I wanted to see if Neo4j was getting more or less popular. I figured one good proxy would be the number of StackOverflow questions asked per month. It's imperfect (maybe fewer questions just means the lowest-hanging fruit's already been picked) but good enough for my purposes. StackOverflow's data is public. My blocker is the same thing with all sources of public data: I don't know the schema! But maybe GPT4 does:

Give me a query that returns the number of stack overflow questions tagged "neo4j", by month. Use the schema in the "stackexchange data explorer"

SELECT
    FORMAT(CreationDate, 'yyyy-MM') AS [Month],
    COUNT(*) AS [Number of Neo4j Questions]
FROM
    Posts
WHERE
    Tags LIKE '%<neo4j>%'
    AND PostTypeId = 1  -- 1 indicates a question
GROUP BY
    FORMAT(CreationDate, 'yyyy-MM')
ORDER BY
    FORMAT(CreationDate, 'yyyy-MM') ASC;

The result worked! This could be a great use of LLMs! Drunk with power, I tried it on the final boss of public datasets, Wikidata:

Give me a wikidata query for "number of dogs for each dog breed". No explanation, just the query.

SELECT ?breed ?breedLabel (COUNT(?dog) AS ?numberOfDogs) WHERE {
  ?dog wdt:P31 wd:Q144;
       wdt:P31/wdt:P279* ?breed.
  ?breed wdt:P279* wd:Q144.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?breed ?breedLabel

This worked... less well. It doesn't deduplicate dogs, and it gets all dogs of all types, not breeds. We actually want the line ?breed wdt:P31 wd:Q39367.3 Other experiments show that GPT often gets the wikidata tags wrong, sometimes giving stuff that doesn't correspond to anything.

That said, I was still able to kludge something mostly workable in ten minutes by modifying the LLM output, which is still a lot faster than I could have done on my own. This follows my general heuristic for applying AIs: use it on things that are hard to do, easy to check, and easy to fix.


  1. Though using the wrong tool can work just fine. And sometimes it's not worth the time to learn something you don't know if it'll only be slightly better than what you have now. 

  2. You know, flight layovers, self-checkout machines, rat-shaped potholes, bank teller queues... 

  3. This doesn't get all of the data either, but this is a problem with Wikidata and not the LLM. Many specific breeds are listed as instances of dog type (Q38672), which also includes things like "hunting dog" and "sighthound". 

Planning vs Model Checking

20 February 2024 at 23:16

New blogpost! Planner programming blows my mind, Patreon here. Next essay out should be the graph project.

The post is about Picat and more specifically, planner programming. Planning very roughly is:

  1. You provide a start state, a set of goals, and a set of state transitions (actions),
  2. The planner uses the actions to generate a state space,
  3. The planner reports a sequence of actions that reaches a goal, if any.

A lot of formal methods stuff like TLA+ is verified with model checking, which very roughly is:

  1. You provide initial states, a set of properties, and a set of state transitions
  2. The model checker uses the transitions to generate a state space,
  3. The checker reports a sequence of actions that violate a property, if any.

These look extremely similar, and in fact they are mathematically the same thing. Any model checker can find a plan that reaches goal P by writing !P as a property to check. Then the "violation" returns a trace that ends with P. That's how you can use TLA+ to solve the diehard puzzle.

Similarly, any planner can be turned into a model checker by making the goal !P, and then failure to reach that goal means P is a valid property. And in fact that's how the Z3 theorem prover proves properties: to determine if P is true, you check that !P has no satisfying solutions.1 I believe Alloy does the same thing, running as both a "model finder" and a "model checker".

Planning and model checking are "isomorphic" to each other, a tool for one is identically a tool for the other. Any differences between the two are in the engineering of planners vs model checkers. But these differences are quite significant, because the approaches come from completely different fields. Planning comes from robotics and AI, and (to my understanding) is predominantly used in video games. Model checking comes from formal methods. These fields are making the same class of tools but for very different reasons, and this leads to a lot of practical differences in what these tools do.

Planning

(I have limited experience with planning and am basing most of this off this page, PDDL, and some github research.)

Planning is mostly done for finding a plan to be immediately executed by a robot or game AI. There's three complications to planning problems not seen in model checking:

  1. It's not enough to find any plan, you want to find the cheapest plan. So planners can assign a "cost" to each action and then find plans with minimum costs. Most model checkers, if they do minimization, only minimize the number of steps in a trace.
  2. The planning may need to be run for many different agents in realtime. There's more emphasis on being able to tune the planner's performance: swap in a different search strategy, fine tune cost heuristics, etc.
  3. Planning can be "dynamic", in that world-state can change while an agent is undergoing a plan, in which case it needs to make a new plan. I don't know how this is generally handled.

Model checking

Model checking is mostly done for finding a bug in software to be fixed. If a model check finds a bug, someone has to read the error trace and then try to make changes to the software to fix it. So you see more emphasis on making legible traces, such as by generating visualizations.

There's a lot of different kinds of bugs in a system, so model checkers offer a lot of flexibility in expressing properties. Most of the planners I found look for end states;2 these correspond to "invariants" in a model checker, properties that hold true at all states. By contrast, most model checkers also accept a wide range of temporal operators, giving properties over sequences of states. Examples are:

  • The counter can only increase, never decrease.
  • The boot sequence cannot start until the switch is flipped, and once it starts, it must complete within five seconds.
  • The database is eventually consistent.
  • Every thread always gets a chance to make progress on its task.

Many model checkers also have interactive or simulation modes, to help programmers better understand the systems without them having to check for specific properties.

Two thoughts on this

Commands vs Queries

Best explained here. Model checking is primarily query-centric (you use it to gather information about the system), planning is primarily command-centric (you use it to set off a bunch of state changes).3 This influences how the tools are engineered: simulation makes sense for gathering information but not effecting change; effecting change means running lots of plans at once, etc.

Engineering matters

Two things can have the same underlying theory but be engineered so differently that one is a poor substitute for the other. Not just in features but performance characteristics, debuggability, etc. Neither is making the wrong choice, both are just adapted to a context. This is a cause of software mimicry.


  1. Technically z3 isn't a model checker so the analogy doesn't fit, but it's too cool a fact not to include here. 

  2. PDDL 3 can express temporal preferences but adoption is limited. 

  3. Not 100% of the time, of course! In the blog post I use planning to query a constraint problem. 

Strings do too many things

8 February 2024 at 15:30

No Newsletter next week

TLA+ Workshop and moving places.

Strings do too many things

In the unusual basis types email1 I wrote this about strings:

We use strings for identifiers, human writing, structured data, and grammars. If you instead use symbols for identifiers then you can be more confident a given string isn't an identifier.

I think this is a useful concept and want to explore it a bit more. Consider the following pseudocode:

query = "INSERT INTO " + tablenames["comments"] + " VALUES ('f1rst p0st', 'john@example.com');"

In that line,

  • "comments" is an identifier. We don't care about anything besides that it's distinguishable from other identifiers. This category also includes enumerations, like when the states of a light are ("red", "yellow", "green").
  • "f1rst p0st" is human writing. It means nothing to the system and something to the people that interact with the system. Most text in the real world is human writing.
  • "john@example.com" is structured data. We store it because this database doesn't have a special "email" type. In a programming language we'd instead store it as Email(local="john", domain="example.com").2
  • The entire query is a grammar. It's a snippet of the SQL language. For that matter, the whole line is also a piece of grammar which we store in the program's text file.

(The line between "structured data" and "grammar" is really fuzzy; is a CSV file data or a grammar? Maybe making a distinction isn't useful, but they feel different to me)

When you see a string in code, you want to know what kind of string it is. We use these strings for different purposes and we want to do different things to them. We might want to upcase or downcase identifiers for normalization purposes, but we don't split or find substrings in them. But you can do those operations anyway because all operations are available to all strings. It's like how if you store user ids as integers, you can take the average of two ids. The burden of using strings properly is on the developer.

Almost all programming languages have an ASCII grammar but they need to support Unicode strings, because human writing needs Unicode. How do you lexicographically sort a list of records when one of them starts with "𐎄𐎐𐎛𐎍"?

Other problems

I think the conflation of identifiers and strings is "merely" "annoying" in that it adds more cognitive burden.3 The real problems are with data and grammars.

  1. There's no way to tell a string's nature without context. Is "John wears a hat" human writing, or a kind of data triple, or some kind of unusual source code? Is the triple "{wears} {a hat}" or "{wears a} {hat}"?
  2. Strings of one nature can contain strings of another nature. element=<pre>2024-02-07,hello world</pre> has structured data inside a grammar inside a grammar inside a grammar. And it's all inside human writing inside a markdown grammar that'll be parsed by buttondown's email generator. I will be astounded if it renders properly for everyone.
    • We often want to store human writing in data/grammars, which means bolting an extra grammar on top of writing. JSON can't contain multiline strings; is the value "Thank you,\nHillel" a string with a linebreak or a string with the text \n?
  3. Grammars and structured data can be written incorrectly. If the program misunderstands the nature of a string, valid data can be transformed into invalid data. Lots of simple programs break on Windows because they interpret the path separator (\) as an escape sequence.
  4. Security: SQL injection is possible because the fragment "WHERE id="+id expects id to contain an identifier, but someone gets it to contain SQL code instead.

The bigger problem: serialization

Given the problems with strings doing too much, the direct approach is to offload stuff from strings. Some languages have native types like symbols; otherwise, you can make special types for identifiers and each different kind of data. Grammars are trickier to offload; I usually see this done as composable functions or methods.
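
Here's roughly what that offloading looks like in Python, reusing the toy Email shape from above (a sketch, not a full design; a real email parser is much hairier):

# A minimal sketch of "offloading" work from strings. The Email shape is
# the same toy example as above, not a real RFC 5322 parser.
from dataclasses import dataclass
from enum import Enum, auto

class Table(Enum):          # identifiers: comparable, but not string-manipulable
    COMMENTS = auto()
    USERS = auto()

@dataclass(frozen=True)
class Email:                # structured data with named parts
    local: str
    domain: str

    def __str__(self) -> str:
        return f"{self.local}@{self.domain}"

author = Email(local="john", domain="example.com")
target = Table.COMMENTS     # no way to accidentally .split() or lowercase this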

Then you get the problem of getting information out of the program. Files are strings. HTTP messages are strings. Even things like booleans and numbers now are strings. Strings are the lowest common denominator because all languages understand strings. C doesn't know what an abstract base class is but it does know what a string is.

I guess for interprocess communication you've at least got stuff like protobuf and thrift. Otherwise, you could try throwing another grammar in the mix? Instead of storing the email as "john@example.com", store it as

{
  "type": "email",
  "value": "john@example.com"
}

See also: JSON Schema.

Anyway this is all just another example of "push errors to the boundaries of your code". The real world might be stringly typed, but your program internals don't have to be.


  1. Thanks to everybody who sent me other unusual basis types! And the people who pointed out that I forgot sum types as a (near) universal. Feel like an idiot for missing that one 

  2. This is just a demonstrative example, in the real world A B <@C :"D E"@F -G-H!> is a valid email. 

  3. And while writing this I thought of a really annoying case with identifiers as strings: I've worked with a few APIs that take options as a string. Some read "foo bar" as the two identifiers foo and bar, and some read "foo bar" as the single identifier foo bar

What does 'TLA+' mean, anyway

30 January 2024 at 18:08

TLA+ Workshop

Feb 12th. I've decided to reduce the class size from 20 to 15, so there's only a couple of slots left! I'll be making a little less money this way but it should lead to a better teaching experience for the attendees. Use the code NEWSLETTERDISCOUNT for $100 off.

What does "TLA+" mean, anyway

It's a running joke in the TLA+ community that we're afraid of saying what "TLA+" means, lest it scares people off. It's not on the front page of Lamport's Website, or on the github repo, or on learntla (though I put it in the FAQ). So here we go:

TLA+ stands for "Temporal Logic of Actions, Plus".

Okay but that doesn't explain anything. So let's talk about what that actually means. I'm going to oversimplify both the history and math to avoid spending two more days on fact checking.

Modal Logics

A "logic" is a mathematical framework for reasoning about truth statements, like how "arithmetic" is a framework for reasoning about numbers. There are lots of logic but the one we're all familiar with is propositional logic, which is just manipulating raw true-false values. Propositional logic is what gives us stuff like

!(x || !y) == !x && y

Which we use to simplify our conditionals. Then comes predicate logic, where you add in the ability to say "P(x) is true for all x" or "for some x." Predicate logic is used almost everywhere in math. I'm writing a short book/long pamphlet on predicate logic, I think it's underused in programming.

Less common are the various modal logics, which add "modes", or qualifiers. □P means that "P is necessarily true" and ◇P means that "P is possibly true". Note that ◇P is defined as !□!P: P is possibly true if it's not necessarily false.

What does "necessarily true" mean? Ah, there's an interesting question. "Necessity" means different things in different contexts:

  • When talking about "possible worlds", □P could mean that P is true in all possible circumstances, while ◇P is true in some reality. □(2+2=4), ◇(unicorns are real).
  • When talking about proof, □P could mean that we have evidence that P is true, while ◇P means we don't have evidence that P is false. □(unicorns don't exist), ◇(aliens exist).
  • When talking about ethics, □P could mean that P is something you must do, while ◇P is something we may do. □(don't punch babies), ◇(kick unicorns).

A cool one I ran into at SPLASH 2021 was belief programming for drones. Let's say you have an altimeter that's accurate to within 10 meters. If it reads that you're 100 meters off the ground, then □(height >= 90) while ◇(height >= 109).

In formal methods, the most interesting mode is time, because you can use that mode to represent algorithms.

Temporal Logics

Say that "necessarily P" means "P is always true" and "possibly P" means "P is sometimes true", and we get temporal logic. Or, more precisely, a temporal logic, because there's quite a few. One of the big questions comes from whether to model a coin flip as two totally separate timelines or a single timeline that branches in two. This matters because they lead to different interpretations of "sometimes". If you flip a coin once, is ◇heads true?

  1. Linear model: No, because there's two timelines, and in one of them, you flip tails.
  2. Branching model: Yes, because there's one timeline that branches, and in one branch you flip heads.

In 1980 Lamport wrote the influential paper "Sometime" is sometimes "not never" where he argued that linear models are better for modeling concurrent programs.1 The next year Clarke and Emerson would announce the first model checker, which checked a branching time logic. So both approaches are really important in the history of formal methods, but Lamport preferred linear time, which decided the rest of this story.

Anyway, the dominant linear model of temporal logic is called... linear temporal logic.

Linear Temporal Logic

Time is broken down into a sequence of abstract steps, with three main temporal operators:

  1. G/□: true now and in all future steps.
  2. F/◇: true now or in at least one future step.
  3. N/X/◯: true in the next step.

(All of them have ASCII alternatives because computer scientists hate weird symbols as much as other mathematicians love them. Also there's some more operators like Until and Release that aren't strictly necessary so I'm leaving them out.)

Here's how we can model "x is true in this state and false in the next state":

x && N(!x) 

Here's the boolean flipping between true and false in the next step:

(x && N(!x)) || (!x && N(x))

Here's a boolean that starts true and flips every step:

Flip(x) = (x && N(!x)) || (!x && N(x))
Spec = bool && G(Flip(bool))

Here's a bicycle blinker which turns on when you click it once, blinks while it's on, and turns off when you click it again.

Flip(x) = (x && N(!x)) || (!x && N(x))
Init = !light && !active
TurnOn = !active && Flip(active) && N(light)
TurnOff = active && Flip(active) && N(!light)

Spec = Init && G(
    TurnOn
||  TurnOff
||  (active && N(active) && Flip(light))
)

I think I got that right! There's programs that convert LTL fragments if you feel like showing how I'm wrong.

Action Logics

LTL is arguably the most widespread temporal logic, and lots of formal methods languages use it for expressing properties. But it's not commonly used for expressing entire systems. There's lots of very specialized reasons for this and the most easily-explainable one is that updating variables sucks. Say x is an integer you want to increment. You'd want to write something like N(x) = x + 1, but you can't do that, N(x) returns a boolean. You could also try N(x = x + 1), but that doesn't work because both x's in the N refer to the next value of x. There's no way to get one side of the = to be the current value of x and other side to be the next value.

Instead you have to do something like this:

some z: Int | (z = x + 1) && N(x = z)

Oooooor we could just define x' to be "the value of x in the next state"

x' = x + 1

You read x' as "x-prime". Using prime to mean "the next value" goes at least back to Turing, but mathematicians and physicists were using it in the same-ish way for far longer.

Boolean formulae with primes are called actions. Lamport claims to be the first person to make an action-based temporal logic and I haven't found any information that refutes him. Epistemically ◇(he was the first), though I can't assert □(he was the first).

Anyway, Lamport published his new logic in 1990 and then more thoroughly in 1994. He called it the "Temporal Logic of Actions", or TLA.

PLUS

TLA is the formal logic: what model of time we're using, what □ and ◇ mean, what makes a valid formula, etc. TLA+ is the specification language built on top of the formal logic. It adds things like conditionals, sequences, integer arithmetic, etc. You could design a totally different specification logic using the same TLA core. Or you could embed just the TLA stuff in a theorem prover and leave out the + stuff.


  1. ¿Por qué no los dos? You can have a model that contains both linear and branching timelines, like CTL*. The main downside is that it leads to much more difficult computational problems. Satisfiability, for example, is 2EXPTIME-complete

GitHub Search for research and learning

24 January 2024 at 16:24

Hi everyone!

I have a new blog post out: An RNG that runs in your brain. It's a mix of cool tricks and math analysis done with an exotic gremlin language. Patreon is here. Also TLA+ workshop on Feb 12 etc etc use the code NEWSLETTERDISCOUNT for $100 off etc

Anyway I've been all over the place this week wrt contracts, writing projects, and errands where no matter what I'm working on it feels like there's something more important that I should be doing instead. One of the projects I'm jumping between is the "Why don't we have graph types" writeup. I'm trying (trying) to get the first draft done by the end of the month, and as part of that I'm researching how people use graph libraries in production codebases. And that means a lot of time with GitHub code search.

Code search is amazing. First of all, it covers 200 million public repositories. Second, it can filter on language and filepath. Third, it supports regular expressions. To find files that use Python's graphlib package, I just need to search

/(from|import) graphlib/ language:python

After that I can sift through the results to separate real-life use cases from false positives and toy problems.

I also used search for another project I'm jumping between, the logic book. One programming trick you get from logic is that you can rewrite not (all x: P) or (some x: Q) as (some x: Q or not P), which is clearer and faster to evaluate.1 I wanted a real-life example of code that can be rewritten this way. So I searched

/all\((.*)\) or any\(.*\):$/ language:python

And got like 300 code samples that could be simplified. I used a similar search to get examples of nested quantifiers for a different topic in the book. Now if only I could stop researching and start writing...
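
(For concreteness, here's the shape of that rewrite in Python. This is my own toy example, not one of the search hits.)

# My own toy example of the rewrite, not one of the search hits.
orders = [{"paid": True, "amount": 40}, {"paid": False, "amount": 120}]

# before: two passes over the data, with a double negation to untangle
needs_review = not all(o["paid"] for o in orders) or any(o["amount"] > 100 for o in orders)

# after: one pass, and (to my eye) clearer
needs_review = any(not o["paid"] or o["amount"] > 100 for o in orders)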

Code search is potentially even more useful as an education tool. You can learn a lot about how something works by reading code that uses it. For example, let's say you're using hypothesis for property-based testing, and you want to get better at composites. You can find examples by searching

/(from|import) hypothesis/ composite language:python path:test_*.py

Oh yeah you can include multiple regexes. Want to find files that have two composites? Replace composite with /composite(.|\s)*composite/. Just about the only things that don't seem to work are group matching and lookarounds like (?!).

So far I've just done searches from the website, but there's also a code search API. You could feasibly run a query, dump a hundred matches into files, and then do more complicated filtering and munging before spitting out a courseload's worth of case studies.
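
Here's a rough sketch of what that scripting could look like. It assumes a personal access token in a GITHUB_TOKEN environment variable, and note that (as far as I know) the classic REST /search/code endpoint only takes plain search qualifiers, not the web UI's regex syntax:

# Sketch of scripting the code search API. Assumes a token in GITHUB_TOKEN;
# the REST endpoint takes plain qualifiers rather than the web UI's regexes.
import json
import os
import urllib.parse
import urllib.request

query = "import graphlib language:python"
url = "https://api.github.com/search/code?per_page=100&q=" + urllib.parse.quote(query)
req = urllib.request.Request(url, headers={
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
})
with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

for item in results["items"]:
    print(item["repository"]["full_name"], item["path"])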

You can read more about the search syntax here. Try it out!


  1. It's even better when you can rewrite not P in a clear way! Like replacing not (x < 0) with x >= 0

Unusual basis types in programming languages

17 January 2024 at 16:00

TLA+ Workshop

TLA+ workshop on Feb 12! Learn how to find complex bugs in software systems before you start building them. I've been saying the code NEWSLETTERDISCOUNT gives $50 off, but that's wrong because I actually set it up for $100 off. Enjoy!

Unusual basis types in programming languages

Here are the five essential "basis" types in all modern programming languages:

  1. Booleans
  2. Numbers
  3. Strings
  4. Lists
  5. Structs

(I'm talking about these as general overarching concepts, so numbers includes doubles and integers, lists include arrays and tuples, etc. Languages often support several types in each category.)

These types are always supported with dedicated syntax, as opposed to being built out of the language abstractions. Some domains have additional basis types: systems languages have pointers, dynamic languages have key-value mappings, FP languages have first-class functions, etc. Outside of these, languages rarely add special syntax for other types, because syntax is precious and you can build everything out of structs anyway.

So it's pretty neat when it does happen. Here are some unusual basis types I've run into. This list isn't comprehensive or rigorous, just stuff I find interesting.

Sets

Seen in: Python, most formal methods languages

Sets are collections of unordered, unique elements. Sets are nice because you can add and subtract them from each other without worrying about duplicates. Many languages have them in the standard library but Python also gives them special syntax (shared with dictionaries).

# Python
>>> ["a", "b"] + ["a", "c"]
['a', 'b', 'a', 'c']
>>> {"a", "b"} | {"a", "c"}
{'c', 'b', 'a'}
>>> ["a", "b"] - ["b"]
TypeError: unsupported operand type(s) for -: 'list' and 'list'
>>> {"a", "b"} - {"b"}
{'a'}

Lots of things we use lists for are better represented with sets. Using a set tells the reader that ordering and duplicates don't matter to your program, which you can't always figure out from context. They're probably the thing I miss most in the average language, with symbols coming in a close second.

Symbols

Seen in: Lisps, Ruby, logic programming

Symbols are identifiers that are only equal to themselves. Most languages use strings for the same purpose, the difference being that strings can be manipulated while symbols can (in essence) only be compared.1

# Ruby
> puts "string1" + "string2"
string1string2
> puts :symbol1 + :symbol2
undefined method `+' for :symbol1:Symbol (NoMethodError)

So why use symbols if you already have strings? In Ruby it's because strings are mutable while symbols are not, meaning symbols are more space-efficient. This can really add up if you're making a lot of hash tables. The broader reason is that strings already do too many things. We use strings for identifiers, human writing, structured data, and grammars. If you instead use symbols for identifiers then you can be more confident a given string isn't an identifier.

Some logic programming languages— at least those that stem from Prolog— have symbols (called atoms) and not native strings. SWI-Prolog added proper strings in 2014.

Pairs

Seen in: Raku, Alloy kinda

Raku actually has a few different unusual primitives, but pairs are the one that's most generally useful.2 A pair is a single key-value mapping, such that a hash table is just a set of pairs.

> (1 => 2){1}
2 
> {"a" => 1, "b" => 2}.kv
(b 2 a 1)

I like pairs a lot. First of all, they satisfy the mathematician in me. You can pull the element out of a list, so why not pull the kv pair out of a mapping? Raku also does a lot of cool stuff with pairs. For example, :$x is shorthand for 'x' => $x and :foo is short for 'foo' => True. These make it easy to add optional flags to a function. I think keyword arguments are implemented as pairs, too.

Alloy also has sets of pairs ("relations") and a lot of cool ways to manipulate them. ^rel returns the transitive closure of rel, while set <: rel returns only the pairs in rel that begin with elements of set.

N-arrays

Seen in: APLs (like Dyalog, J, BQN), maybe Matlab?

In most languages a "2D" array is just an array containing more arrays. In APLs, a 2D array is just an array with two dimensions. Unlike the list-of-lists model, there's no "preferred axis", and you can mess with columns just as easily as you mess with rows.

   NB. J
   ]x =: i. 4 4
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

   NB. first row
   0 { x 
0 1 2 3

   NB. first column
   0 {"1 x
0 4 8 12

   NB. grab the middle
   (1 1 ,: 2 2) [;.0 x
5  6
9 10

Other things you can do: sort the rows by a specific column. Sort the columns by a specific row. Transpose the array so the rows are now columns. Reshape it into a 2x8 array. Multiply every element by 6.
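
If the J is too alien, here's my rough numpy translation of the same ideas (mine, not part of the original comparison):

# Rough numpy translations of the J examples above: a 2D array with no
# preferred axis.
import numpy as np

x = np.arange(16).reshape(4, 4)

x[0]                    # first row
x[:, 0]                 # first column
x[1:3, 1:3]             # grab the middle
x[x[:, 1].argsort()]    # sort the rows by the second column
x.T                     # transpose: rows become columns
x.reshape(2, 8)         # reshape into a 2x8 array
x * 6                   # multiply every element by 6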

Arrays can have any number of dimensions, though I find more than 2 dimensions awkward. If I have 3 or more I'm usually better off with a list of 2D arrays (via boxing). But 2D arrays are quite useful! Which is why the most popular 2D array by far is the dataframe.

Dataframes

Seen in: lil, nushell, Powershell, R (I think?)

Sort of a mix of a 2D array, a mapping, and a database table. Allow querying on both indexes (rows) and column names.

Python does not have dataframes as a basis type but pandas is the most popular dataframe library in use, so I'll use that for the demo:

import pandas as pd
df = pd.DataFrame({
    'airport': ['ORD', 'LHR', 'MEL'],
    'country': ['USA', 'UK', 'Australia'],
    'year': [1944, 1946, 1970]
})

>>> df['airport']
0    ORD
1    LHR
2    MEL

>>> df.iloc[1]
airport     LHR
country      UK
year       1946

>>> df[df['year']>1960]     
  airport    country  year
2     MEL  Australia  1970

Dataframes are unsurprisingly extremely popular in data science, where you want to slice and dice data in all sorts of weird ways. But you also see them in some shell languages because being able to sort and filter ps/ls/grep is really really nice.

Things missing

  • I don't know any stack or assembly languages, and I didn't look at NoCode/LowCode environments or DSLs.
  • I bet there's a lot of unusual stuff in Mathematica, but I'm not paying $400 just to check.
  • There's a lot more basis types I've heard of in languages I don't know: Go channels, lua userdata, SNOBOL patterns, raw bytes and byte streams.
  • No syntax for graphs, which is unsurprising because there's no library support for graphs either. Working on the writeup of why now.

Feel free to email me with weird basis types you know about that I didn't list here. The criteria is pretty loose but the very broad qualifier is "is there syntax specifically for this type?"


  1. In essence. Ruby supports a bunch of methods like upcase on symbols, which first convert it to a string, then upcases it, then converts it back to a symbol. 

  2. The other really unusual type is the junction, which is a sort of superposition of boolean conditionals. They exist so you can do if ($x == 3|5) instead of if $x == 3 or $x == 5

Why all([]) is true, prod([]) is 1, etc

9 January 2024 at 18:05

New Blogpost

The World and the Machine. It's about how to model systems better, and while it's written in TLA+ the advice should apply regardless of your speclang. Patreon is here.

Speaking of TLA+, I'm running a TLA+ workshop on Feb 12th! I'm real excited about this one because I wrote a new lesson on the TLA+ collection types which should make the whole process a bit smoother. Use the code NEWSLETTERDISCOUNT for $50 off.

Why all([]) is true, prod([]) is 1, etc1

This is inspired by this article. Some Python list-ops do unusual things with empty lists:

>>> all([])
True
>>> any([])
False
>>> math.prod([])
1
>>> sum([])
0

You can find in-depth explanations for each of these by itself, like the empty product. But all of these can be explained as consequences of the same mathematical principle. Let's say we have two lists xs and ys, and let + be list concatenation. We expect the following to always hold:

prod(xs + ys) = prod(xs) * prod(ys)

IE prod([2,4,6,8]) = prod([2, 4]) * prod([6, 8]).2 You can see why we'd want this to be true, as it means that prod behaves in a reasonable way regardless of how big our list is. But we also have an edge case:

prod(xs) 
  = prod(xs + [])
  = prod(xs) * prod([])

So for any xs we must have prod(xs) = prod(xs) * prod([]), which is only possible if prod([]) = 1. We can apply similar reasoning for other list reductions:

sum(xs) = sum(xs) + sum([]) = sum(xs) + 0
all(xs) = all(xs) and all([]) = all(xs) and True
any(xs) = any(xs) or any([]) = any(xs) or False

Simple, right?

But what about max?

Of course there's a problem. max(xs + ys) is obviously max(max(xs), max(ys)). So by the same reasoning we should have

max(xs) = max(max(xs), max([]))

But max([]) can't be any number. Let's say we decide it's 0, then we have

max([-1]) = max(max([-1]), 0) = 0

We can't define the maximum of an empty list! Then what's special about prod that makes an empty product possible when an empty max is not?

What's "special" is that prod has an identity: some value a such that x * a = a * x = x (1, for prod). The mathematical term for "a set of elements with an associative binary operation that has an identity" is a monoid. For any monoid, we can take the operation and create a version that applies it to a list, and then applying that to the empty list should return the identity:

listop(xs)
  = listop(xs) op listop([])
  = listop(xs) op iden
  = listop(xs)

The set of booleans forms a monoid under or, which has the identity element False, so any([]) = False. max does not have an identity, so it's not a monoid, and we can't meaningfully define it for an empty list.3

"A version that applies it to a list"...

...is just reduce. prod(xs) is reduce(*, xs, 1), any is reduce(or, xs, False), etc. Most languages require you to provide an initial value for the reduce (ie reduce(max, xs, xs[0])), but if you're reducing over a monoid then you shouldn't actually need it, as you can provide the identity as the default initial value.
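
In Python terms (my own illustration, not from the article that inspired this), that looks like:

# My own illustration: reduce over a monoid with the identity as the
# initial value, so the empty list "just works".
from functools import reduce
import operator

def prod(xs):   return reduce(operator.mul, xs, 1)
def all_(xs):   return reduce(operator.and_, map(bool, xs), True)
def any_(xs):   return reduce(operator.or_, map(bool, xs), False)
def concat(xs): return reduce(operator.add, xs, "")

print(prod([]), all_([]), any_([]))   # 1 True False
print(repr(concat([])))               # ''
# reduce(max, []) raises a TypeError: no identity, so no safe default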

And in fact this is (sort of) what Haskell does. In Haskell, foldl/foldr over lists require a default value, but if you know you're working over a monoid, you can use mconcat without one, as it can use the identity as its default.

>>> mconcat [1,2,3] :: Product Int
Product {getProduct = 6}
>>> mconcat [] :: Product Int
Product {getProduct = 1}

Some other examples of monoids:

  • Functions form a monoid under function composition, with identity f(x) = x.
  • NxN matrices under matrix multiplication, with identity "the NxN identity matrix".
  • Strings under string concatenation, with identity "".
  • Dictionaries under dict-merge (left or right), with identity {}.

For all of these, in principle you shouldn't need to define a default value when reducing, though getting a non-Haskell language to automatically determine that you're reducing over a monoid is more trouble than it's worth.

I think it's interesting that most of those examples are collection types and not primitives. Probably just because there are more collections than primitives.


  1. The original title was "why any([]) is true", which was a dumb typo on my part. As the body of the essay discusses, any([]) is false. I meant "all([]) is true" and fixed the title of the post. 

  2. The technical term for this is that product should be homomorphic. An example of a non-homomorphic function would be mean: mean(xs + ys) != mean(mean(xs), mean(ys)) 

  3. At least not over a normal set of numbers. There's an identity if you're doing modular arithmetic, in which case you're fine defining the empty max. Also no less than five readers pointed out that you can use -infinity as the identity for max. ♥ u guyz 

Advice for new software devs who've read all those other advice essays

19 December 2023 at 23:40

Someone recently asked me if I had advice for early-career programmers. At first I thought this was a silly question. I only entered the workforce ten years ago; many of my newsletter subscribers have been programming for longer than I've been alive!

Then I went and read some "advice for starting programmer" essays and thought of some things they missed. So here's thirteen bits of advice for early-career programmers. Some of it is contradictory.

  1. People don't listen to me because I'm a good programmer, they listen to me because I'm a good writer. The same is true of pretty much everybody you'll read. This doesn't mean you should automatically reject everything, but it means you should carefully think about it and evaluate how it applies to your situation. And take any argument about "objective truth" with a grain of salt: there is very little about software that's been scientifically studied, and most of the research is inconclusive.

  2. But also don't worry too hard about getting "tricked" or learning "the wrong thing". If you like someone's ideas, try them out! As long as you're not actively sabotaging your coworkers then it'll probably work out in the end, even if you look back and think "I should have done that different." That's what learning is all about!

  3. Read the book Debugging: The 9 Rules. Borrow it from the library or ask your company to buy a copy, whatever. It's a real easy read and teaches an important skill all the other "beginner programmer" books barely cover.

  4. At some point you will discover the Right Way to program, the thing which makes this all make sense, and you'll be convinced that the whole field would be so much better off if everybody else programmed the Right Way, too. For me the Right Way was test-driven development; for you it might be functional programming, lisp, formal methods or one of a million other things.

    I'm not going to tell you to not get swept up in the Right Way, because that's pretty much impossible. And honestly it feels really great to discover the Right Way and life's too short to not feel good. Just be mindful of the fact you're getting swept up and try not to make your identity the Right Way Guy. Eventually the honeymoon will end and you'll learn that programming is frustrating and messy regardless of which Right Way people use, and that you can also make great software without doing it the Right Way. Over time you'll learn fifty other Right Ways and learn to mix and match them to the problem at hand.

  5. When you first encounter the Right Way, it will likely be from someone who went full Right Way Guy. Try not to hold it against them later. And try not to conflate the actual technique with how the RWG pitches the technique. Most ideas need some modification from their purest form to integrate well with other ideas.

  6. Julia Evans once said "behind every best practice is a horror story." If you don't understand a Best Practice, look for the horror story that inspired it. It might make the best practice make sense. It might turn out to be something that's completely irrelevant to you, and then you can feel comfortable doing a different practice instead.

  7. Building on the last tip: a lot of best practices and conventions are "path dependent", arising from a mix of historical and cultural factors. There are things we do because our mentors do it, who do it because their mentors did it, who did it to address issues that aren't as relevant anymore. If something sounds like a just-so story, it very well might be. You can often retrace the whole path if you're willing to look.

  8. Take walks.

  9. Almost every tool you use has some form of hidden depth, from your programming language to git to JIRA. Don't feel like you have to become an expert in every single one, but do consider spending 5-10 minutes learning a bit more about what it can do.

  10. Talk to people in other parts of the company: support, business domain, sales, etc. Consider shadowing them if you have the time (and feel comfortable asking). You'll be surprised by what you learn!

  11. If possible, try to do a few different types of programming earlier in your career. This doesn't have to mean switching jobs: most companies are doing several different kinds of programming at once. So like if you're starting in a webdev company, try some frontend, some backend, some operations, some database stuff, and so forth. This helps you learn, but far more importantly increases your chances of finding a kind of software work that you really really like. My first job was as a frontend dev and I was miserable. Later I moved to backend and was much happier, as were the people who wanted to spend more time doing frontend!

  12. You've probably heard the advice that software as a field is changing all the time, and that you shouldn't get swept up in the framework treadmill, just focus on learning fundamental skills. This is true, but it doesn't explain the "why". For structural reasons, information in software propagates really quickly. This is due to a lot of factors (internet, open source, conferences) but overall there's a lower barrier to sharing ideas in software. So it's really easy to get a lot of people to know about someone's pet project, even if only one person uses it.

    The upshot of this is that a lot of the technologies you hear about have very small userbases and will never get wide adoption, but it won't seem that way from how you hear about it. That's why it makes sense to be conservative. If you hear about something that gets you excited, go ahead and be an early adopter, otherwise it's okay to wait a couple years to see if it has legs.

  13. Ultimately none of us can predict the future, just as none of us could have predicted the present. Just try to do the best you can, live according to your values, and enjoy the ride.

That's all from me for the year; new workshop dates when I get back. See you in 2024!

I formally modeled Dreidel for no good reason

12 December 2023 at 20:14

New blog post

Notes on Every Strangeloop 2023 Talk I Attended. Would have been out earlier but I was busy and sick. Patreon is here.


I formally modeled Dreidel for no good reason

It's the sixth night of Channukah and I'm thinking about dreidels:

A dreidel

On Channukah we use dreidels to play the game dreidel:

  1. Every player starts with N pieces (usually chocolate coins). This is usually 10-15 pieces per player.
  2. At the beginning of the game, and whenever the pot is empty, every player antes one coin into the pot.
  3. Turns consist of spinning the dreidel. Outcomes are:

    • נ (Nun): nothing happens.
    • ה (Hey): player takes half the pot, rounded up.
    • ג (Gimmel): player takes the whole pot, everybody antes.
    • ש (Shin): player adds one of their coins to the pot.
  4. If a player ever has zero coins, they are eliminated. Play continues until only one player remains, or (much more likely) everybody gets bored.

Traditionally this game is played by kids, as it's an excellent way to convince them that gambling is boring as all hell. In 2015 Ben Blatt simulated the game and found that for P=4 and N=10, the average game will take 860 spins to finish.

But this was based on a mere 50,000 simulated games! We can do better than a mere simulation. We can mathematically model the game and get exact numbers. So we're busting out PRISM.
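
(For contrast, the brute-force version really is only a handful of lines of Python. Here's my own quick sketch, not Blatt's code, simplified to stop at the first elimination, which is the same simplification the PRISM model below makes:)

# My own Monte Carlo sketch, not Blatt's code. Four players, stop at the
# first elimination.
import random
from statistics import mean

def spins_until_first_elimination(n_players=4, n_coins=10):
    coins = [n_coins] * n_players
    pot, spins, turn = 0, 0, 0
    while all(c > 0 for c in coins):
        if pot == 0:                       # everybody antes
            coins = [c - 1 for c in coins]
            pot = n_players
            continue
        spins += 1
        roll = random.choice(["nun", "hey", "gimmel", "shin"])
        if roll == "hey":                  # half the pot, rounded up
            take = (pot + 1) // 2
            coins[turn] += take
            pot -= take
        elif roll == "gimmel":             # the whole pot
            coins[turn] += pot
            pot = 0
        elif roll == "shin":               # pay one in
            coins[turn] -= 1
            pot += 1
        turn = (turn + 1) % n_players
    return spins

print(mean(spins_until_first_elimination() for _ in range(50_000)))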

Modeling Dreidel in PRISM

PRISM is a probabilistic model checker. You create a model of a probabilistic system, like queueing workers or a gambling game, and PRISM can give you exact probabilities of various events happening. I've written two blog posts on PRISM before.

If you read those blog posts, you'll quickly learn that I really do not like PRISM. The problem is that it is an incredibly limited language. It doesn't have strings, or arrays, or functions. This makes generalizing or abstracting things virtually impossible, and you're going to be hardcoding a lot of stuff. Despite all that, I keep finding excuses to use PRISM anyway. I've got an enormous number of small PRISM tricks I will never share with anyone because nobody else will ever use PRISM. Maybe there's a grad student who needs them, I dunno.

I'm not going to go through the entire model, you can see it here. Just some highlights:

formula Players = 4; //Players
formula maxval = M*Players;
formula done = (p1=0) | (p2=0) | (p3=0) | (p4=0);

Since I can't abstract over anything, I can't make a general model that works for P players, so instead I hardcode in 4 players. Also, properly modeling "playing until one player is left" would have been a nightmare, so instead I'm just modeling until one player is eliminated.

    [spin] ((pot != 0) & !done & (turn = 1)) ->
      0.25: (p1' = p1-1) & (pot' = min(pot+1, maxval)) & (turn' = 2) //shin
    + 0.25: (turn' = 2) //nun
    + 0.25: (p1' = min(p1+halfpot,maxval)) & (pot' = pot-halfpot) & (turn' = 2) // hay
    + 0.25: (p1' = min(p1+pot,maxval)) & (pot' = 0) & (turn' = 2); //gimmel

This is the core of a PRISM spec: a set of conditions for this to happen, a set of weighted outcomes, and the updates for each outcome. Though it's mathematically impossible for a shin to make the pot have more than maxval coins, whatever model checking algorithm PRISM uses couldn't tell that, so I had to manually cap it with (pot' = min(pot+1, maxval)). I spent a lot of time on this spec coddling the model checker.

Also notice that I had to hardcode that this is all for player one (p1). This means that I have to write this action three more times:

    [spin] ((pot != 0) & !done & (turn = 2)) ->
      0.25: (p2' = p2-1) & (pot' = min(pot+1, maxval)) & (turn' = 3)
    + 0.25: (turn' = 3)
    // etc etc etc

All of the actions are guarded on pot!=0. When it's empty, we instead have an ante action:

    [ante] (pot = 0) & !done -> (pot'=pot+4) & (p1' = p1-1) & (p2' = p2-1) & (p3' = p3-1) & (p4' = p4-1);

Since this means not all actions will be spins, I added a reward to track just the spins:

rewards "num_spins"
    [spin] true : 1;
endrewards

Querying the system

So why bother with all this when the Python would be like ten lines? The reason is that once you've gone through all the work of modeling something in PRISM, you get a powerful querying system to learn properties of the model. First, let's just find the expected value (weighted average) for the number of spins to knock out one player:

$ ./prism dreidel.prism -const M=10 -pf 'R=? [C]'    

Value in the initial state: 63.71149441710945
Time for model checking: 174.563 seconds.

So if we have four players, and each player starts with 10 coins, it'll take on average 64 spins to knock a single player out. I stopwatched a dreidel and it takes about 8 seconds for a spin, so you're looking at about nine minutes of watching a top spin before someone gets to leave the table.

Of course that's only the likeliest outcome. I can also get the cumulative probability like this:

const C0;
P=? [F<=C0 done]

That returns the probability that within C0 steps of starting, done is Finally true. Note this is total model checking steps and includes antes too. While I can run this from the command line, it's nicer to see in the GUI.

A graph of the gui showing that 90% of the time, you'll knock out one player within 150 steps

Those are still pretty easy to find from a regular simulation, so let's try a more complex property. What's the probability that the game ends because a player rolled a Shin, as opposed to an ante?

prism dreidel.prism 
  -pf "P=? [F (pot != 0 & !done & (X done))]" 
  -const M=6:4:10 
  -exportresults stdout

M       Result
6       0.2944666679018678
10      0.2930102532933088

(Don't forget to check the probability it ends because of an ante!)

What's the probability that the first player will have 75% of all pieces in the game, and still lose?

-pf "P=? [F (p1=maxval-M & (F p1=0))]"
-const M=5:5:15 
-exportresults stdout
M       Result
6       1.1876357695852015E-10
10      1.0185162507865054E-8
15      4.902270837317093E-8

It's extraordinarily unlikely but becomes less extraordinarily likely when you start with more coins. And again we're just modeling until one player runs out of coins, so the odds are slightly (slightly) higher than this.

Finally, the chance the pot maxes out at C0 coins:

-pf 'P=? [(F pot=C0) & (G pot <= C0)]'
-const M=10,C0=0:1:20

A graph showing the maximum pot size probability peaks at 6 coins and almost never goes above 9

Was this worth it?

Ugh no I hate PRISM

But also I think I figured out a different approach to modeling this that would better handle adding players and running until just one player is left. Maybe that'll be a blog post for next Channukah!


End of Year Administrative Stuff

Next week will be the last newsletter of the year. Quick updates on this year's projects:

  • Logic for Programmers: Haven't made any progress on it for a while, but a friend has kindly offered to help edit and keep me to a schedule, so I'm more optimistic for the next year.
  • Why programming languages don't have graph types: Done with interviews, writing an outline!

I want to write a new formal methods talk for next year, which would focus on the general history and idea of formal specification instead of just pitching TLA+ as a product. If you know any conference people who'd be interested, put me in touch!

When would you ever want bubblesort?

4 December 2023 at 20:23

There are very few universal rules in software engineering, but there are a lot of near-universal principles. Something like "prefer composition to inheritance" is near-universal. I love finding the rare situations where these principles don't hold, like where you do want inheritance over composition. A similar near-universal principle is "don't use bubblesort". Some would even say it's a universal rule, with Donald Knuth writing "bubble sort seems to have nothing to recommend it, except a catchy name and the fact that it leads to some interesting theoretical problems".1 But Knuth's been wrong before, so let's see if this universal rule is only near-universal.

Theoretically, bubblesort is faster than quick or mergesort for small arrays. This makes it useful as part of a larger sorting strategy: most of the fast-in-principle sorting algorithms work by recursively sorting subpartitions of an array, ie if you apply quicksort to 2^20 random integers, at some point you're sorting 2^17 8-integer subpartitions. Switching over to bubblesort for those subpartitions would be a nice optimization.
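
Something like this toy sketch (mine, not from any library; real implementations also pick smarter pivots, the last-element pivot here is just for brevity):

# Toy hybrid sort (my own sketch): quicksort that hands small subpartitions
# to bubblesort. Real libraries overwhelmingly use insertion sort here.
SMALL = 8

def bubblesort(a, lo, hi):
    """Sort a[lo..hi] in place."""
    for end in range(hi, lo, -1):
        for i in range(lo, end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]

def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo < SMALL:
        bubblesort(a, lo, hi)
        return
    # Lomuto partition around the last element
    pivot, p = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[p] = a[p], a[i]
            p += 1
    a[p], a[hi] = a[hi], a[p]
    hybrid_quicksort(a, lo, p - 1)
    hybrid_quicksort(a, p + 1, hi)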

Many production sorting algorithms do use a hybrid approach, but they overwhelmingly use insertion sort instead. Insertion sort is very fast for small arrays and it's also better at using the hardware. On some very particular hardware bubblesort still ends up better, like in this NVIDIA study, but you probably don't have that hardware.

So that's one use-case, albeit one still dominated by a different algorithm. It's interesting that NVIDIA used it here, because gamedev has a situation where bubblesort is uniquely useful, based on two of its properties:

  1. While the algorithm is very slow overall, each individual step is very fast and easily suspendable.
  2. Each swap leaves the array more ordered than it was before. Other sorts can move values away from their final positions in intermediate stages.

This makes it really good when you want to do a fixed amount of sorting work per frame. Say you have a bunch of objects on a screen, where some objects can occlude others. You want to render the objects closest to the camera first, because then you can determine which objects they hide and save time rendering those objects. There's no correctness cost for rendering objects out of order, just a potential performance cost. So while your array doesn't need to be ordered, the more ordered it is the happier you are. But you also can't spend too much time running a sorting algorithm, because you have a pretty strict realtime constraint. Bubblesort works pretty well here. You can run it a little bit at a time each frame and get a better ordering than when you started.
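
A minimal sketch of that budgeted sorting, where distance and draw are stand-ins for whatever your engine provides: run exactly one bubblesort pass per frame, keyed on distance to the camera.

def bubble_pass(objects, key):
    # One bubblesort pass: adjacent swaps only, so the list never gets *less* sorted.
    for i in range(len(objects) - 1):
        if key(objects[i]) > key(objects[i + 1]):
            objects[i], objects[i + 1] = objects[i + 1], objects[i]

def render_frame(objects, camera):
    # Spend a fixed, small amount of sorting work per frame...
    bubble_pass(objects, key=lambda obj: distance(obj, camera))
    # ...then render front-to-back with whatever ordering we have so far.
    for obj in objects:
        draw(obj)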

That reminds me of one last use-case I've heard, apocryphally. Let's say you have a random collection of randomly-colored particles, and you want to animate them sorting into a rainbow spectrum. If you make each frame of the animation one pass of bubblesort, the particles will all move smoothly into the right positions. I couldn't find any examples in the wild, so with the help of GPT4 I hammered out a crappy visualization. Code is here, output is here.

(After doing that I suspect this isn't actually done in practice, in favor of running a better sort to calculate each particle's final displacement and then animating each particle moving directly, instead of waiting to move for each bubblesort pass. I haven't mocked out an example but I think that'd look a lot smoother.)

So there you go, three niche use cases for bubblesort. You'll probably never need it.


New Quanta Article!

Okay so I didn't actually write this one, but I played a role in it happening! A while back a friend visited, and we were chatting about his job at Quanta. At the time he was working on this mammoth article on metacomplexity theory, so naturally the topic of problems harder than NP-complete came up and I recommended he check out Petri net reachability. So he did, and then he wrote An Easy-Sounding Problem Yields Numbers Too Big for Our Universe. Gosh this is so exciting!


  1. Taken from here

How do you take the average of two timestamps?

10 November 2023 at 00:22

New Blog Post

A better explanation of the Liskov Substitution Principle. First new post since August! Patreon notes are here.

How do you take the average of two timestamps?

Let's send off the 6x era with something silly. Some physical units of measure, like timestamps, can be subtracted but not added. What does it mean to add 2 PM to 6 PM? That makes no sense. But you can subtract 2 PM from 6 PM and get 4 hours. The output we get is a different type entirely, a "timedelta", that we can add to either timestamp.

I've seen these kinds of units variously referred to as interval measures or affine measures. Other affine measures are things like map coordinates, test scores, and temperatures. All things where finding the sum isn't sensible but finding the difference is. You can think of the "point" and the "delta" as being different types, where operations with two points or two deltas produce deltas, while operations with a point and delta produce a point.
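
Here's a toy sketch of that in Python (my own made-up Point/Delta classes, not a real units library), where adding two points simply isn't allowed:

from dataclasses import dataclass

@dataclass(frozen=True)
class Delta:
    hours: float
    def __add__(self, other):        # delta + delta = delta
        return Delta(self.hours + other.hours)
    def __truediv__(self, n):        # delta / number = delta
        return Delta(self.hours / n)

@dataclass(frozen=True)
class Point:
    hours: float                     # measured from some agreed-upon zero
    def __sub__(self, other):        # point - point = delta
        return Delta(self.hours - other.hours)
    def __add__(self, delta):        # point + delta = point
        if not isinstance(delta, Delta):
            raise TypeError("can only add a Delta to a Point")
        return Point(self.hours + delta.hours)

lunch, dinner = Point(12.0), Point(18.0)
print(dinner - lunch)                 # Delta(hours=6.0)
print(lunch + (dinner - lunch) / 2)   # Point(hours=15.0)
# lunch + dinner                      # TypeError: adding two points is meaningless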

About that averaging

Today the sun rose at 6:32 AM and set at 4:35 PM. When was high noon?

The easiest way to find a midpoint between two numbers is to just do (a + b)/2, but we can't just add the two timestamps. Instead we have to do this:

>>> from datetime import datetime
>>> sunrise = datetime(2023, 11, 9, 6, 32, 0)
>>> sunset = datetime(2023, 11, 9, 16, 35, 0)
>>> sunrise + (sunset - sunrise) / 2
datetime.datetime(2023, 11, 9, 11, 33, 30)

This way we're not adding timestamps to timestamps, we're adding timedeltas to timestamps. All the math checks out.

By contrast, here's what happens if we do things wrong:

>>> h, m = 16+6, 32+35
>>> h, m
(22, 67)
>>> h/2, m/2
(11.0, 33.5)

wait that's the same thing

Okay so here's why we have affine measures in the first place: we use them when we're not measuring from an objective zero point. For some measures, like dates, there is no objective zero, while for others there's a true zero that's too inconvenient for general use. In this context, addition is sensitive to our choice of zero, but subtraction is not— nor, curiously, is averaging.

This is easier to see with temperature measurements. Absolute zero in Kelvin is 0 K, while for Celsius it's -273° C. So Celsius is an affine measure: we get totally different numbers if we add two temperatures in °C than if we add them in K.

>>> K = lambda x: x + 273
>>> K(30 + 20)
323
>>> K(30) + K(20)
596

But we can meaningfully subtract two temperatures, and get the same results regardless of which measure we use.

>>> 30 - 20
10
>>> K(30) - K(20)
10

And in fact it doesn't matter which offset we use, either: if K = C - 123.46223 we'd still get the same result for K(30) - K(20) because the offsets cancel out.1

The offset explains why addition is weird and subtraction makes sense: if f(x) = mx + b, then f(x) + f(y) = m(x + y) + 2b, while f(x) - f(y) is just m(x - y). This also explains why averages work:

mean(f(x), f(y), f(z))
  = (mx+b + my+b + mz+b)/3
  = m(x + y + z)/3 + b
  = f(mean(x, y, z))

And again, it doesn't matter what m and b are: averaging is well-behaved for affine measures regardless of their "affinity".
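
Continuing with the K lambda from above, a quick sanity check that the offset drops out of the average:

>>> (K(30) + K(20)) / 2      # average in Kelvin: 298 K, i.e. 25°C
298.0
>>> K((30 + 20) / 2)         # average in Celsius, then convert
298.0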

But what about our poor computers

Just because we can mathematically average affine units doesn't mean computers can do it. I don't know of any languages where you can say "these units can't be added, unless it's part of taking the mean, then it's fine." So what if we avoided adding the units? Here's a different algorithm to find the average of N timestamps:

  1. Pick some arbitrary origin, like 1970-01-01 00:00:00 UTC.
  2. Subtract the origin from every timestamp, giving us a set of N timedeltas.
  3. Take the average of the timedeltas, giving us the mean timedelta x.
  4. Add x to the arbitrary origin.

The cool thing here is that when I say an arbitrary origin, I mean arbitrary. It literally does not matter what you choose. And it just so happens that if I pick "12:00 AM today" as my origin, then the timestamp 6:32 becomes the timedelta "6 hours and 32 minutes". Then we're effectively subtracting 0 from every timestamp, taking the mean, and adding the mean to 0. Which is just adding the timestamps and dividing by N.

>>> origin = datetime(2023, 11, 9, 0, 0, 0)
>>> str(sunrise.time()), str(sunset.time())
('06:32:00', '16:35:00')
>>> str(sunrise - origin), str(sunset - origin)
('6:32:00', '16:35:00')

>>> noon = (sunset - origin + sunrise - origin) / 2
>>> str(noon)
'11:33:30'
>>> str(origin + noon)
'2023-11-09 11:33:30'
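
Here's that four-step algorithm wrapped up as a function over any number of timestamps (a sketch; mean_timestamp is my own name, not anything from the stdlib):

from datetime import timedelta

def mean_timestamp(timestamps):
    # 1. Pick an arbitrary origin; the first timestamp is as good as any.
    origin = timestamps[0]
    # 2-3. Turn every timestamp into a timedelta from the origin, then average those.
    mean_delta = sum((t - origin for t in timestamps), timedelta(0)) / len(timestamps)
    # 4. Add the mean timedelta back onto the origin.
    return origin + mean_delta

>>> mean_timestamp([sunrise, sunset])
datetime.datetime(2023, 11, 9, 11, 33, 30)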

As for how we originally found the midpoint? That's just the special case where we pick origin = sunrise:

sunrise + ((sunset - sunrise) + (sunrise - sunrise)) / 2
= sunrise + (sunset - sunrise) / 2 

I don't have any programming insight or language implementation advice based on all of this. I just think it's neat how adding affine measures is mathematically invalid but there's a loophole that makes averaging them valid. Computers might not be smart enough to use that loophole, but we can!


  1. Things are a bit different when converting to Fahrenheit, since that has a scaling factor in addition to the offset. 

Some thoughts on software expertise

7 November 2023 at 16:54

(Note, there's some important administrative stuff after the newsletter.)

I think a lot about the nature of expertise.

While I write about all sorts of software topics, I'm actually paid to do formal specification work. Historically that's meant running workshops for companies or reviewing their specs. This means that I experience "being an expert" in relation to other people who are complete beginners, and that's influenced my thinking of what "expert" means. Often beginners get stuck on small things and pitfalls like using the wrong data type or not knowing the standard way to solve a problem. So I think of an "expert" as being someone who's internalized all the standard moves and can just get on with solving the problem.

In other words, the main difference between a beginner and an expert is the expert has "seen it all before". They can quickly solve standard problems and can pattern-match new problems to old ones. But once they've built their mental models, the beginner will be able to do anything the expert can... just a lot slower.

This year I was also involved in two extremely large specification projects, one in TLA+ and one in Alloy. The projects were for very large and very complicated systems, and both took a lot more work to specify than the average production system. I joined the projects as a trainer and coach, but in both cases I was much more useful as a sort of "technique inventor". Formal specification is still a surprisingly young field and there's a lot that people don't know how to do. I had to figure out how to do it.

For example, it's generally very hard to compose two specifications together. People want to have a library of prefabbed specs they can glue together in the same way you can in programming, but conventional wisdom is that you shouldn't do this: you should write them all together as one big spec. That's usually the better approach.

But in the TLA+ project we needed to model complex interactions between several distinct components, and each component had complex internal behavior that we also needed to specify. The only way to do this is to make separate large specifications and combine them together. Except specifications don't easily compose. My role, then, was figuring out how to compose them anyway.

I solved it with something like this: first define each component's abstract state machine in a very particular style, where the state transitions were encoded as data. Then refine that into an actual component that uses action conjunction to tie implementation state updates to transitions. With that, it's possible to "compose" specs by linking their transitions with "synchronization points", which ultimately made the whole thing work.

Don't worry if that didn't make any sense— it took a while to explain to competent TLA+ practitioners too, and they got to see diagrams and actual code samples. Once we were on the same page it worked incredibly well, though. I should do a blog writeup of the technique, I think it'd be useful to a lot of people.

(Not every technique I came up with was this complicated, in fact a lot were much easier. Like in this article on optimization, I share a trick to rapidly profile optimizations: adding TLCGet("level") < N as a state constraint. It was really easy to show people but I don't think it's something a non-expert practitioner would come up with on their own.)

Inventing these techniques felt qualitatively different from the "pattern matching on problems" I think of as expertise. It's like if I learned English purely as a technical language, and then realized that "gear" sounds the same as "fear", and from that self-invented poetry. It sounds a little pretentious saying it that way, but at the same time I'd guess it's something you've all experienced, because I'm sure you're all experts in something or another.

Besides that TLA+ project, I'm also building a large system in Alloy. We're still working on it but even now it might be the largest Alloy production model, ever. Certainly the largest one to use the new Alloy 6 features! We had to figure out how to actually use all of those features at scale.

Instead of coming up with a couple of big new techniques over a whole project, it's more like finding a new small technique every day. Like "how to show the exact step in a trace where a property is violated" or "how to find all atoms that satisfied P at any point in the trace" or "how to check a spec against a bunch of different classes of initial configurations."1 It's the kind of workhorse stuff that has the potential to become best practices once I get around to writing it all up.

This aspect seems really tied in with the nature of "being an expert". I couldn't tell you what a C# expert knows about C#, but I can tell you how a C# expert can think about C#.


Newsletter Scheduling Updates

First thing, there will be a regular newsletter this Thursday, and after that I'll be in Hong Kong for a couple of weeks. Between that and Thanksgiving there won't be any more newsletters until Decemberish.

Once the newsletter resumes it will be back on a weekly update schedule. I've done the 6-a-month experiment for a year now and don't think it's sustainable long term, and it's also made it hard to write anything else. There might still be a surprise newsletter occasionally but I'm not committing to the same pace.


  1. 90% of the time the answer is functions. Lots and lots and lots of functions. 

I shouldn't be so harsh on teaching TLA+ with puzzles

2 November 2023 at 20:24

I know I'm getting the newsletter out very late this week (this kind of lackadaisical nonprofessionalism is why I don't offer paid subscriptions) so I'll keep this one short.

Over the past couple weeks I've seen a few TLA+ articles where people show how to solve the wolf-cabbage-goat puzzle. You know the one where you have a farmer who has to cross a river in a boat, but he can only take one item each trip, and has to keep the wolf from eating cabbage or the goat from eating a wolf?1 It's a common example used to introduce formal methods, and it makes me worried that people'll think FM is only useful for solving toy puzzles. I prefer to teach with examples of real-world systems.

A similar example is the SEND MORE MONEY puzzle in all the constraint solving books: assign unique digits to letters such that SEND + MORE = MONEY. I've read books that had dozens of examples, all like this, and I was left wondering what I was supposed to do with constraint solving.

Anyway, I now think I'm being too hard on these kinds of examples. Two reasons for this. First, it provides more opportunity for practice. Getting good at these things takes time and work, and you need to understand the domain to model it. So practicing TLA+ on a bunch of puzzles might be easier than practicing on a bunch of real-world systems, since you don't need to spend as much time understanding the premises of a puzzle as you do a whole system. I noticed this while learning Picat: I did so in order to solve a practical problem, but I learned and practiced by writing Petri net solvers.2

Now the other reason is a little more philosophical: what makes computation interesting? "Computation" is not the same as programming. This is programming:

>>> print("hello world!")
hello world!

And this is programming that's also computation:

>>> 100**(1/17)
1.3111339374215643

Calculating the 17th root of a number isn't something I can do myself, or even with a pocket calculator. But with a computer I can do it easily, as long as I know a programming language that makes it easy to express the computation. On the other hand, puzzles like wolf-cabbage-goat and SEND MORE MONEY are hard to express. If you want to solve them in Python, you'll have to write your own solver engine, and that'll take orders of magnitude more time than expressing the problem itself does.
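
To make that concrete, here's roughly what "writing your own solver engine" could look like for the (standard, goat-eats-cabbage) version of the puzzle — a hand-rolled breadth-first search, which ends up being most of the program. Just a sketch:

# Banks: 0 = near side, 1 = far side. State = (farmer, wolf, goat, cabbage).
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)

def safe(state):
    farmer, wolf, goat, cabbage = state
    if wolf == goat != farmer:       # wolf eats the goat
        return False
    if goat == cabbage != farmer:    # goat eats the cabbage
        return False
    return True

def moves(state):
    farmer = state[0]
    for i in range(4):               # i = 0: farmer crosses alone; i > 0: takes item i
        if state[i] != farmer:
            continue
        new = list(state)
        new[0], new[i] = 1 - farmer, 1 - farmer
        if safe(tuple(new)):
            yield tuple(new)

def solve():
    # The actual "solver engine": a plain breadth-first search over the 16 states.
    frontier, seen = [(START, [START])], {START}
    while frontier:
        state, path = frontier.pop(0)
        if state == GOAL:
            return path
        for nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))

print(solve())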

So it's interesting that the TLA+ solution can be under 20 lines. It shows that TLA+ can be used to compute things that ordinary programming languages cannot, at least not without a lot of extra programmer effort.

Learning TLA+ (or constraint solvers, or functional programming, or whatever) expands the range of computations the user can easily do, and showcasing this with puzzles is a good way to demonstrate that. Just make sure to teach actual use cases too!

As penance for my prior grognardery, here's Tower of Hanoi in Picat:

import planner.

final({Rung1, Rung2, Rung3}) =>
  Rung1 = [], Rung2 = [], sorted(Rung3).

action(S0, S1, Action, ActionCost) =>
  Action = S1,
  ActionCost = 1,
  member(From, 1..3),
  member(To, 1..3),
  S1 = copy_term(S0),
  S1[From] = [H|T],
  if S1[To] != [] then head(S1[To]) > H end,
  S1[From] := T,
  S1[To] := [H] ++ S1[To].

main =>
  Start = {[1,2,3], [], []},
  best_plan(Start, 20, Plan),
  writeln(Start),
  foreach(Action in Plan)
    writeln(Action)
  end,
  nl.

Next week will have two newsletters as normal, and then I'll be on vacation until after Thanksgiving. Have a great weekend everyone!


  1. Goats are terrifying. 

  2. There need to be more Petri net puzzles. 

Raku is surprisingly good for CLIs

26 October 2023 at 22:13

A while back I wrote Raku: a Language for Gremlins about my first experiences with the language. After three more months of using it I've found that it's quite nice for writing CLIs! This is because of a couple features:

MAIN params become options

Like many other programming languages, Raku has a special-cased MAIN function that acts as the entry point into your script. But unlike f.ex C, MAIN can have an arbitrary signature:

# strcmp.raku

sub MAIN($str1, $str2) {
  ... # blah blah blah
}

Then when you call the file:

> raku .\strcmp.raku
Usage:
  .\strcmp.raku <str1> <str2>

Raku automatically converts MAIN's parameters into script arguments! That's already pretty useful, and it gets better:

sub MAIN(
  $str1,         #= first string
  $str2 = "foo", #= second string
) {
  ... 
}
> raku .\strcmp.raku
Usage:
  .\strcmp.raku <str1> [<str2>]

    <str1>      first string
    [<str2>]    second string [default: 'foo']

So now we also have help messages and default values. To get flags and options, just define keyword parameters:

sub MAIN(
  $str1, #= first string
  $str2, #= second string
  Bool :$i, #= case insensitive?
  Int :$option #= some integer option
) {
  ...
}
Usage:
  .\strcmp.raku [-i] [--option[=Int]] <str1> <str2>

    <str1>            first string
    <str2>            second string
    -i                case insensitive?
    --option[=Int]    some integer option

Here's what the same CLI would look like in Python:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("str1", help="first string", type=str)
    parser.add_argument("str2", help="second string", type=str)
    parser.add_argument("-i", help="case insensitive?", action="store_true")
    parser.add_argument("--option", help="some integer option", type=int)

    args = parser.parse_args()

That's over twice as long to do the same thing— three times as long without the help strings. That said, there's a few special things I do with argparse that don't seem possible in Raku. But I still prefer the latter approach, especially when we throw in the other big feature:

Multimethods make subcommands easier

Raku has multiple dispatch: you can define multiple functions with the same name but different parameters, and then when you call the function it will pick the "most appropriate" method based on the supplied arguments.

# Instead of

sub myroot($val, $root = 2) { ... }

# You can write

multi myroot($val, $root) { ... }
multi myroot($val) { myroot($val, 2) }

This is already useful for defining parameter groups, like saying MAIN can accept either the --foo option or the --bar option but not both. But multimethods can also dispatch on values, like exact strings. That gives you subcommands!

multi MAIN( 
  "profile",
  $file,
  Bool :$verbose,
  :$output = $*OUT
) {
  ... 
}

multi MAIN( 
  "reformat",
  $file,
  Bool :$dry-run = False
) {
  ...
}
Usage:
  .\strcmp.raku [--verbose] [--output[=Any]] profile <file>
  .\strcmp.raku [--dry-run] reformat <file>

I use this to group a bunch of thematically-similar tools in the same file and distinguish calling them with the subcommands.

Miscellaneous useful things

There's a few small things that aren't as important as those two features, but still help with QoL.

  1. You can give flags "aliases" by writing Bool :v($verbose). Then you can call it on the CLI as either -v or --verbose.
  2. You can define the special hash %*SUB-MAIN-OPTS to configure how the arguments are parsed. F.ex if you add the :bundling key, then it will accept -abc in addition to -a -b -c.
  3. Raku provides some hooks to override CLI behavior, like if you define the ARGS-TO-CAPTURE function you can preprocess arguments before dispatching to MAIN.

One thing I've not been able to figure out is how to make variadic options, like having --src and --dest both take multiple paths. AFAICT the best you can do is make them both take flat strings and then postprocess them into paths. Maybe that'll get proper support in the future!

I wish people would do historical arguments better

24 October 2023 at 17:02

There's a common argument I hear that goes like this:

  • Pull requests were originally intended for low-trust environments, so we shouldn't use them on tight-knit teams.
  • Open source was originally coined to be palatable to corporations, so we should write Free software instead.
  • The web was originally for sharing text documents, so we shouldn't use it to run apps.
  • The wheel was originally developed for shaping clay, so we shouldn't use it to move things.

Okay I haven't seen the last one yet but it's the same kind of argument. All of these look at the historical context where a tool or practice arose, note that it's different from the current context, and conclude that it's unfit for its current purpose. I'm a firm believer in understanding the historical context of our practices, and I think that it does have weight in evaluating our current practices. But arguments based on history can be a bit misleading if you don't look deeper into the actual history. There's more work you have to do.

First of all, you have to establish what changed between then and now. Take the argument that "the web was originally for sharing documents". The World Wide Web entered the public domain in 1993. Macromedia Shockwave came out just two years later. For 93% of the web's total life, at least, the public used it as a universal platform. So saying "the historical context is different" just tells us that the "original intention" badly mispredicted how people wanted to use the internet.

You also need to understand how the current context differs and how people's usage has changed. The common historical argument against pull requests and code review:

I see pull requests as being a way of gate-keeping in low trust environments like an open source project. Why would we (developers) adopt pull requests on our teams yet we don't consider our team/workplace being a low trust environment?

I could totally believe that pull requests were originally a means of (mission-critical) "gatekeeping in low-trust environments", which are different from how companies develop software. But then I'd ask are companies using pull requests as a form of gatekeeping? Or are they adapting the form to their own needs? Like at all the places I worked, PRs were used to bundle fixes for an issue-tracker bug into one changeset to get a trail of what was fixed, when, by who. That's a really useful thing to have! Also, I can trust my team but still use PRs as a way of getting a fresh pair of eyes on my changes, to make sure I didn't do something boneheaded or introduce a subtle problem only the deployment/security/performance expert would notice. My current context is different from the historical one, but I've also adapted the tool to fit the current context.

(You could respond with "you can do that with PRs but pairing/mobbing is better for those things." And that's fine, but we should discuss that with respect to how PRs are used, not how they were originally used. Plus, if things have been adapted before, perhaps there's further adaptations we should be looking for, too?)

Reading this, I feel like there's a third thing you-gotta-do that I'm missing, but I'm not seeing it.

How to do it better

I think I did a better job with my post Why Do Interviewers Ask Linked List Questions, which argues that companies originally asked linked list questions to gauge C experience, and then we sorta just kept asking them. I never actually say "so we should stop asking them" but I really strongly implied it so it's the same kind of argument. Here's how I addressed my two objections:

  1. How did the context change? No hard evidence (history is cruel like that), but I wrote six paragraphs arguing that it was due to Joel Spolsky + Cracking the Coding Interview.
  2. Do people use it differently now? I claim the only difference between then and now is that people say it's about "CS fundamentals"/"reasoning through problems", and that these questions are bad tests of reasoning.

(Still not seeing that third thing.)

My point is that you can do a historical argument well, and you can do it poorly. So you might as well do it well.


Misc Project Updates

  • The Graph Project is going well! I've interviewed three people so far and am starting to build some theories on why we don't have graph libraries. It's also got me deeply curious as to why all modern languages do have mapping types. Like from a historical perspective, not a use-case perspective.

  • Logic for Programmers is far behind where I want it to be, and I don't think it'll actually be worth money anytime soon. My current plan is to release a free online website end of this year / early next year and then offer a PDF/print version to anyone who wants to pay.

  • I got like eight blog posts that are near- or at- first draft level, and I just need to stop being a perfectionist and finish them.

  • I've started posting "sketchpad" ideas over on the Patreon. Some of them might become blogposts or newsletters later.

Some thoughts on good spec properties

10 October 2023 at 16:27

Lots of updates this week! First of all, my talk Is Software Engineering Real Engineering is now available online! It's the video version of the crossover project. Link is to the video, slides, references, and FAQ. I know a lot of people have been waiting for this; I hope it lives up to expectations :)

Second, the TLA+ workshop is on Monday! There's still a couple slots if you want to sign up; use the code C0MPUT3RTHINGS for 15% off.

Like with any discipline, using TLA+ (and formal methods more broadly) takes a lot of different skills. One of them is choosing good properties to verify. I want to explore what we mean by "good properties". Let's take a real simple model of wire transfers:

EXTENDS TLC, Integers
CONSTANTS People, NumTransfers

Money == 1..10

(* --algorithm wire
variables
  acct \in [People -> Money];

define
  NoOverdrafts == "???"
end define;

process wire \in 1..NumTransfers
variable
  amnt \in 1..5;
  from \in People;
  to \in People;
  ready = FALSE;
begin
  Check:
    if acct[from] >= amnt then
      ready := TRUE;
      Deposit:
        acct[to] := acct[to] + amnt;
      Withdraw:
        acct[from] := acct[from] - amnt;
        ready := FALSE;
    end if;
end process;
end algorithm; *)

The bug here is that if we do two wires, they can race each other past the Check clause and leave someone with a negative balance. We want to write NoOverdrafts to catch this error.

There's two ways to do that. We can say that ready => (acct[from] >= amnt) {1}.1 We can also say that acct[from] >= 0 {2}. Notice they'll both fail for the same behavior of two concurrent wires, but {1} will fail a couple of steps earlier and give a shorter error trace. In my experience, that makes {1} closer to how we think about good programming practices: make errors appear as early as possible.

But I'd argue that {2} is a better property for specification:

  1. {1} says the bad thing is very likely to happen, but it's not guaranteed.2

  2. {1} is an internal property while {2} is external. An outside observer can't see the value of ready but they can see if acct is negative. So {2} being violated can actually have downstream effects on the external world, which makes it more important.

  3. {2} is mathematically stronger. Right now we only have one type of action in this system, Wire, but complex systems have lots of different actions. Model checkers are looking for a worst-case scenario across all sequences of actions, and that might not include a Wire action. So {2} will automatically catch things that {1} misses.

Note that writing {1} can still make sense, since it'll fail earlier with a shorter error trace. But {1} is only useful if you also have {2}. My point is that in specifications the broader properties are "more important" than narrow ones. Beginners to FM tend to write narrow properties because they're used to writing narrow tests, and that's something they have to unlearn.

Why the difference? Most people run test suites as part of their coding process, so they want the suite to run quickly. That puts lots of constraints on what you can easily check. Model checking a specification usually happens separately, so isn't as tightly time-constrained. Broader properties make sense.

On writing that, I wonder if I should think of {1} as a "regression property", kinda similar to a regression test. IE only write {1} after I see {2} fail and see the error trace, as a way of more quickly debugging regressions in the spec.

Pulling this all together. Some heuristics for good properties:

  • Represent an actual failure in the system, not just the potential for failure
  • Are observable differences in the external state, not the internal state
  • Are broad enough to cover many different narrower properties

In conclusion, I just realized I have like a million more proto-thoughts on TLA+ and the nature of specification all bouncing around inside. Maybe for a future newsletter.


  1. "If a wire is ready, its amnt is ≤ acct[from], otherwise ignore this requirement". 

  2. Wire X won't overdraft if Wire Y deposits money in between X's deposit and withdrawal. 

Picat is my favorite new toolbox language

3 October 2023 at 14:45

I'm teaching a TLA+ workshop in two weeks! Register here, use the code C0MPUT3RTHINGS for 15% off.

I always like to find new "toolbox languages". These are languages that can be used to quickly solve a set of problems with just the base language and without a whole lot of typing. That's what drove me to learn Raku, so I could quickly do things like "generate 5 random 10-character strings":

for ^5 {say ('a'..'z').roll(10).join}

One toolbox language I've had my eye on is Picat,1 a "logic-based multi-paradigm programming language aimed for general-purpose applications". I'm not sure where I first learned about it, maybe around the same time I encountered MiniZinc? The userbase is extremely small, and as with many small communities, the documentation assumes you have the same background as the inventors. But it's also an incredibly interesting language and I'd like to try explaining why.

"Logic-Based"

(Heads-up, I've only dabbled in Prolog a little and I'm sure I've got a hundred subscribers who are much better at logic programming than I'll ever be. Nonetheless I'm going to do my best to explain it.)

Logic programming is, very broadly, where you program by writing a set of equations and asking the program to find values for each variable that make all the equations true. For example, if I want to concatenate two strings in Picat, I'd write

Picat> [1,2] ++ [3, 4] = Z.

It will then try to find a matching value for Z:

Z = [1,2,3,4]
yes

The yes means that it was able to find a satisfying value for Z. The cool part is that this is bidirectional: it can also find inputs that give the output!

Picat> [1,2] ++ Y = [1,2,3,4].  
Y = [3,4]
yes

If there are multiple satisfying solutions (unifications), Picat can list them all. EX writing member(X, [1, Y]) gives both X=1 and Y=X as possible solutions.

"Multiparadigm"

Picat is, at its heart, a logic programming language. It borrows a lot of core ideas and most (most) of its syntax from Prolog. But it also adds two paradigms on top of that.

The first is constraint solving. Like logic programming, constraint solving is about finding values that match a set of equations. Unlike logic programming, we're exclusively talking numbers. Usually we're also trying to find the solution that mini/maximizes some other metric.

  % Maximize area for given perimeter

  Vars = {Len, Width},
  Vars :: 0..10, 
  Len*2+Width*2 #<= 22, 
  solve([$max(Len*Width)], Vars).  

Combining constraint solving and logic programming isn't that unusual; SWI-Prolog has the CLP(FD) module. Picat's far more unusual paradigm is imperative programming:

{1, 2} ++ {3, 4} = Z,
foreach (I in 1..Z.len)
    Z[I] := Z[I] * 2
end,

Z = {2,4,6,8}

I never thought I'd see the day where a new language advertised "we're more imperative" as a selling point, but it's actually really nice. I've had a few cases where I could mostly express a problem logically but needed to use a bit of imperative assignment to get it all working. Maybe I wouldn't need it if I was an experienced prologer. But for n00bs like me it's a great escape hatch.

My favorite feature

The planner module took me from "this could be useful" to "damn this is cool". You define an initial state, a final state, and a set of actions, and it tries to find a sequence of actions that get you from here to there. It's like a model checker in reverse!

Here's an example problem: we have a list of numbers that we add one at a time to reach a target number. If we have input [2,2,3,4] and had to get 9, we could do 2,3,4 to go 2->5->9. Now let's also add in "forbidden numbers" you cannot pass through. If 5 was forbidden, you'd instead have to do 4,2,3 and go 4->6->9.

In Picat:

import planner.

action(S, NextS, Action, ActionCost) =>
   S = {Target, Current, L, Forbidden},
   member(X, L),
   Next = Current + X,
   not member(Next, Forbidden),
   NextS = {Target, Next, delete(L, X), Forbidden},
   Action = {add, X, Next},
   ActionCost = 1,
   true.

final({Target, Current, _, _}) => 
  Target = Current.

Our state is a 4-tuple consisting of the target value, the current value, the numbers we can add (including duplicates), and the forbidden numbers. We only have one action, which picks a number, checks that adding it to Current doesn't give us a forbidden number, and then sets the values in the next state (NextS). The ActionCost needs to be assigned but doesn't matter here.

main =>
    S0 = {19, 0, [6, 6, 9, 5, 9, 2], []},
    S1 = {S0[1], S0[2], S0[3], [17]},
    plan(S0, Plan0),
    plan(S1, Plan1),
    writeln(Plan0),
    writeln(Plan1),
    nl.

We could also call plan(S0, Steps, Plan0, Cost) in which case it would only find plans up to Steps long and with a cost under Cost.

% output
[{add,6,6},{add,6,12},{add,5,17},{add,2,19}] % Plan0
[{add,6,6},{add,6,12},{add,2,14},{add,5,19}] % Plan1

How cool is that?!

This gets especially useful when you combine it with the other language features. One thing I've done is put a constraint problem inside of final and then add an action that loosens the constraints. So it first tries to solve the basic problem, and if it can't it gradually loosens the constraints until it's possible.

My use cases

Right now Picat is too unpolished to be a "general-purpose" language. The documentation is especially sparse and highly technical; I had to do a good amount of experimentation to figure out how everything works. I still have no idea how the index keyword works or how to use the actor features. I don't know if it's suited to become a general-purpose language, the kind of thing you'd write a webserver in.

But it's a fantastic toolbox language. I was originally motivated by a vacation scheduling problem: we had a set of activities we wanted to do, each with a set of restrictions. IE certain activities couldn't be done on the same day, all the things that required a car had to happen in a 24-hour span, stargazing needed a clear sky and being outside the city, stuff like that. Picat elegantly handled all the requirements and spit back schedules.2 Two other ways I've used it:

  1. Testing if a logical puzzle has a unique solution or not. I've done this in both MiniZinc and Alloy, but Picat handles this better than either. I want to try using the planning module to also test puzzle variations.
  2. I wrote a petri net simulator in <70 lines of code. Check it out here!

I hope this convinced you to check Picat out! Two resources to get started:

  1. A User's Guide to Picat: First chapter is a real barebones tutorial, rest of the book is reference material. Lots about implementation details, not much on good practice.
  2. Constraint Solving and Planning with Picat: A lot more conventional introduction, really heavily emphasizes constraint solving over other paradigms. Also written ~circa Picat 1.3, it's up to 3.5 now.

There's also a WebIDE, but it doesn't show errors.


  1. PIE-cat 

  2. I learned Picat ~2 weeks after the vacation, we did the original schedule by hand 

Was Javascript really made in 10 days?

28 September 2023 at 14:30

I once heard that Javascript has so many warts because the first version was made in just ten days. I was curious 1) if this is true, and 2) if it explains the language's warts.

After some research, I can unconfidently say: it's complicated.

The "first version" of JavaScript did in fact take ten days. The exact dates aren't confirmed, but Brendan Eich recalls it being May 6-15, 1995. But this was only a minimal prototype ("Mocha") for internal demonstration. JavaScript 1.0 was publicly released in March 1996 (pg 10) and the first "complete" version in August 1996 (ibid). Even after that point, the Netscape team regularly tweaked the design of JS; Eich recalls that "Bill Gates was bitching about us changing JS all the time" in the Fall of 1996.

Eich also had about ten years of experience with language design and compiler development, and was explicitly hired by Netscape to put a programming language in the browser (pg 7). Originally this was supposed to be Scheme, but then Netscape signed a deal with Sun and agreed to make it more "Java-like".

Does this explain the warts?

Most of JavaScript's modern flaws are arguably not due to the short development time:

  • Mocha didn't originally have implicit type conversion, but users requested Eich add that to 1.0 (video link). He deeply regrets this.
  • JS 1.0 added null to be more compatible with Java (pg 13). Java compatibility is also why typeof null = object.
  • Any JavaScript API warts must have come afterwards because all the API work happened after Mocha. Mocha's a pretty minimal language!
  • The "all numbers are floats" problem was originally in Mocha, but I think that was always the intended behavior. The JavaScript 1.0 manual cites HyperTalk as a major inspiration. I've never used HyperTalk skimming the manual suggest to me it does the same thing (pg 102, 517).

I found one way the 10-day sprint definitely did harm JavaScript: Brendan Eich didn't have time to add a garbage collector, and later attempts to add that in added a bunch of security holes (43:04).

This newsletter takes the new record for "most hours of research per word."

Formal Methods can't fix everything and that's okay

26 September 2023 at 15:00

Finally done with conferences and glad to be back in Chicago. Next big event is the TLA+ workshop on October 16, which still has slots available! The faster I can fill those up, the sooner I can stop putting advertisements in my newsletters. I don't like doing it any more than you like reading it. Use the code C0MPUT3RTHINGS for 15% off.

Keeping on the topic of TLA+, I saw this question on linkedin:

I've always wondered - is there much use of building formally verified software on top of non-formally verified systems (like the OS the software runs on)? To me, the absolute layman here, it feels like we're building proofs on top of potentially/probably flawed axioms. If we can verify Raft is a solid algorithm but it's built on non-verified networking devices do we end up putting a wrong amount of trust into the system?

This is the impossible problem of trusting trust: is formally verified code really provably-correct if it's compiled with an unverified compiler? What if the code's correct and it's running on an unverified OS? You could use a verified OS like seL4 but what about the hardware itself? What if the hardware design is proven but the manufacturing process adds a backdoor? Verifying the whole process is too hard for anybody but the richest countries. So why do formal verification at all?

Mossad gonna Mossad

In times like these, I turn to the wisdom of James Mickens.

If you haven't heard of Dr. Mickens, he's a professor of distributed systems research at Harvard. Also he writes like this:

BREAKING NEWS: I've received tenure at Harvard! I want to thank all of the enemies that I had to destroy to achieve this great honor. Roger Davis at Princeton’s department of nutrition—you questioned my research on the efficacy of an all-Pop-Tart diet, but I am living proof that the diet works. Yes, I have nose bleeds every day and my pancreas has the dysfunction of a failing Soviet client state, but I believe that having constant double vision makes me twice as optimistic about life. [3500 more words]

Basically all of his writing and talks are great but the relevant one here is This World of Ours, his rant about computer security.

If the security people are correct, then the only provably safe activity is to stare at a horseshoe whose integrity has been verified by a quorum of Rivest, Shamir, and Adleman. Somehow, I am not excited to live in the manner of a Pilgrim who magically has access to 3-choose-2 {Rivest, Shamir, Adleman}, mainly because, if I were a bored Pilgrim who possessed a kidnapping time machine, I would kidnap Samuel L. Jackson or Robocop, not mathematical wizards from the future who would taunt me with their knowledge of prime numbers and how “Breaking Bad” ends. […] My point is that security people need to get their priorities straight.

He helpfully lists the three different priorities people should worry about, which I'll painstakingly transcribe here:

Threat: Ex-girlfriend/boyfriend breaking into your email account and publicly releasing your correspondence with the My Little Pony fan club
Solution: Strong passwords

Threat: Organized criminals breaking into your email account and sending spam using your identity
Solution: Strong passwords + common sense (don't click on unsolicited herbal Viagra ads that result in keyloggers and sorrow)

Threat: The Mossad doing Mossad things with your email account
Solution: Magical amulets? Fake your own death, move into a submarine? YOU'RE STILL GONNA BE MOSSAD'ED UPON

Yes this was all an elaborate way of saying the word "threat modeling", but if I just said "threat modeling" I wouldn't have been able to introduce you to James Mickens, now would I?

Mossad gotta budget

Threat modeling is fundamentally about cost-benefit analysis, on both sides of the threat.

Attacking a weak password is really easy, so it pays to attack even low-value targets. Adopting a strong password is also easy. If you're a low-value target, this will protect you from a lot. But if you're a high-value target, attackers will be willing to spend more resources attacking you. So you'll need to invest more in your defenses, which makes fewer people able to attack you. As this escalates, both the attacks and defenses get more and more expensive, as ever smaller groups of people can afford them.

This ends up being what protects us from the Mossad. We're ultimately beneath their notice because they want to spend their resources aiming at higher-value targets. Even so, we're vulnerable to people looking for easier meals, which means we still need strong passwords and 2FA and phish-literacy and stuff.

How this ties back to formal methods

Like with security, how much we do for correctness depends on how valuable our work is, and how easy it is to "attack" it.

Most work is "low-value", and bugs aren't going to cause much damage or cost a lot of developer time. This kinds of work doesn't even need formal methods; tests, type systems, and code review in some combination are enough to deal with them.

Some work is extremely high value. Militaries want verified OSes in addition to verified code. For a long time all of the seL4 users were militaries, though I think that's changed in the past few years. I'd guess they also have really strict manufacturing standards or something to prevent supply chain attacks.

Finally, some work is high-value enough that we should verify the code/design but not valuable enough to be worth investing in a verified OS or hardware stack. There might still be bugs from the OS but {cost of OS bugs} < {cost of verifying the OS} while also {cost of verifying code} < {cost of code bugs}.1

In fact we almost never want to verify the whole app, either! Generally we only apply formal methods to a subset of the system, like a single module or a single part of the abstract design, that is most important to get right. The point is that we apply it in ways that still add value.

I think this is where FM's potential as "provably correct" harms its adoption. We know type systems and unit tests can't catch all possible errors, so we don't expect them to; we're happy with whatever errors they do catch. But we know it's possible for FM to catch everything, so we're reluctant to go only partway with it.


  1. I'm mixing verifying code and verifying designs here. I believe that often verifying code is not worth it while verifying designs is. 
