A list of ternary operators

5 November 2024 at 18:40

Sup nerds, I'm back from SREcon! I had a blast, despite knowing nothing about site reliability engineering and being in way over my head in half the talks. I'm trying to catch up on The Book and contract work now, so I'll do something silly here: ternary operators.

Almost all operations on values in programming languages fall into one of three buckets:

  1. Unary operators, where the operator goes before or after exactly one argument. Examples are x++ and -y and !bool. Most languages have a few critical unary operators hardcoded into the grammar. They are almost always symbols, but sometimes are string-identifiers (not).
  2. Binary operators, which are placed between exactly two arguments. Things like + or && or >=. Languages have a lot more of these than unary operators, because there's more fundamental things we want to do with two values than one value. These can be symbols or identifiers (and).
  3. Functions/methods that prefix any number of arguments. func(a, b, c), obj.method(a, b, c, d), anything in a lisp. These are how we extend the language, and they almost-exclusively use identifiers and not symbols.1

There's one widespread exception to this categorization: the ternary operator bool ? x : y.2 It's an infix operator that takes exactly three arguments and can't be decomposed into two sequential binary operators. bool ? x makes no sense on its own, nor does x : y.

Other ternary operators are extremely rare, which is why conditional expressions got to monopolize the name "ternary". But I like how exceptional they are and want to compile some of them. A long long time ago I asked Twitter for other ternary operators; this is a compilation of some applicable responses plus my own research.

(Most of these are a bit of a stretch.)

Stepped Ranges

Many languages have some kind of "stepped range" function:

# Python
>>> list(range(1, 10, 2))
[1, 3, 5, 7, 9]

There's the "base case" of start and endpoints, and an optional step. Many languages have a binary infix op for the base case, but a few also have a ternary for the optional step:

# Frink
> map[{|a| a*2}, (1 to 100 step 15) ] 
[2, 32, 62, 92, 122, 152, 182]

# Elixir
> IO.puts Enum.join(1..10//2, " ")
1 3 5 7 9

This isn't decomposable into two binary ops because you can't assign the range to a value and then step the value later.

Graph ops

In Graphviz, a basic edge between two nodes is either the binary node1 -> node2 or the ternary node1 -> node2 [edge_props]:

digraph G {
  a1 -> a2 [color="green"]
}

[Image: output of the above Graphviz]

Graphs seem ternary-friendly because there are three elements involved with any graph connection: the two nodes and the connecting edge. So you also see ternaries in some graph database query languages, with separate places to specify each node and the edge.

# GSQL (https://docs.tigergraph.com/gsql-ref/4.1/tutorials/gsql-101/parameterized-gsql-query)
SELECT tgt
    FROM start:s -(Friendship:e)- Person:tgt;

# Cypher (https://neo4j.com/docs/cypher-manual/current/introduction/cypher-overview/)
MATCH (actor:Actor)-[:ACTED_IN]->(movie:Movie {title: 'The Matrix'})

Obligatory plug for my graph datatype essay.

Metaoperators

Both Raku and J have special higher-order functions that apply to binary infixes. Raku calls them metaoperators, while J calls them adverbs and conjunctions.

# Raku

# `a «op» b` is map, "cycling" shorter list
say <10 20 30> «+» <4 5>
(14 25 34)

# `a Rop b` is `b op a`
say 2 R- 3
1

NB. J

NB. x f/ y creates a "table" of x f y
   1 2 +/ 10 20
11 21
12 22

The Raku metaoperators are closer to what I'm looking for, since I don't think you can assign the "created operator" directly to a callable variable. J lets you, though!

   h =: +/
   1 2 h 3 4
4 5
5 6

That said, J has some decomposable constructs that feel spiritually like ternaries, like amend and fold. It also has a special ternary-ish construct called the "fork".3 x (f g h) y is parsed as (x f y) g (x h y):

NB. Max - min
   5 (>. - <.) 2
3
   2 (>. - <.) 5
3

So at the top level that's just a binary operator, but the binary op is constructed via a ternary op. That's pretty cool IMO.

Assignment Ternaries

Bob Nystrom points out that in many languages, a[b] = c is a ternary operation: it is not the same as x = a[b]; x = c.
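To see this concretely in Python (my own illustration): the subscript assignment is a single three-argument operation, while the "decomposed" version reads a value and then rebinds a name without ever touching a.

# Python
a = {"k": 1}

a["k"] = 2    # one ternary op: type(a).__setitem__(a, "k", 2); a is now {"k": 2}

x = a["k"]    # type(a).__getitem__(a, "k")
x = 3         # rebinds the name x; a is still {"k": 2}
print(a)      # {'k': 2}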

A weirder case shows up in Noulith and Raku (again): update operators. Most languages have the += binary operator; these two have the f= ternary operator. a f= b is the same as a = f(a, b).

# Raku
> my $x = 2; $x max= 3; say $x
3

Arguably this is just syntactic sugar, but I don't think it's decomposable into binary operations.

Custom user ternaries

Tikhon Jelvis pointed out that Agda lets you define custom mixfix operators, which can be ternary or even tetranary or pentanary. I later found out that Racket has this, too. Objective-C's method syntax looks similar, but feels different somehow.

Near Misses

All of these are arguable, I've just got to draw a line in the sand somewhere.

  • Regular expression substitutions: s/from/to/flags seems like a ternary, but I'd argue it's a datatype constructor, not an expression operator.
  • Comprehensions like [x + 1 | x <- list]: looks like the ternary [expr1 | expr2 <- expr3], but expr2 is only binding a name. Arguably a ternary if you can map and filter in the same expression a la Python or Haskell, but should that be considered sugar for a separate map and filter? (See the Python sketch after this list.)
  • Python's operator chaining (1 < x < 5): syntactic sugar for 1 < x and x < 5.
  • Someone suggested glsl swizzles, which are very cool but binary operators.
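For the comprehension case, here's the Python version (my own sketch): mapping and filtering in one expression, next to the chained filter-then-map it arguably desugars to.

# Python
xs = [1, 2, 3, 4, 5]

# Map and filter in a single comprehension...
comp = [x + 1 for x in xs if x % 2 == 0]

# ...versus the chained filter-then-map it arguably desugars to.
decomposed = list(map(lambda x: x + 1, filter(lambda x: x % 2 == 0, xs)))

print(comp, decomposed)  # [3, 5] [3, 5]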

Why are ternaries so rare?

Ternaries are somewhat more common in math and physics, f.ex in integrals and sums. That's because they were historically done on paper, where you have a 2D canvas, so you can do stuff like this easily:

10
Σ    n
n=0

We express the ternary by putting arguments above and below the operator. All mainstream programming languages are linear, though, so any given symbol has only two sides. Plus functions are more regular and universal than infix operators so you might as well write Sum(n=0, 10, n). The conditional ternary slips through purely because it's just so darn useful. Though now I'm wondering where it comes from in the first place. Different newsletter, maybe.

But I still find ternary operators super interesting, please let me know if you know any I haven't covered!


Blog Rec

This week's blog rec is Alexis King! Generally, Alexis's work spans the theory, practice, and implementation of programming languages, aimed at a popular audience and not an academic one. If you know her for one thing, it's probably Parse, don't validate, which is now so mainstream most people haven't read the original post. Another good one is about modeling open-world systems with static types.

Nowadays she is far more active on Programming Languages Stack Exchange, where she has blog-length answers on reading type notations, compiler design, and why arrows.


  1. Unless it's a lisp. 

  2. Or x if bool else y, same thing. 

  3. I say "ish" because trains can be arbitrarily long: x (f1 f2 f3 f4 f5) y is something I have no idea how to parse.

TLA from first principles

22 October 2024 at 17:14

No Newsletter next week

I'll be speaking at USENIX SRECon!

TLA from first principles

I'm working on v0.5 of Logic for Programmers. In the process of revising the "System Modeling" chapter, I stumbled on a great way to explain the Temporal Logic of Actions that TLA+ is based on. I'm reproducing that bit here with some changes to fit the newsletter format.

Note that by this point the reader has already encountered property testing, formal verification, decision tables, and nontemporal specifications, and should already have a lot of practice expressing things as predicates.


The intro

We have some bank users, each with an account balance. Bank users can wire money to each other. We have overdraft protection, so wires cannot reduce an account value below zero.

For the purposes of introducing the ideas, we'll assume an extremely simple system: two hardcoded variables alice and bob, both start with 10 dollars, and transfers are only from Alice to Bob. Also, the transfer is totally atomic: we check for adequate funds, withdraw, and deposit all in a single moment of time. Later [in the chapter] we'll allow for multiple nonatomic transfers at the same time.

First, let's look at a valid behavior of the system, or possible way it can evolve.

alice   10 ->  5 -> 3  -> 3  -> ...
bob     10 -> 15 -> 17 -> 17 -> ...

In programming, we'd think of alice and bob as variables that change. How do we represent those variables purely in terms of predicate logic? One way is to instead think of them as arrays of values. alice[0] is the initial state of alice, alice[1] is after the first time step, etc. Time, then, is "just" the set of natural numbers.

Time  = {0, 1, 2, 3, ...}
alice = [10, 5, 3, 3, ...]
bob   = [10, 15, 17, 17, ...]

In comparison to our valid behavior, here are some invalid behaviors:

alice = [10, 3,  ...]
bob   = [10, 15, ...]

alice = [10, -1, ...]
bob   = [10, 21, ...]

The first is invalid because Bob received more money than Alice lost. The second is invalid because it violates our proposed invariant, that accounts cannot go negative. Can we write a predicate that is true for valid transitions and false for our two invalid behaviors?

Here's one way:

Time = Nat // {0, 1, 2, etc}

Transfer(t: Time) =
  some value in 0..=alice[t]:
    1. alice[t+1] = alice[t] - value
    2. bob[t+1] = bob[t] + value

Go through and check that this is true for every t in the valid behavior and false for at least one t in each invalid behavior. Note that the steps where Alice doesn't send a transfer also pass Transfer; we just pick value = 0.
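To make that checking concrete, here's a quick Python sketch (my own illustration, not from the book) that treats each behavior as a plain list and tests Transfer at each step:

# Python
def transfer_ok(alice, bob, t):
    """True if the step from t to t+1 transfers some value in 0..=alice[t]."""
    return any(
        alice[t + 1] == alice[t] - value and bob[t + 1] == bob[t] + value
        for value in range(0, alice[t] + 1)
    )

valid_alice,   valid_bob   = [10, 5, 3, 3], [10, 15, 17, 17]
invalid_alice, invalid_bob = [10, 3], [10, 15]   # Bob gains more than Alice loses

print(all(transfer_ok(valid_alice, valid_bob, t) for t in range(3)))      # True
print(all(transfer_ok(invalid_alice, invalid_bob, t) for t in range(1)))  # False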

I can now write a predicate that perfectly describes a valid behavior:

Spec = 
  1. alice[0] = 10
  2. bob[0]   = 10
  3. all t in Time:
    Transfer(t)

Now allowing "nothing happens" as "Alice sends an empty transfer" is a little bit weird. In the real system, we probably don't want people to constantly be sending each other zero dollars:

Transfer(t: Time) =
- some value in 0..=alice[t]:
+ some value in 1..=alice[t]:
    1. alice[t+1] = alice[t] - value
    2. bob[t+1] = bob[t] + value

But now there can't be a timestep where nothing happens. And that means no behavior is valid! At every step, Alice must transfer at least one dollar to Bob. Eventually there is some t where alice[t] = 0 && bob[t] = 20. Then Alice can't make a transfer, Transfer(t) is false, and so Spec is false.1

So typically when modeling we add a stutter step, like this:

Spec =
  1. alice[0] = 10
  2. bob[0]   = 10
  3. all t in Time:
    || Transfer(t)
    || 1. alice[t+1] = alice[t]
       2. bob[t+1] = bob[t]

(This is also why we can use infinite behaviors to model a finite algorithm. If the algorithm completes at t=21, t=22,23,24... are all stutter steps.)

There's enough moving parts here that I'd want to break it into subpredicates.

Init =
  1. alice[0] = 10
  2. bob[0]   = 10

Stutter(t) =
  1. alice[t+1] = alice[t]
  2. bob[t+1] = bob[t]

Next(t) = Transfer(t) // foreshadowing

Spec =
  1. Init
  2. all t in Time:
    Next(t) || Stutter(t)

Now finally, how do we represent the property NoOverdrafts? It's an invariant that has to be true at all times. So we do the same thing we did in Spec, write a predicate over all times.

property NoOverdrafts =
  all t in Time:
    alice[t] >= 0 && bob[t] >= 0

We can even say that Spec => NoOverdrafts, ie if a behavior is valid under Spec, it satisfies NoOverdrafts.

One of the exercises

Modify the Next so that Bob can send Alice transfers, too. Don't try to be too clever, just do this in the most direct way possible.

Bonus: can Alice and Bob transfer to each other in the same step?

Solution [in back of book]: We can rename Transfer(t) to TransferAliceToBob(t), write the converse as a new predicate, and then add it to Next. Like this:

TransferBobToAlice(t: Time) =
  some value in 1..=bob[t]:
    1. alice[t+1] = alice[t] + value
    2. bob[t+1] = bob[t] - value

Next(t) =
  || TransferAliceToBob(t)
  || TransferBobToAlice(t)

Now, can Alice and Bob transfer to each other in the same step? No. Let's say they both start with 10 dollars and each try to transfer five dollars to each other. By TransferAliceToBob we have:

1. alice[1] = alice[0] - 5 = 5
2. bob[1] = bob[0] + 5 = 15

And by TransferBobToAlice, we have:

1. bob[1] = bob[0] - 5 = 5
2. alice[1] = alice[0] + 5 = 15

So now we have alice[1] = 5 && alice[1] = 15, which is always false.

Temporal Logic

This is good and all, but in practice, there's two downsides to treating time as a set we can quantify over:

  1. It's cumbersome. We have to write var[t] and var[t+1] all over the place.
  2. It's too powerful. We can write expressions like alice[t^2-5] = alice[t] + t.

Problem (2) might seem like a good thing; isn't the whole point of logic to be expressive? But we have a long-term goal in mind: getting a computer to check our formal specification. We need to limit the expressivity of our model so that we can make it checkable.

In practice, this will mean making time implicit to our model, instead of explicitly quantifying over it.

The first thing we need to do is limit how we can use time. At a given point in time, all we can look at is the current value of a variable (var[t]) and the next value (var[t+1]). No var[t+16] or var[t-1] or anything else complicated.

And it turns out we've already seen a mathematical convention for expressing this: priming!2 For a given time t, we can define var to mean var[t] and var' to mean var[t+1]. Then Transfer(t) becomes

Transfer =
  some value in 1..=alice:
    1. alice' = alice - value
    2. bob' = bob + value

Next we have the construct all t in Time: P(t) in both Spec and NoOverdrafts. In other words, "P is always true". So we can add always as a new term. Logicians conventionally use □ or [] to mean the same thing.3

property NoOverdrafts =
  always (alice >= 0 && bob >= 0)
  // or [](alice >= 0 && bob >= 0)

Spec =
  Init && always (Next || Stutter)

Now time is almost completely implicit in our spec, with just one exception: Init has alice[0] and bob[0]. We just need one more convention: if a variable is referenced outside of the scope of a temporal operator, it means var[0]. Since Init is outside of the [], it becomes

Init =
  1. alice = 10
  2. bob = 10

And with that, we've removed Time as an explicit value in our model.

The addition of primes and always makes this a temporal logic, one that can model how things change over time. And that makes it ideal for modeling software systems.

Modeling with TLA+

One of the most popular specification languages for modeling these kinds of concurrent systems is TLA+. TLA+ was invented by the Turing award-winner Leslie Lamport, who also invented a wide variety of concurrency algorithms and LaTeX. Here's our current spec in TLA+:

---- MODULE transfers ----
EXTENDS Integers

VARIABLES alice, bob
vars == <<alice, bob>>

Init ==
  alice = 10 
  /\ bob = 10

AliceToBob ==
  \E amnt \in 1..alice:
    alice' = alice - amnt
    /\ bob' = bob + amnt

BobToAlice ==
  \E amnt \in 1..bob:
    alice' = alice + amnt
    /\ bob' = bob - amnt

Next ==
  AliceToBob
  \/ BobToAlice

Spec == Init /\ [][Next]_vars

NoOverdrafts ==
  [](alice >= 0 /\ bob >= 0)

====

TLA+ uses ASCII versions of mathematicians' notation: /\ and \/ for && and ||, \A and \E for all/some, etc. The only thing that's "unusual" (besides == for definition) is the [][Next]_vars bit. That's TLA+ notation for [](Next || Stutter): Next or Stutter always happens.


The rest of the chapter goes on to explain model checking, PlusCal (for modeling nonatomic transactions without needing to explain the exotic TLA+ function syntax), and liveness properties. But this is the intuition behind the "temporal logic of actions": temporal operators are operations on the set of points of time, and we restrict what we can do with those operators to make reasoning about the specification feasible.

Honestly I like it enough that I'm thinking of redesigning my TLA+ workshop to start with this explanation. Then again, maybe it only seems good to me because I already know TLA+. Please let me know what you think about it!

Anyway, the new version of the chapter will be in v0.5, which should be out mid-November.


Blog Rec

This one is really dear to me: Metadata, by Murat Demirbas. When I was first trying to learn TLA+ back in 2016, his post on using TLA+ in a distributed systems class was one of, like... three public posts on TLA+. I must have spent hours rereading that post and puzzling out this weird language I stumbled into. Later I emailed Murat with some questions and he was super nice in answering them. Don't think I would have ever grokked TLA+ without him.

In addition to TLA+ content, a lot of the blog is also breakdowns of papers he read— like the morning paper, except with a focus on distributed systems (and still active). If you're interested in learning more about the science of distributed systems, he has an excellent page on foundational distributed systems papers. But definitely check out his deep readings, too!


  1. In the book this is presented as an exercise (with the solution in back). The exercise also clarifies that since Time = Nat, all behaviors have an infinite number of steps. 

  2. Priming is introduced in the chapter on decision tables, and again in the chapter on database invariants. x' is "the next value of x", so you can use it to express database invariants like "jobs only move from ready to started or aborted." 

  3. I'm still vacillating on whether I want a "beyond logic" appendix that covers higher order logic, constructive logic, and modal logic (which is what we're sneakily doing right now!)

    While I'm here, this explanation of always as all t in Time isn't 100% accurate, since it doesn't explain why things like [](P => []Q) or <>[]P make sense. But it's accurate in most cases and is a great intuition pump. 

My patented Miracle Tonic would have prevented the CrowdStrike meltdown

23 July 2024 at 15:22

Last Friday CrowdStrike did something really bad and it destroyed every airport in the world. I didn't bother to learn anything else about it because I was too busy writing my 10k whitepaper about how all the problems were all caused by one simple mistake: not drinking my patented Miracle Tonic™®.

Developers who drink my Miracle Tonic write code 100% faster with 100% fewer bugs. This would have prevented the CrowdStrike outage, the 2016 DNC hack, Ariane 5, Therac-25, and that one moth caught in the Harvard Mark II. Developers are also happier at work, suffer no burnout, and keep all standups to exactly five minutes.

The Miracle Tonic is so effective that it should be immoral not to drink it. It's like if surgeons didn't wash their hands. If you write code for a living and you don't drink my Miracle Tonic, do us all a favor and never touch a computer again. You idiot. You absolute moron. I can't believe you call yourself a "professional".

Frequently Asked Questions:

Do you have any actual evidence that Miracle Tonic actually helps programmers?

Yes I do! All sorts of studies prove the effectiveness of Miracle Tonic:

  1. One survey found that 68% of devs drinking miracle tonic say their code is "awesome". That means it works!
  2. A double-blind clinical trial found that 7 undergraduates who were given Miracle Tonic wrote 10% better code (using the Hills-Bourne-Davidio Corrected Dolorimetry metric) than the control group (6 undergraduates who were punched in the face).
  3. Someone found twelve projects on GitHub that didn't use Miracle Tonic and they had 268% worse failure rates than this one project I did with it. That's a P value of 0.00004!

That's so many studies! I can say with 100% confidence that Miracle Tonic is proven to work beyond a shadow of a doubt.

I read a study saying that Miracle Tonic gives people headaches.

Why are you trusting studies? Are you really gonna listen to some graduate student dweeb who's never been a real programmer?! If you're curious about Miracle Tonic, just try it and see for yourself.

Are there any downsides to drinking Miracle Tonic?

Of course, there is no silver bullet, everything is tradeoffs, etc etc etc. The downside of Miracle Tonic is it doesn't work if your company is a toxic feature factory that cares more about micromanaging mindless worker drones than cultivating good engineers. And don't drink it if you're allergic to peanuts.

This tastes revolting.

I've trained five Miracle Tonic Brand Ambassadors and they all tell me it's delicious. Your taste buds must be wrong.

I tried drinking your Miracle Tonic and it didn't make me a better programmer.

How dare you. How dare you spread FUD about the most important programming technology ever made. Were you drinking exactly 12.063 mL at 67° C every 75 minutes? Yeah, thought not. Of course it's not going to work if you didn't follow it properly. And you'd know this if you took my $1500 three-day "how to drink Miracle Tonic" course. Which you didn't. Get out of my sight.

How does Miracle Tonic compare to competing products, like toad oil, Pirelli's Miracle Elixir, or Design-by-Contract?

Fucking charlatans, all of them.


This is the part of my job I dread.

I do formal methods for a living. Last year 100% of my income came from either writing specifications or training others how to write specifications. This year, due to a lot of work in diversifying my revenue streams, it will only be 90% of my income. This creates an immense pressure to become an ambulance chaser, to see ten billion dollars in economic damage as a chance to say "look at me!", a longshot chance to find the next source who'll pay my next month's rent.

I'm also a True Believer. Of course formal methods would have prevented the CrowdStrike outage! It also would have prevented COVID-19, the Cuban Missile Crisis and the Lincoln assassination. Years of being an advocate have shaped my worldview to always see how it could have helped, how everything is just the right shape nail for the hammer I'm selling you.

None of this depends on anything particular to formal methods, which is why advocates of every stripe are telling us "what they should have done", because every advocate is a true believer, and every advocate wants to get paid. I understand this all, and I get why people do this, but I hate feeling this way, that all this misery begets opportunity. I hate the pressure to say "they needed X" as fast as possible, before people lose interest, regardless of whether X would help or even if they were already using X. I hate the feeling that capitalizing on this might compromise my principles.

Most of all though, I fear the slippery slope from advocating to grifting. There are people out there who saw the outage and got excited. Please, God, let me never be that.

Keep perfecting your config

16 July 2024 at 19:18

First of all, I wanted to extend a huge and heartfelt thank you to all of the people who bought Logic for Programmers. Seeing the interest is incredible motivation to continue improving it. If you read it and have feedback, please share it with me!

Second, I have a new blogpost out: Toolbox Languages. It's about five obscure languages I use that make certain hard problems I have much easier. I hope it encourages other people to talk about their toolbox languages (or share new ones with me!). Patreon blog notes (and the cut sixth language) here.

Third, actual newsletter.

Keep perfecting your config

Last week I read the article stop perfecting your config, which argues that people spend too much time tinkering with "editors, IDEs, terminals [and] other applications". I firmly believe the opposite, that people should spend more time tinkering. To be fair, the OP says

Before I continue, let’s make one thing clear. By no means am I saying that there is something wrong with trying to master your tools or attempting to adjust them to suit your specific needs. It’s actually the opposite.

So I read this more as "don't make my mistakes" than "don't do what I don't like."1 At the same time, it's notable what "his mistakes" were— what he counted as perfecting his configs. Reading the article, he mentions Neovim keymaps once, plugins once, and color schemes four times.

There's an idea in programming language design called Wadler's Law, which is that the majority of discussion is on syntax and not semantics. I imagine this applies to configuration too: the majority of time spent "tinkering" is on layout and coloring and not on changes to the tool's behavior. This is a trap. You get the most value out of "configuration" when you use it to improve your workflow.

Here's a Neovim script I wrote a while back:

function LoadLocal(local_task)
  vim.b.local_task = local_task
end

function RunLocal()
  vim.cmd(vim.b.local_task)
end

vim.cmd [[command! -nargs=1 LoadLocal call v:lua.LoadLocal(<f-args>)]]
vim.keymap.set('n', 'gxl', RunLocal, {desc="Buffer Task"})

On calling :LoadLocal foo it loads foo as the local command for that specific buffer. Then, typing gxl executes foo. The command can be anything, up to and including shelling out to !rm *.

Oh, also any other script can set vim.b.local_task.2 LoadLocal can be used by the rest of my configs. When I open an XML file it sets the local task to the appropriate transformation script.

Those eight lines have probably saved me at least eight hours of drudgework over their lifespan. The full taskrunner has saved even more. That's the kind of customization that's most useful.

Anything that goes for an IDE goes double for the shell. I use a ton of shell scripts.

Customization teaches you

A common argument against customization is that you shouldn't do it until you know the tooling basics well. For example, you shouldn't install a filetree plugin for neovim because it comes with Netrw. In other words, there's a difference between the skill of using your tools and the skill of customizing them, and the first is more important than the second.

But just as often customization promotes mastery, by smoothing out the starting friction with using a feature. Say you're trying to get more comfortable with vim macros. Here's a keymapping that could help:

vim.keymap.set('n', "<leader>Q", [[:let @q = input("Edit macro:", @q)<CR>]])

Vim macros are stored in a text register: recording the @q macro will overwrite whatever you copied to the "q register. This means that you can treat a macro as text. input will fill a prompt with the @q macro text, let you edit it as text, and then save your changes back to the register.

This makes it so much easier to experiment with macros! If you screw up a macro while recording, you don't need to start over, you can just finish and make an edit afterwards. If you try a finished macro and it does something wrong, same thing, just undo and edit it.

Reasons not to customize

I've heard two other reasons why not to heavily customize your system:

  1. It'll be harder for other people to pair with you on your machine
  2. It'll be harder for you to work on machines without the customizations: servers you SSHed into, VMs, other people's computers, etc.

(1) is only a problem if you're modifying default keybindings or adding surprising new ones. I deal with it by having two editors: a heavily customized Neovim for solo work, a vanilla VSCode (plus plugins) for collaboration.

(2) is a bigger problem. I find the problem is how it messes with muscle memory. In vim I've mapped H/L to beginning/end of line when normally they mean "first/last line in view", and it always screws me up. It is so immensely frustrating. Other than that, though, I generally can tolerate losing my config for a bit. Most of the stuff that's most useful to me are things that aren't useful when I'm remoting, so I don't miss them.

One good reason not to customize that I haven't heard: you can easily break stuff that you don't know how to fix. This is more true with stuff like vim and shell that allow arbitrary scripts as config, less so with VSCode or the new crop of modal editors.


  1. Also, he wrote positively about configuration later.

  2. I swear to God if this leads to someone targeting an exploit at me I will quit programming forever 

Logic for Programmers now in early access!

8 July 2024 at 17:33

I am delighted to announce that Logic for Programmers is now available for purchase! While still in early access, it's almost 20,000 words, has 30 exercises, and covers a wide variety of logic applications:

  • Property testing
  • Functional correctness and contracts
  • Formal verification
  • Decision tables
  • Database constraints
  • Data modeling and Alloy
  • Constraint solving
  • Logic programming

I want to emphasize that the book is early access, meaning it's not close to done yet. I'm releasing it now so I can get feedback from readers and use that to decide what I write next. But I'm only willing to release it because I think, even in its current state, it's worth reading. I'm just promising it'll be even better in a few months.

What's next

I think exercise solutions are broken on epub, so step one is getting that fixed.

After that, I'm taking a short break. I put basically all of June into working on this and need to get back to blogging and client acquisition. Plus I need a bit of time to gather reader feedback.

I'm thinking of new major releases being either every two weeks or every month or so. I'll upload new versions of the book between releases and then send a Leanpub publication announcement when we hit the next milestone.

PS

There's a free book coupon on the Patreon. Enough people have joined that I feel obligated to start posting more original content there why does this keep happening to me

Solving a math problem with planner programming

2 July 2024 at 15:46

The deadline for the logic book is coming up! I'm hoping to have it ready for early access by either the end of this week or early next week. During a break on Monday I saw this interesting problem on Math Stack Exchange:

Suppose that at the beginning there is a blank document, and a letter "a" is written in it. In the following steps, only the three functions of "select all", "copy" and "paste" can be used.

Find the minimum number of steps to reach at least 100,000 a's (each of the three operations of "select all", "copy" and "paste" is counted as one step). If the target number is not specified, and I want to get the exact amount of a, is there a general formula?

The first two answers look for analytic solutions. The last answer shares a C++ program that finds it via breadth-first search. I'll reproduce it here:

#include <iostream>
#include <queue>

enum Mode
{
    SELECT,
    COPY,
    PASTE
};

struct Node
{
    int noOfAs;
    int steps;
    int noOfAsCopied;
    Mode mode;
};

int main()
{
    std::queue<Node> q;

    q.push({1, 0, 0, SELECT});

    while (!q.empty())
    {
        Node n = q.front();
        q.pop();

        if (n.noOfAs >= 100000)
        {
            std::cout << n.steps << std::endl;
            break;
        }

        switch (n.mode)
        {
        case SELECT:
            q.push({n.noOfAs, n.steps + 1, n.noOfAsCopied, COPY});
            break;
        case COPY:
            q.push({n.noOfAs, n.steps + 1, n.noOfAs, PASTE});
            break;
        case PASTE:
            q.push({n.noOfAs, n.steps, n.noOfAsCopied, SELECT});
            q.push({n.noOfAs + n.noOfAsCopied, n.steps + 1, n.noOfAsCopied, PASTE});
            break;
        }
    }

    return 0;
}

This is guaranteed to find a shortest possible solution due to a fun property of BFS: the distance of nodes to the origin never decreases. If you evaluate Node Y after Node X, then Y.dist >= X.dist, meaning that the first valid solution will be a shortest possible solution. I should make this into a logic book example!

This also has the drawback of preventing us from using an insight. We should be able to fuse the select and copy steps together, meaning instead of having three actions (select, copy, paste) we only need two (selectcopy, paste), where selectcopy takes twice as many steps as pasting.

But we can't make that optimization because it breaks monotonicity. We're now pushing a mix of n+1 and n+2 steps onto the queue, and there's no way to order things to guarantee all of the n+1 steps are searched before any n+2 step.
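As an aside (my own sketch, not from the post or the Stack Exchange thread): the fused action does work in an imperative search if you order the frontier by cost rather than by insertion order, Dijkstra-style.

# Python
import heapq

def min_steps(target=100_000):
    frontier = [(0, 1, 0)]              # (steps, chars, clipboard)
    seen = set()
    while frontier:
        steps, chars, clip = heapq.heappop(frontier)   # always the cheapest state
        if chars >= target:
            return steps
        if (chars, clip) in seen:
            continue
        seen.add((chars, clip))
        heapq.heappush(frontier, (steps + 2, chars, chars))            # fused select+copy
        if clip > 0:
            heapq.heappush(frontier, (steps + 1, chars + clip, clip))  # paste

print(min_steps())  # 42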

I thought I'd try to solve it with a planning language instead, so we can get both the elegant solution and the optimization.

Planning

The rough idea of planning is that you provide an initial state, a set of actions, and a target, and the tool finds the shortest sequence of actions that reaches the target. I've written about it in-depth here and also a comparison of planning to model checking here. I like how some tough problems in imperative and functional paradigms become easy problems with planning.

This is all in Picat, by the way, which I've talked about more here and in the planning piece. I'll just be explaining the planning stuff specific to this problem.

import planner.
import util.

main =>
  Init = $state(1, 0) % one a, nothing copied
  , best_plan(Init, Plan, Cost)
  , nl
  , printf("Cost=%d%n", Cost)
  , printf("Plan=%s%n", join([P[1]: P in Plan], " "))
  .

We're storing the state of the system as two integers: the number of characters printed and the number of characters on our clipboard. Since we'll be fusing selects and copies, we don't need to also track the number of characters selected (unlike the C++).

final(state(A, _)) => A >= 100000.

action(state(A, Clipboard), To, Action, Cost) ?=>
  NewA = A + Clipboard
   , To = $state(NewA, Clipboard)
   , Action = {"P", To}
   , Cost = 1
   .

The paste action just adds the clipboard to the character count. Because Picat is a research language it's a little weird with putting expressions inside structures. If we did $state(1 + 1) it would store it as literally $state(1 + 1), not state(2).

Also you have to use dollar signs for definitions but no dollar signs for pattern matching inside a function definition. I have no idea why.

action(state(A, Clipboard), To, Action, Cost) ?=>
  To = $state(A, A)
   , Action = {"SC", To}
   , Cost = 2
   .

And that's it! That's the whole program. Running this gives us:

Cost=42
Plan=SC P P SC P P SC P P SC P P SC P P SC 
     P P SC P P SC P P SC P P P SC P P P

To find if there's a sequence that gets us exactly 100,000, we just need to make one change:

- final(state(A, _)) => A >= 100000.
+ final(state(A, _)) => A = 100000.

This returns a cost of 43.

On the other hand, I can't get it to find a path that makes exactly 100,001 characters, even with some optimizations. This is because the shortest path is over 9000 steps long! I haven't checked if the C++ BFS can find it.

Metaplanning

One reason planning fascinates me so much is that, if a problem is now easy, you can play around with it. Like if I wanted to add "delete a character" as a move, that's easy:

action(state(A, Clipboard), To, Action, Cost) ?=>
  A > 0
  , NewA = A - 1
  , To = $state(NewA, Clipboard)
  , Action = {"D", To}
  , Cost = 1
  .

This doesn't make exceeding or reaching 100,000 easier, but it makes reaching 100,001 take 47 steps instead of 9000.

With some tweaks, I can also ask questions like "what numbers does it make the most easier?" or "Do some numbers have more than one shortest path? Which number has the most?"

Planning is really cool.

A brief introduction to interval arithmetic

25 June 2024 at 16:08

You've got a wall in your apartment and a couch. You measure the wall with a ruler and get 7 feet, then you measure the couch and get 7 feet. Can you fit the couch against that wall?

Maybe. If the two measurements are exactly 7 feet then sure, 7 ≤ 7. But you probably didn't line your measuring stick up perfectly with the couch, and also the stick itself might be a tiny bit long. There's some uncertainty around the edges, say 1/10th of a foot.

Instead of treating each as exactly 7 feet, we can instead say that each is somewhere between a minimum of 6.9 feet and a maximum of 7.1. We can write this as an interval (6.9, 7.1). From there, we can say interval (a, b) is "definitely less than" another interval (c, d) if b < c, ie the first interval ends before the second starts. Since 7.1 !< 6.9, we can't say for certain that the couch will fit against the wall. Your couch could actually be 7.1 feet and the wall 6.9 feet, or 7.05 and 7.01 feet, anything like that.

More arithmetic

When adding/multiplying/subtracting/dividing an interval by a scalar (single number), we just apply the op to both ends. I'll be using frink because it has a built-in interval type.

  i[a, b] := new interval [a, b]
  i[2, 4] + 2
[4, 6]

  i[2, 4] * 2
[4, 8]

Operating on two intervals is more complicated. The general principle is "pick one value from each interval such that the operation gives the minimum possible value, and do the same for the maximum value". If you measure two lengths as 3 ± 1 foot, then the length of both of them together must be 6 ± 2 feet. We can get this by just adding the minimums and maximums of the intervals together.

  i[1, 3] + i[2, 4]
[3, 7]

If we measure a rectangle's length as (5, 7) and its width as (3, 4), what's the area? Multiplication seems like it'd be the same as addition, just multiply the two minimums and the two maximums together:

  i[5, 7] * i[3, 4]
[15, 28]

But this breaks down when intervals cover negative numbers. If we just multiplied minimums and maximums, (-3, 3) * (-3, 3) would give us (9, 9). But we instead get the true minimum by picking -3 from the first interval and 3 from the second, which gives us -9. So the real interval is (-9, 9).

What about division? It works kinda like multiplication, except if the dividing interval spans 0, the result is undefined. Overall, interval arithmetic is like regular arithmetic except with more headaches.
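Here's a minimal Python sketch of the ideas so far (mine, not how Frink implements it): add and multiply on intervals, plus the "definitely less than" check from the couch example. The multiply takes the min/max over all four endpoint products, and it treats its two operands as independent, which matters in the next section.

# Python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # With negative endpoints the extremes can come from any pairing.
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

    def definitely_lt(self, other):
        # (a, b) is definitely less than (c, d) only if b < c
        return self.hi < other.lo

print(Interval(1, 3) + Interval(2, 4))                       # Interval(lo=3, hi=7)
print(Interval(-3, 3) * Interval(-3, 3))                     # Interval(lo=-9, hi=9)
print(Interval(6.9, 7.1).definitely_lt(Interval(6.9, 7.1)))  # False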

Trouble in paradise

Now, how about (-3, 3)²? On first thought we say (-3, 3)² = (-3, 3)*(-3, 3), which gives us the same interval (-9, 9). But again, that's wrong. In (-3, 3)*(-3, 3), we pick one number from each interval and multiply them together, while in (-3, 3)^2 we pick one number from the interval and multiply it by itself. So the actual interval is (0, 9).

  i[-3, 3]^2
[0, 9]

But wait, doesn't that mean that x*x != x^2? Yes, it does. Err no, it doesn't. Maybe. It depends on what we mean by "x".

The two interpretations

  1. x is some value in the interval (-3, 3), but we don't know what it is. Then x*x is the same value both times, so it should be (0, 9).
  2. x is "smeared out" over the entire interval. Then x*x can be treated as two separate intervals, giving us (-9, 9).

Now usually people use intervals to mean [1], but most interval arithmetic systems I've seen do [2], or at least a mix of both with a bias towards [2].

There's two reasons why. First of all, you can reason about [2] locally. Consider the equation y = sin(x)*cos(x) where x is the interval (0, 2π).1 Figuring out y under [2] is easy: it's just (-1*1, 1*1) = (-1, 1). But this is impossible under [1], because sin is only ±1 when cos is 0! The actual interval under [1] is (-0.5, 0.5).

Notice that (-0.5, 0.5) is a subinterval of (-1, 1). That's the other reason to assume [2] in calculations: the result will always cover [1].

Nonetheless [2] leads to lots of weird behavior, like x*x != x^2. This is called overdetermination and leads to overestimation, where your interval bounds are too wide to be that useful to you. Frink tries to be smart about this, and will in some cases rewrite interval arithmetic to avoid overdetermination, even when you want the intervals to be different.

  i[-3, 3]*i[-3, 3]
[0, 9] 

This is all moot if we assign the same interval (-3, 3) to different variables, though, because then interpretations [1] and [2] don't have any differences.

  x = i[-3, 3]
  y = i[-3, 3]
  x*y
[-9, 9]

How this affects you

# Python time!
>>> x = 0.3
>>> x + 0.00000000000000001
0.3
>>> x + 0.00000000000000002
0.3
>>> x + 0.00000000000000003
0.30000000000000004

If a (double precision) floating point calculation lands in the interval (0.3 - 3e-17, 0.3 + 3e-17), it's collapsed to the value 0.3. This has led to some proposals to use interval arithmetic instead of floating point. An interval may be less precise than a single FP number, but it will be more accurate: you won't have an exact number but you'll know that the correct number is somewhere in the bounds. A chain of FP computations, on the other hand, will always give you an exact number, but it could be very far from the actual answer.

These proposals have been around as long as floating point. The father of the IEEE-754 standard, William Kahan, wrote an essay about why interval arithmetic can't replace FP. His argument comes back to overestimation: in most computations, interval boundaries grow too big to be useful.

He wrote that back in 2006. Since then, there's been one big new interval arithmetic proposal: John Gustafson's universal numbers. I think his posit version isn't intervals but I'm not sure? This is well outside my area of expertise. I can tell, though, that Gustafson and Kahan hate each other.

That's the main connection between intervals and computer science. Of course, there are also particular domains where intervals are really important, such as in manufacturing tolerances.


Logic for Programmers update

14k words! Writing a bunch of exercises and examples this week.


  1. Or (0, τ) for you tau lovers out there 

Logic for Programmers Update

21 June 2024 at 18:02

I spent the early week recovering and the later week working on Logic for Programmers ([init] [update]) because I have a self-imposed deadline of mid-July, backed up by a $1000 toxx clause. Here's where I currently am:

1: The book is now 14k words. About 4k are "basics", covering propositional logic, predicate logic, and set theory. Another 1500 words or so are things like the introduction, appendixes, and preliminary exercises. The rest is "techniques". I'm hoping for 20k words for the pre-alpha first draft.

(I shared an early version of the "basics" on my patreon.)

2: Based on some feedback from Saul Pwanson, I've organized each "technique" as a three parter: first a common programming situation, then how logic applies, then something cool that knowing logic enables. For example, testing → specifications of functions → property testing. The other techniques are

  • Functional correctness → contracts
  • Data invariants → stronger database constraints
  • Case coverage → decision tables
  • Conditionals → simplifying conditionals

The last "technique" is an overview of tools that you need some knowledge of logic to start using in the first place: constraint solving, logic programming, formal verification, formal specification. I'm trying to decide if-and-how to cover Z3 and SAT solving, or if there's anything else I'm missing.

2.5: Two techniques I haven't written about but think could be good material: state invariants, modeling requirements. Not sure about the best approaches to either. I'm also looking for other possible applications of logic in general. The key is that it can't just be something that logic helps you appreciate, logic has to help you do it better.

3: Beyond adding those new sections, I want to have more details of technique, more examples, and some exercises. I don't know the right proportion of them yet.

I wrote a special exercise directive: for html, it puts the solution in a dropdown, while for pdfs it puts it in the back of the book. I really like how Sphinx supports these kinds of extensions, though I wish the ecosystem was a little more robust (or the docs better).

4: I think that when the alpha version is out in July, the book is about 40% done. This is because I suspect that reader feedback will lead to a lot of revisions and improvements to the content, as well as ideas for new technique chapters. After that's incorporated, it's 80% done, and that's when I'll figure out nice typesetting and a good cover image.

5: There's now a leanpub page. I wanted to figure out the logistics of distribution early. Please do not buy it yet.

Have a great weekend!

Nondeterminism in Formal Specification

11 June 2024 at 14:30

Just an unordered collections of thoughts on this.

In programming languages, nondeterminism tends to come from randomness, concurrency, or external forces (like user input or other systems). In specification languages, we also have nondeterminism as a means of abstraction. Say we're checking if a user submitted the correct password. A software implementation might look something like this:

# pythonish
def attempt_login(user, pw):
    submitted_hash = hash_password(pw)
    actual_hash = db.get_password(uid=user.id)
    if submitted_hash == actual_hash:
        login()
    else:
        reject()

There's lots of implementation details in confirming the correct password: how we hash it, how we determine the right hashing mechanism, where we retrieve the actual password from, what query we use to retrieve it, etc. At a higher level of design, though, we might not care about any of that: we only want to model what happens if the password is correct, not how we determine correctness. So we can use nondeterminism to handwave that all away:

\* TLA+
AttemptLogin ==
    \/ Login \* \/ means `either`
    \/ RejectLogin

When the model checker tries to evaluate AttemptLogin, it will branch and create separate checking executions for picking Login or RejectLogin. This gives us the consequences of "user submits the wrong password" without having to write a bunch of spec figuring out what the wrong password is.
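As a toy illustration of that branching (my own sketch, nothing like TLC's actual implementation): evaluating the nondeterministic action means forking the check, one execution per choice, and checking the property in every fork.

# Python
def attempt_login(state):
    # \/ Login \/ RejectLogin: return every possible successor state
    return [dict(state, logged_in=True),    # Login
            dict(state, logged_in=False)]   # RejectLogin

def check(state, depth, invariant):
    assert invariant(state), f"invariant violated in {state}"
    if depth > 0:
        for successor in attempt_login(state):   # fork: one execution per pick
            check(successor, depth - 1, invariant)

check({"logged_in": False}, depth=3, invariant=lambda s: isinstance(s["logged_in"], bool))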

Mastering abstraction-nondeterminism is key to writing good specifications. It's why you can model a complex, detailed system with a much smaller specification and find bugs that map back to actual possibilities.

Two challenges of nondeterminism

  1. Nondeterminism is incredibly capable, which makes it not very tractable (ibid). It's hard to express properties on subsets of the model-space, like "If a user never gets rejected, then property XYZ always holds."
  2. Abstraction-nondeterminism pushes the spec further from the real-world implementation, so it's harder to see that your implementation matches it.

These tend to be bigger problems if the nondeterminism is deep in your design, kinda like functional core, imperative shell. We want to push it closer to the edges:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    \E pwd \in {GOOD, BAD}:
        AttemptLogin(pwd)

It's still nondeterministic, but it happens earlier, so is easier to control. We could for example swap {GOOD, BAD} with {GOOD} to guarantee that login is always successful.

In the extreme case, each component of the machine is locally deterministic, even if the components are concurrent. This is closest to something implementable. The world (ibid) can continue to be nondeterministic.

Initial states and prophecy variables

Specification languages can also have multiple initial states. We can use this to push nondeterminism into the spec setup by adding variables that encode what events will happen. For example, we could initialize the variable attempts_needed as an integer in 0..3, then write AttemptLogin like this:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    IF attempts_needed > 0 THEN
        /\ AttemptLogin(BAD)
        /\ attempts_needed' = attempts_needed - 1
    ELSE 
        /\ AttemptLogin(GOOD)

Now, for each initial state, the spec is wholly deterministic, and thus tractable. We can easily distinguish properties that hold if the user never sends the wrong password (attempts_needed = 0) vs if they get it wrong a few times.1 We can also use this to check only a subset of the specification at a time, which can help with checking very large state spaces.

Leslie Lamport (the inventor of TLA+) calls these kinds of helpers prophecy variables, because they "predict" what nondeterministic action the system will take. You can't always use prophecy variables to completely remove nondeterminism from your specification, but it can help!

Refinement and nondeterminism

One of the main use cases of refinement— proving one specification implements another— is to remove abstraction nondeterminism from a system. I really need to add something about this to learntla.

An Exercise

I think this might be a good exercise to practice writing nondeterminism:

In the game Get Fifteen, players take turns picking numbers 1..9. Players can pick each number only once and cannot pick numbers their opponent picked. Winner is the first player to have three picked numbers that add to 15. This is identical to playing TicTacToe, but is easier to formally specify (no grid).2

Model a player playing tictactoe against an AI. Either can start first. The player needs to follow a deterministic strategy you implement. The AI just nondeterministically picks any available number. The property to check is "the player never loses".

(I haven't actually modeled this, but if people give it an honest shot and run into problems I'd be happy to give the specs a look!)


Plan for June: Logic for Programmers

I've decided to make a serious attempt to write an acceptable first draft of Logic for Programmers by July 15. If I don't have something I can readily turn into a Leanpub early-access by then, I will give $1k to a random newsletter subscriber (or donate it to a charity of their choice). I reserve the right to push the deadline if I get a new consulting contract.


  1. one thing we've lost is the user can no longer fail infinite times; they're always guaranteed to log in eventually. This is fixable in a few ways. 

  2. If you're not seeing it, try overlaying a tictactoe board on top of a magic square

I've been thinking about tradeoffs all wrong

5 June 2024 at 15:14

Sup nerds, I'm back from DDD Europe! Still processing the conference and adjusting to the horrible jetlag but also itching to get writing again, so let's go.

The opening keynote to the conference was called "Modern Tradeoff Analysis for Software Architecture". It was mostly about the factors that went into whether you should split up services into microservices and glue them together. Unsurprisingly "tradeoff" came up a whole lot: between throughput and latency, between simple and distributed transactions, between stability and volatility, etc. At the end he gave the example of comparing a queue-based architecture (SQS) to a pub-sub topic-based architecture (SNS).1 He said that the queues made it harder to add new services, but the topics make it harder to do access-control.2

That's when I realized I've been thinking about tradeoffs all wrong!

See, normally I think of tradeoffs as framing two positive things, like saying "SQS has better access-control while SNS makes it easier to add new services". Instead the speaker framed it as two wholly negative things, and the tradeoff is which negative thing you want less of. We can do this with any tradeoff, because any positive quality about X can be turned into a negative quality of Y. Instead of "X is faster, Y uses less space", we can say "X uses more space, Y is slower."

I don't know if this framing was intentional on the speaker's part, but I like it so much more. What's the difference? Even if you pick the least-bothersome drawback, you still have a drawback. Going back to the SQS vs SNS example, let's say you hate how hard it is to add new services to SQS, so you switch to SNS. Then you realize it's hard to do access-control, and this is a worse problem than the SQS problem was. So the right tradeoff is to go back to using SQS.

But now it's hard to add new services again. And that sucks! You don't get to stop thinking about the pros and cons just because you made a decision. There's still a problem you have to deal with.

Two rules of thumb

Rule one: after I make a decision about a tradeoff, I need a plan to address the negative consequences of my choice, in a way that doesn't also add in the negative consequences of the other choice.

Rule two: this explains a lot of the movement (or more cynically, churn) in software tech. Even when we make the right choice overall, in the moment we're still under a lot of pain, and so we look for new tools/technologies/practices that address the pain we currently feel.

Bonus rule: part of the reason we do Software Mimicry is to reduce the pain of a choice we made.


The conference talk was my last big project for spring. My consulting calendar is free so I'll try to devote some time to my personal projects in "progress" (learntla updates, alloy workshop, logic book). It's a little tough without a hard deadline!

No "things I'm curious about" because I'm still v. tired


  1. He used SQS and SNS because they're the names of the AWS services. I'm going to use them as shorthands for the architectures because SQS is easier to write than "queue-based architecture" 

  2. I don't know if this actually true, but I'll go with it because that's how he framed it. 

NoCode Will Not Bring Computing to the Masses

21 May 2024 at 17:54

No Newsletter next week

I'll be giving my conference talk at DDD Europe.

NoCode Will Not Bring Computing to the Masses

I don't have a whole lot of time this week so here's something that's been on my mind a little. I haven't researched any of the factual claims; consider this more some incoherent early-stage theory.

Define a "NoCode" tool as "any tool made to help people without any programming experience do things that normally require programming." Most software engineers have a very dim view of NoCode, mostly because the primary way SEs interact with it is to clean up the messes left by overly ambitious nocoders. I don't think it'd be controversial to say there's a wide variety of problems nonprogrammers should not try to solve with NoCode tooling.

On the other side of the spectrum, there are a lot of problems that are trivial to do if you can program, and impossible without it. The canonical example I use is "upcasing a string"; others are "find strings of digits in a document", "figure out each student's average score across several tests", etc. If you don't know how to program, you either have to find someone else's program that does this thing or do it manually. This is where I think NoCode has real potential to help the average computer user: it can let them do simple programming tasks without having to learn all of programming first.

The goal, then, is to figure out what defines the space of "trivial for programmers, impossible for regulars". I think a good first-metric would be "equivalent lines of python". If something can be done in L=10 lines of not-very-clever Python then it should hypothetically be nocodable. If it takes H=50 lines, it definitely should not be. In between is a grey area. These numbers are just off the top of my head; finding the real values of L and H would require a lot of research and user examples.1

NoCode adapted to this space probably wouldn't be that profitable, but it would do the most good for the most number of people.

"The most good" will do the least good

I wanted to talk about some of the consequences for NoCode design, but couldn't figure out where to start, so did the wise thing and took a walk. On the walk I realized that no matter how much a tool empowers the user, almost nobody will use it.

Proof: how many programmers have iPhones? How many of them use Shortcuts? I mean, I have Tasker on Android, I know how to use it and have used it in the past, and still I regularly don't use it.

It's not enough for a tool to solve your problem for you to use that tool. You have to also A) recognize the problem you have is solvable, B) recognize the tool can solve it, and C) be reasonably confident that you personally can solve the problem with the tool. You need knowledge, skill, and the right mentality. Even programmers, who already have that mentality, don't automatically generalize it to all applicable domains. Otherwise we'd be doing a lot more with our phones.

It's unreasonable to expect the average person will use any NoCode to solve their problems, regardless of how easy the tools are. This leaves the people for whom it provides economic value, which is why all modern NoCode tools are oriented towards business.

Do people want grey markets?

There's two exceptions to this rule I can think of.2 The first is Excel/GSheets, which we do see regular people pick up. I'd venture this is because "very simple data organization" and "semi-complicated math stuff" are things enough people commonly need. Maybe also they have gentle on-ramps: you can solve basic problems while knowing very little. Finally, there could be a social aspect; everybody already knows about spreadsheets, so it's not as big a jump to start using them.

The other major exception is that I've seen a lot of people who do use tools like Shortcuts and Tasker… as consumers of other people's published automations. These tend to look more like "scripts" than "applications", suggesting that our official marketplaces are designed to deliver applications, but people also need scripts.3 The NoCode gives them a "grey market" they can use to install scripts without using the official heavyweight channels.

One negative consequence of this is that any widely-used NoCode also becomes a potent vector for malware, because the people who are running scripts are ill-equipped to audit them.

Ok, I checked the code and it said (and please bare with me here):

-If flibbity is frabjous, zibble wobble at 42% and blippity blop playlist; otherwise, schnoofle Grobber Mode.

SOMEBODY HELP — source


Trying to pull all of my thoughts into something coherent:

  1. People need scripts.
  2. NoCode can make scripting much easier.
  3. Even if NoCode makes scripting easier, people won't actually write scripts.
  4. What people want is to be easily able to find and run someone else's script like they do an application.
  5. NoCode could make finding and running other people's scripts easier, too.
  6. Doing so is potentially very dangerous, as people aren't able to evaluate scripts for safety.

  1. Have you ever noticed that High > Low, but H in vim moves you backwards and L moves you forward? Deep, man. 

  2. IFTTT/Zapier might be a third exception: people connecting disparate services with a NoCode tool. But I don't have enough firsthand knowledge to be able to say anything about them, and I don't know how popular they actually are. 

  3. The difference between "scripts" and "applications" seems very important to all this. See Why bother with Scripting? for a good discussion on this. 

What I look for in empirical software papers

16 May 2024 at 16:58

Behind on the talk and still reading a lot of research papers on empirical software engineering (ESE). Evaluating a paper studying software engineers is a slightly different skill than evaluating other kinds of CS papers, which feel a little closer to hard sciences or mathematics. So even people used to grinding through type theory or AI papers can find ESE studies offputting.

So here are some of my thoughts on how I approach it. This is mostly unedited and unorganized, just getting my thoughts down.

The high-level view

The goal of ESE is to understand how we develop software better in the real world. Compared to "regular" computer science, ESE has two unique problems:

  1. As with all kinds of knowledge work, software development is highly varied and lots of factors affect the final outcome.
  2. Software engineers develop new tools/processes/paradigms incredibly quickly and there's maybe a 1000:1 ratio of developers to researchers.

This makes it hard to say anything for certain with a high level of confidence. Instead of thinking of whether papers are correct or incorrect, I look for things that make me trust it more or less. I'll use this interchangeably with "like/dislike": "I like this study" is the same as "I trust it more".

Quantitative Metrics

I like metrics that are clear, unambiguous, and sensible to practitioners. "Quality" and "defects" are too ambiguous, "coupling" and "cohesion" aren't sensible. "Number of support tickets filed" and "distribution of model code lengths" are better.

In practice, clear and unambiguous metrics end up being more narrowly-scoped. I trust research on narrow metrics more than research on broad metrics, even if the former are less applicable to software at large.

Clinical Trials

I don't like "clinical trials", where developers are split into control and experiment groups and asked to complete a task. The problem is sample size: actual clinical trials in medicine are run on thousands of people at a time and cost millions of dollars. Most software clinical trials study less than 50 engineers at a time. With such a small sample size, there's just too much room for confounding factors.

It doesn't help that most studies are done on university students, which may not generalize to more experienced professionals.

There's a couple of exceptions, though. I'm more inclined to trust clinical trials on how to teach CS, as students are the target population anyway. And I pay more attention to clinical trials that find enormous effects. For example, Need for Sleep finds that a night of sleep deprivation reduced the trial group's coding quality by fifty percent.1 That's not the kind of effect you can write off as a confounding factor!

(I'd still like to see it replicated, though. Also, does mild sleep deprivation affect coding skill?)

Other Quantitative Research

Scraping GitHub for data is fine, as long as the researchers are careful to audit the data. Don't publish a groundbreaking paper claiming a correlation between language and program quality but accidentally count each bitcoin bug as 17 C++ bugs.

Developer surveys are good for measuring things like developer perceptions, emotional state, how widespread a technology is, etc. I don't think they're good at measuring the effectiveness of practices or determining if those opinions match reality. IE if a survey showed that 90% of developers think their Java code is buggier than their Kotlin code, this tells us that developers perceive Kotlin as higher quality, not that it actually is!

Qualitative Studies

I like qualitative studies, things that aren't gathering hard metrics. If a quantitative study of correctness in imperative vs logic programming would be "which has fewer bugs" or "what kinds of bugs does each type have", qualitative would be more like "how do imperative and logic programmers debug stuff differently" or "let's dissect this one unusual bug and see if we learn anything interesting."

IME science fields undervalue the impact of qualitative research. I find a lot less of it than quantitative papers, but I tend to trust the qualitative papers more. Maybe they're just overall higher quality; more likely it's because they're more narrowly scoped. And I believe (without evidence) that a field needs a strong foundation of qualitative knowledge to do stellar quantitative work.

Some papers do both qualitative and quantitative work. The researchers in Abbreviated vs. Full-word Identifier Names started with a clinical trial, and then shadowed individual developers to see how they navigated a code file. I trust the quantitative sections of hybrid papers a little more than just-quantitative papers, but it's not an enormous factor.

"Quantitative" and "qualitative" aren't words anymore.

Reading the Paper

I use the same basic approach with all CS papers: read the abstract and intro, read the conclusion and post-results discussion, then decide if I care enough to look further. If I like the results, I read the methodology more carefully. I skim background info and "related work" mostly to see if they make any claims I'd want to read up on.

Sometimes I'll use an AI to generate a paper summary. I only do this if I'm going to read the full paper; the summary is just to help me digest it.

I trust papers that share their data.

Threats to validity

Most software papers have a "threats to validity" section, where they talk about possible confounding factors. In practice it's mostly a way for the writers to show they covered their bases, by explaining why said confounding factors aren't a problem. I like papers which include TtVs they can't dismiss, or which discuss unusual-but-important TtVs. I don't like papers which have bad explanations for why TtVs aren't a problem, or that miss really obvious TtVs.

Example of doing it wrong: Quantifying Detectable Bugs in JavaScript claims that typescript catches 15% of "public bugs for public projects on GitHub". Their threats to validity were "we only looked at public bugs" and "we may have been too inexperienced in typescript to catch even more bugs." One TtV they do not list is that they uniformly sampled all of GitHub for bugs and didn't filter the projects at all. When I checked some of their projects, a bunch were simple lesson repos by novices in code bootcamps.2 Why didn't they mention this TtV?

Example of doing it right: oh God it's already Thursday and I haven't gotten this newsletter out yet

References

For many papers, the most useful part is the reference section, because odds are I'll find a more interesting paper there. Even if none of the references look interesting, I'll still check a few to make sure the original paper is using them correctly. I don't like it when a paper says reference R makes claim C about topic T, and then I read R and it never even mentions T.

I wish journals let papers have two reference sections, one for the really essential references and one for the incidental ones, like background information or related work.

...Okay, that's all I got this week. Back to the talk mines.


Things I'm curious about

  • DORA Report: What's the state of research about the DORA report? Has anybody been able to replicate their findings? Has anyone evaluated them critically?
  • Puzzle Games: Puzzle games like flow free have thousands of levels. Are these programmatically generated? If so, how is it done, and how do they control the difficulty?

  1. Measured as "number of tests passed", from a researcher-provided test suite for the task. 

  2. I checked this back in 2019 or so, but it seems like the data is now offline. Props for making the data available! 

Paradigms succeed when you can strip them for parts

7 May 2024 at 16:48

I'm speaking at DDD Europe about Empirical Software Engineering!1 I have complicated thoughts about ESE and foolishly decided to update my talk to cover studies on DDD, so I'm going to be spending a lot of time doing research. Newsletters for the next few weeks may be light.

The other day I was catching myself up on the recent ABC conjecture drama (if you know you know) and got reminded of this old comment:

The methods in the papers in question can be used relatively quickly to obtain new non-trivial results of interest (or even a new proof of an existing non-trivial result) in an existing field. In the case of Perelman’s work, already by the fifth page of the first paper Perelman had a novel interpretation of Ricci flow as a gradient flow which looked very promising, and by the seventh page he had used this interpretation to establish a “no breathers” theorem for the Ricci flow that, while being far short of what was needed to finish off the Poincare conjecture, was already a new and interesting result, and I think was one of the reasons why experts in the field were immediately convinced that there was lots of good stuff in these papers. — Terence Tao

"Perelman's work" was proving the Poincaré conjecture. His sketch was 68 pages of extremely dense math. Tao's heuristic told him it was worth working through is that it had interesting ideas that were useful to other mathematicians, even if the final paper was potentially wrong.

I use this idea a lot in thinking about software: the most interesting paradigms are ones you can "scavenge" ideas from. Even if the whole paradigm isn't useful to you, you can pull out some isolated ideas, throw away the rest, and still benefit from it. These paradigms are the ones that are most likely to spread, as opposed to paradigms that require total buyin.

Let's take as an example functional programming (FP). ML/Haskell-style FP has a lot of interconnected ideas: higher-order functions, pattern matching, algebraic data types, immutable data structures, etc. But if you throw all of that away and stick to writing everything in idiomatic Java/Python/C, and the only change you make is writing more pure functions, then you've already won.
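As a tiny sketch of what that one change looks like (the function and field names here are invented): same imperative Python, but the calculation is pulled out into a pure function that can be tested and reused on its own.

# Before: the calculation is tangled up with printing.
def report(orders):
    total = 0
    for o in orders:
        total += o["price"] * o["qty"]
    print(f"Total: {total}")

# After: the calculation is a pure function; only the printing stays impure.
def order_total(orders):
    return sum(o["price"] * o["qty"] for o in orders)

def report_pure(orders):
    print(f"Total: {order_total(orders)}")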

This has several positive effects on FP adoption. It immediately demonstrates that FP has good ideas and is not just some weird cult practice. It also means people can "dip their toes in" and benefit without having to go all in. Then they can judge if they're satisfied or want to go further into learning FP. And if they do go further, they can learn new ideas gradually in stages.

By contrast, compare array programming languages (APLs). There's lot of cool ideas in APL, but they're so interconnected it's hard to scavenge anything out. If I want to use multidimensional arrays in Python I can't just whip it up myself, I have to import an external library like numpy and learn a whole bunch of new ideas at once. So most of the ideas in APL stay in APLs.3

A related idea is that techniques are only adopted if you can "half-ass" them. TDD is useful even if you only use it sometimes and phone in the "green-refactor" steps, which helped adoption. Cleanroom depends on everybody doing it properly all the time, which hindered adoption.

Scavenging from formal methods

Part of the reason I'm thinking about this is that business has been real slow lately and I'm thinking about how I can make formal methods (FM) more popular. What ideas can we scavenge from writing specifications?

Obvious ones are predicate logic and the implication operator, but they're both poorly supported by existing programming languages, and I'm not sure how to best pitch them.2 Also, they're "too big", and it takes a lot of practice to wrap your mind around using quantifiers and implication.

Sets, definitely. Sets are unordered, unique collections of objects. Lots of things we store as lists could be sets instead. I use sets way more after learning some FM.
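A minimal example of the kind of swap I mean (the tag names are invented):

# As a list: duplicates and ordering are extra states we never wanted.
tags = ["urgent", "billing", "urgent"]

# As a set: uniqueness is guaranteed by the data structure itself.
tags = {"urgent", "billing"}
tags.add("urgent")                    # still only one "urgent"
assert tags == {"urgent", "billing"}
assert "billing" in tags              # membership checks read like the intent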

There's also a lot of high-level ideas in FM that usefully apply to all other code. System invariants are things that should always hold in every possible state. More broadly, we can categorize properties: "safety" is roughly "bad things don't happen", "reachability" is "good things always could happen", and "liveness" is "good things always do happen". Fairness, maybe. Refinement is definitely too in-the-weeds to be scavengable.

I'd also call out decision tables as a particular technique that's extractable. You can teach decision tables in five minutes to a completely nontechnical person.
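Part of the pitch is that a decision table can be written down directly as data, with the code reduced to a lookup. The discount rules below are invented for illustration:

# Conditions: (is_member, order_over_100) -> discount percent.
# Every combination is listed, so a missing case is immediately visible.
DISCOUNT = {
    (True,  True):  15,
    (True,  False): 10,
    (False, True):   5,
    (False, False):  0,
}

def discount(is_member, order_over_100):
    return DISCOUNT[(is_member, order_over_100)]

assert discount(True, False) == 10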

As I said, short newsletters for the rest of this month. Have a great week!


Things I'm curious about

  • Logic Programming: What ideas are scavengable from LP? I know that pattern matching originated in LP and found its way to FP from there, but that's associated with FP more now.
  • Nontechnical People: What's a good word for them? "Nondev" seems like it leaves out data scientists, DBAs, etc. "Normie" and "Muggle" are too offensive. "Layfolk" just sounds weird.
  • TLA+ workshops: Would people be interested in an "intermediate" TLA+ workshop, for people who've already written production specs but want to get better? I'm thinking half a day, 6-8 people, with people bringing up what they want to learn about in advance. If there's enough people who'd want this I can plan something out.

  1. Domain-Driven Design. 

  2. This is one big reason my attempt to write a book on predicate logic has been in a long stall. 

  3. One idea that mathematicians scavenged from APLs is the Iverson bracket: [P] = P ? 1 : 0

"Integration tests" are just vibes

1 May 2024 at 15:52

New blog post! Software Friction is about how all the small issues we run into developing software cause us to miss deadlines, and how we can address some of them. Patreon here.

"Integration tests" are just vibes

You should write more unit tests than integration tests. You should write more integration tests than unit tests. "Integrated tests" are a scam.

But what exactly is an integration test? Is it tests that cross module boundaries? Is it tests that depend on several pieces of behavior, even in one module? Is it anything that tests "emergence"1? Anything with side effects? If get_active_users() runs a database query, are all tests against it integration tests?

No wonder both Twitter and Google have (at one point) claimed that "integration test" wasn't a useful term. There are just too many different things people mean by "integration".2

Even if we can't define "integration test", we can still recognize one. There are tests that are bigger than our "unit tests" and smaller than an end-to-end test, and people have strong opinions about writing more or less of them. We can have a sense of something being an "integration test" even if we can't strictly define it.

I think it's more useful to think of integration tests in terms of family resemblances. There are certain measurable properties of tests, and integration tests tend to measure differently on those properties than unit tests do. One quality is test speed: unit tests are "fast", integration tests are "slow". This does not mean all integration tests are slower than any unit test (unit property tests can be very slow). But slow tests are more likely to be integration tests than unit tests.

"Integration tests" are a vibe we get based on how we perceive a test's properties. Some of these properties are:

  1. The number of lines of production code executed as part of the test.

  2. Number of boundaries crossed, module/object or otherwise.

  3. Number of unit tests you can decompose the test into.

  4. Number of external services required to run the test.

  5. Amount of engineering effort to write/run the tests. Does it need separate files, a separate language, setting up lots of docker instances?

  6. Nonlocality of discovered bugs: if the test fails, how many lines you have to look at to find the problem.

  7. Likelihood a test failure is due to "emergence": things are correct in isolation but interact in a problematic way.

The more of these have high values, the more likely a test is to be an integration test.

Miscellaneous Thoughts

A test that hits the database (external service) can still be a unit test if it runs quickly and covers a small amount of code. Tests that make HTTP calls are less likely to be fast, so more likely to be integration tests.

Mocks and pact-style testing push some of these values down, by reducing boundaries crossed and lines of code executed at the cost of increased test infrastructure. Contracts and assertions increase bug locality, since the test fails the moment a pre/postcondition is broken instead of when the test completes.

If you push the values high enough, do you get end-to-end tests instead of integration tests? Or are E2E tests fundamentally different from integration tests in some way?

Is it possible for a unit test to uncover emergence, or is emergence a sufficient condition to say that something is definitely an integration test?


Things I'm curious about

Gonna make this a weekly thing maybe (or whenever I have enough new stuff to ask about).

  • Sokoban. Sokoban is "easier" than Chess or GO (PSPACE vs EXPTIME), but Sokoban AIs regularly fail at simple puzzles that humans can do (ref ex1 ex2). What makes Sokoban "so much harder" for AIs? I don't think it's just a lack of interest because I've found Google and Deepmind papers on solving sokoban (and not making much progress).
  • NFCs: I impulse-bought a bunch of NFC chips. I'm already using them to open train trackers, what else can I do with them? I'm not a mobile dev but can whip up Taskers and server code pretty quickly.

If you know something about either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Geeze I can't believe I tried to make "automanual" a thing 

  2. "Unit test" is also fuzzy but we at least agree that add(2, 3) == 5 is a unit test. 

"Testing can show the presence of bugs but not the absence"

23 April 2024 at 20:19

Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger Dijkstra, Notes on Structured Programming

Dijkstra was famous for his spicy quips; he'd feel right at home on tech social media. He said things he knew weren't absolutely true but would get people listening to him. In the same essay he discusses how testing could, in specific cases, show the absence of bugs:

  • If a function has a finite set of inputs, like floor(float), you could prove correctness by simply testing all possible inputs.
  • Even if a function has infinite inputs, tests prove the absence of bugs for tested inputs. You could say "date_of_easter works for any date within 10,000 years of today" and have "proven the absence of bugs" for the subset of dates people care about.1
  • You can use testing to complete a proof. For example, you can divide inputs into "equivalence classes", prove that if the function is correct for one value it will also be correct for all values in that class, and then test one value from each class.

In the majority case, however, testing cannot show the absence of bugs. Testing f(16) does not give you any guarantees about all x. f(x).

There are two possible takeaways. The reasonable one is that we shouldn't be blindly confident in testing, should use a variety of approaches as part of defense in depth, and always be careful about what we claim. The unreasonable takeaway is that testing is an inferior verification method, and we should use more "rigorous" approaches instead of testing.

It's unreasonable because methods that show the existence of bugs are more robust than methods that show the absence of them, and this makes them more useful more of the time.

Comparing tests and proof

Let's say you have a pseudocode function like this:

fun elem_before_x(s: int[], x: int) returns (r: int | NULL) {
    let i: int = index(s, x);
    if (i == NULL) {
        return NULL;
    } else return s[i-1];
}

Property:
 if r == NULL then x notin s
 else some i in 0..<len(s): s[i] == r && s[i+1] == x

The bug here is that we're out-of-bounds if x is the first element of the array. Testing will find this bug if you remember to test that case. Otherwise, we'd have a bug and a passing test suite. Tests can't verify the absence of bugs.
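To make that concrete, here's my rough Python translation of the pseudocode, plus a small test suite that happily passes even though the bug is still there. (In Python the out-of-bounds index becomes a silent wrap-around to the last element, which is arguably worse.)

def elem_before_x(s, x):
    if x not in s:
        return None
    return s[s.index(x) - 1]   # bug: when x == s[0], this wraps around to s[-1]

# All green, because nobody remembered to test x at index 0.
assert elem_before_x([1, 2, 3], 3) == 2
assert elem_before_x([1, 2, 3], 2) == 1
assert elem_before_x([1, 2, 3], 9) is None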

Now imagine we do find it and fix it:

else if (i == 0) {
        return s.last;
    } 

Modifying the property is left as an exercise to the reader. Once again, the passing test suite means we haven't verified the absence of the bug. Now, how does formal verification do on the same problem? I wrote a Dafny version of the same problem, here, both with and without the fix. Looking at v1, we see it finds the s[0] error:

>   var v := index.Extract();
>   return Some(s[v-1]);

(19,16): Error: index out of range    
   |
19 |     return Some(s[v-1]);

This error also goes away in v2: the bug is proved to be absent. But v2 adds a new error:

> ensures if r.None? then x !in s
>           else if s[|s|-1] == r.value then s[0] == x
>           else s[IndexOf(s, r.value)+1] == x

(7,20): Error: index out of range
  |
7 |             else if s[|s|-1] == r.value then s[0] == x
  |                      ^^^^^^^

When would s[|s|-1] be an index out of range? When the list is empty. But we know that's not possible: if we return a non-null value then the list must not be empty, because we found x in it! But the prover doesn't know that, it can't infer the relationship between the returned value and what we know about the list.2 We have to annotate our code with proof hints for it to recognize that if we return a number, the list is nonempty.

In short: the prover definitely finds the out-of-bounds bug, but also raises a spurious "empty list" bug. The test might not find the former but also won't raise the latter. Tests can have false negatives but not false positives, while provers can have false positives but not false negatives.

(I'd say "formal verification can prove the absence of a bug but never the presence", but that's not true. Dafny can sometimes give you a concrete counterexample to a property, and type systems detect true positives like "you passed an array into an Int -> Int function" all the time.)

Knowing there "might" be a bug is not as useful as knowing there is one. Empirically speaking people mostly ignore compiler warnings and plenty of retrospectives find people ignore "might possibly be a bug" in even mission-critical software. Formal verification often becomes an "all or nothing" deal, either you totally prove the absence of bugs in programs or you get no benefits at all.

And that sucks! It takes a lot of effort to prove something correct, magnitudes more than running some tests. And that's why I consider testing "more robust": even a little bit of testing may find a true bug, while a little bit of verification doesn't get you much.

Correctness is not a thing

I'll end this by getting on my soapbox. Here's something that's really important to understand about "proving the absence of bugs":

You do not prove something's correct. You prove conformance to specification.

It's possible for code to conform to the specification and still have bugs. Two common cases are:

  1. The spec is wrong or incomplete, like v2.dfy is in the above example. Spoiler in the footnote.3
  2. The spec is based on an assumption that's not true in practice, like "no other program is writing to the same file."

This is also important because it puts the specification front-and-center. You can write tests without having a specification in mind, but you need a spec to do any kind of formal method. Doing FM trains you to think about specs, which in turn helps you write better tests.


Things I am interested in

  1. Crossovers from fields besides classical engineering, who worked professionally as something else before switching to software. What ideas from your old job as a journalist/librarian/social worker/acrobat could be useful in software, and vice versa?
  2. Wargaming in software. Have you designed or participated in a devops/security/system design/whatever simulated game? Was it helpful?

If you have insights into either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Though you could argue this isn't sufficient for security. If easter(-1e8) causes a memory overflow then that's a bug even if a normal user won't ever call it. 

  2. We're lucky that the Dafny standard library defines a postcondition for index functions saying the output is less than the length of the list. Otherwise the prover would raise a potential out-of-bounds error on s[IndexOf(s, x)]

  3. s[|s|-1] == r.value then s[0] == x isn't true for the input [a, r, x, r]. The good news is that Dafny refuses to verify the code because it didn't match the spec. I only realized this when I was error-checking this piece. (This doesn't refute the point I made above about false positives: the false positive was that s[|s|-1] was out of bounds!) 

What makes concurrency so hard?

16 April 2024 at 20:28

A lot of my formal specification projects involve concurrent or distributed system. That's in the sweet spot of "difficult to get right" and "severe costs to getting it wrong" that leads to people spending time and money on writing specifications. Given its relevance to my job, I spend an awful lot of time thinking about the nature of concurrency.

As the old joke goes, concurrency is one of the two hardest things in computer science. There are lots of "accidental" reasons why: it's hard to test, it's not composable, bugs can stay latent for a long time, etc. Is there anything that makes it essentially hard? Something that makes concurrent software, by its very nature, more difficult to write than synchronous software?

The reason I hear most often is that humans think linearly, not concurrently, so are ill-equipped to reason about race conditions. I disagree: in my experience, humans are very good at concurrent reasoning. We do concurrent reasoning every time we drive a car!

More generally, some studies find that if you frame concurrent systems in human terms ("meatspace modeling"), people get quite good at finding the race conditions. So while concurrency might be difficult to reason about, I don't think it's because of a fault in our brains.

In my opinion, a better basis is state space explosion. Concurrency is hard because concurrent systems can be in a lot of different possible states, and the number of states grows much faster than anyone is prepared for.

Behaviors, Interleavings, and Nondeterminism

Take agents1 {1, 2, … n}, each of which executes a linear sequence of steps. Different agents may have different algorithms. Think writing to or reading from a queue, incrementing a counter, anything like that. The first agent takes p1 atomic steps to complete, the second p2, etc. Each agent completes its program strictly in order, but another agent can interleave after every atomic step. If we have algorithms A1A2 and B1B2, they could execute as A1B1A2B2 or A1B1B2A2, but not A1B2B1A2.

Under those conditions, here's an equation for how many possible orderings of execution (behaviors) can happen:

(p1+p2+...)!
------------
p1!*p2!*...

If we have three agents all executing the same 2-step algorithm, that's 6!/8 = 90 distinct behaviors. Each step in the sequence can potentially lead to a distinct state, so there's 6*90=540 maximum distinct states (MDS). Any one of those can potentially be the "buggy" one.

In most cases, the actual state space will be significantly smaller than the MDS, as different behaviors will cross the same states. Counterbalancing this is just how fast this formula grows. Three 3-step agents gives us 1700 possible behaviors (15K MDS), four 2-step agents instead have 2500 (20K MDS). And this is all without any kind of nondeterminism! If one step in one of the agents can nondeterministically do one of three things (send message M₁, send message M₂, crash), that triples the number of behaviors and the MDS.
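Here's a small script that reproduces those numbers straight from the formula, if you want to play with other configurations:

from math import factorial, prod

def behaviors(steps):
    # Number of interleavings of agents with the given atomic step counts.
    return factorial(sum(steps)) // prod(factorial(p) for p in steps)

def max_distinct_states(steps):
    return sum(steps) * behaviors(steps)

print(behaviors([2, 2, 2]), max_distinct_states([2, 2, 2]))  # 90 540
print(behaviors([3, 3, 3]))     # 1680 (~1700), ~15K MDS
print(behaviors([2, 2, 2, 2]))  # 2520 (~2500), ~20K MDS
print(behaviors([4, 4, 4]))     # 34650 (~35k)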

It's pretty common for complex concurrent systems to have millions or tens of millions of states. Even the toy model I use in this article on model optimization has about 8 million distinct states (though with some work you can get it much lower). I mostly think about state spaces in terms of performance because large state spaces take a lot longer to model-check. But it's also why concurrency is essentially hard to reason about. If my theoretical MDS is two million states, my practical state space is just 1% that size, and my human brain can reason through 99.9% of the remaining states… that still leaves 20 edge cases I missed.

Shrinking the State Space

Here's a heuristic I use a lot:

All means of making concurrency 'easier' are concerned first and foremost with managing the state space.

It's not 100% true (no heuristic is) but it's like 60% true, and that's good enough. State spaces grow quickly and bigger state spaces cause problems. If we want to make maintainable concurrent systems, we need to start by shrinking the space.

Like look at threads. Threads share memory, and the thread scheduler has a lot of freedom to suspend threads. So you have lots of steps (potentially one per line of code)2 and any interleaving can lead to a distinct state. I can use programming constructs like mutexes and barriers to "prune" the state space and give me the behaviors I want, but given how big the state space can be, I have to do a lot of pruning to get the right behaviors. I can make mistakes in implementation, "misshape" the space (like by adding a deadlock), or not notice a buggy state I need to remove. Threads are very error prone.

I could instead switch to memory-isolated processes. The scheduler can still schedule whatever interleavings it wants but the processes can't muck with each other's internal memory. Internally a process is still executing A1A2A3A4, but if the only steps that involve external resources (or interprocess communication) are A2 and A4, then we only need to think about how A2A4 affects the state space. Three four-step agents have 35k interleavings, three two-step agents have only 90. That's a big improvement!

What else can we do? Low-level atomic instructions do more in a single step, so there's no room for interleaving. Database transactions take a lot of physical time but represent only one logical time step. Data mutations create new steps, which immutable data structures avoid by definition.

Languages have constructs to better prune the resulting state space: go's channels, promises/futures/async-await, nurseries, etc. I think you can also treat promises as a way of forcing "noninterleaving": wait until a future is ready to execute in full (or to the next yield point) and before execution. Please don't quote me on this.

I think CRDTs reduce state space by permitting interleavings, but arranging things so that external changes are commutative: A1B1 and B1A1 lead to the same final result, so they don't produce distinct states.

Again, this is all a very rough heuristic.

Limits

To a first-order approximation, smaller state space == good. But this doesn't account for the topology of the space: some spaces are gnarlier than others. One that has lots of different cycles will be harder to work with than one that's acyclic.3 Different forms of nondeterminism also matter: "the agent continues or restarts" leads to more complex behavior than "the writer sends message M₁ or M₂." These are both places where "humans are bad at reasoning about concurrency" appears again. We often work through concurrency by reducing groups of states to "state equivalence classes" that all have the same properties. Complicated state spaces have more equivalence classes to work through.

(Another way "humans are bad at reasoning about concurrency" can be a real thing: we might not notice that something is nonatomic.)

Some high-level paradigms can lead to particular state space topologies that have fewer interleavings or ones that have more equivalent states. I've heard people claim that fork-join and pipe-filter are especially easy for humans to reason about, which I take to mean "doesn't lead to a gnarly state space". Maybe also event-loops? Where does the actor model fit into all of this?

Another limit: there's a difference between buggy states and buggy behaviors. Some behaviors can go entirely through safe states but still cause a bug like "never reaches consistency". This is called a "liveness bug", and I talk more about them here. Liveness bugs are much harder to reason about.

Okay, that's enough philosophical rambling about concurrency. Concurrency is hard, don't feel bad if you struggle with it, it's not you, it's combinatorics.


Video Appearance

I was on David Giard's Technology and Friends talking about TLA+. Check it out here!


  1. An agent is any concurrent actor, like a thread, a process, a future, a human, etc. Concurrency literature uses "processes" a lot but that clashes with OS processes. 

  2. Or more! If writes are nonatomic, you can get data races, where two threads write to different bits in a byte at the same time. Then a variable assignment can be multiple steps in one line. 

  3. Is there any research correlating unusual subgraphs in state spaces with probable bugs? 

Some notes on for loops

10 April 2024 at 17:08

New Blogpost

Don't let Alloy facts make your specs a fiction, about formal methods practices (and a lot of Alloy). Patreon link here.

Some notes on for loops

I sometimes like to sharpen the axe by looking at a basic programming concept and seeing what I can pull out. In this case, for loops.

There are two basic forms of for loops:

for(i = 0; i < 10; i++)

for val in ["a", "b", "c"]:

The second is usually called a foreach loop, and I'll call the former "forstep", after the ALGOL-60 construct. Almost all languages have one of the two, some have both. While forstep loops show up early, foreach loops only appear in PLs ~20 years later.

Looking at this, here's some things I see:

Construct reduction

Given a foreach loop, I can syntactically translate it into a forstep loop:

for val in arr:
    # ...

-->

for(i = 0; i < arr.len; i++) {
    val = arr[i];
    // ...
}

There's a lot of terms we use to express this property. We can say that foreach is homomorphic or reducible to forstep, or that foreach can be constructed out of forstep, or others I can't think of rn.

If we could also construct forstep from foreach, we could say the two are equivalent or isomorphic. In theory we can do this if we had a range operator:

for(i = 0; i < 10; i += 2) 

--> 

for i in range(0, 10, step=2);

Most forstep implementations, though, can do things like modify the iterator in the middle of the loop:

for(i = 0; i < 10; i++) {
    // stuff
    if(cond) i = 0;
    // stuff
}

You can't convert this to a foreach loop: it's an "irreducible" feature.

Most people consider modifying the index bad practice, though. 99% of the use cases of forstep loops are constructible with foreach loops, so the two are morally equivalent. I'll treat them as roughly interchangeable going forward.

Primitives and Expressivity

For loops are reducible to while loops:

for(i = 0; i < 10; i += 1) 

-->

i = 0;
while i < 10 {
  // ...
  i++;
}

But while loops are not reducible to for loops:

while(user_cmd != "quit")

-->

???

Again, lots of terms to cover this: while loops are more "primitive", more "powerful", more "low-level", more "expressive", etc. If you have while loops, you don't need for loops, and if you have goto, you don't need either.

But you shouldn't use a goto when a while loop works or a while loop where a for loop works. This is a generalization of the rule of least power (RLP): given a problem, use the tool that is most specific to the problem. If you can iterate through a list with a foreach loop, don't use a while instead!

There's two reasons for this. One is that it's more communicative to other programmers. If they see a foreach loop, they know it's going to be used for some kind of iteration, while that's not clear with the while loop. Two is the capability/tractability tradeoff: the more limited a construct is, the easier it is to understand and work with. A foreach loop can do less than a while loop, but it's much easier to guarantee that it processes each item only once.

Programming languages evolve to replace the all-powerful "low-level" constructs with more limited constructs that reduce to the low-level ones. It's why you get things like interfaces over inheritance.

The corollary of this is that if you use a powerful construct, it should be because the limited one can't do what you want. Don't force an infinite for-loop!

Extensions to for loops

For loops reduce to while loops, while loops reduce to goto, everything reduces to goto. So there are some kinds of looping goto can do that for and while cannot.

They get a little closer to feature-parity by adding flow-control inside the loop: continue for advancing the loop early, break for terminating it. I didn't look into when or how these were added, but I'd guess they come from looking at what goto can do that loops could not, and adding features as needed.

"Imperative programming" is just for loops

I think one of the most basic functions of a programming language is to iterate through an array. So much so that you can distinguish different programming paradigms by how they iterate.

  • Imperative programming: for loops
  • OOP: a foreach method on iterable objects
  • Functional programming: recursion, reduce (we'll talk about this later)
  • Logic programming: I'm less familiar with this, but I think it's recursion + pattern matching (which ML and Haskell adopted)
  • Array programming: all operations take arrays by default.

I think this explains why arrays are universal while dictionaries are relatively new: there's no obvious basic operation that covers what you do with dicts to the degree that for covers what arrays do.

(Question: do list comprehensions lead to a new paradigm? How do you describe "iteration" in spreadsheet languages?)

Functional replacements

When people talk about replacements for foreach loops in FP, they usually bring up map and filter. But these are not equivalent. Rather map and filter are "preferred" to loops because they are more limited. It's another instance of RLP.

As far as I can tell, reduce is equivalent to foreach. Morally, anyway. This is interesting because reduce is considered part of the "big three" (along with map and filter), but it's also the one that's the "odd one out". It's the hardest for people to understand, and Haskell even has three versions of it. Is it confusing because it's so powerful?
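In Python terms, the claim is roughly that any foreach loop with an accumulator can be rewritten as a reduce. A sketch, not a formal argument:

from functools import reduce

# foreach version
total = 0
for x in [1, 2, 3, 4]:
    total += x * x

# reduce version: the loop body becomes the combining function
total2 = reduce(lambda acc, x: acc + x * x, [1, 2, 3, 4], 0)
assert total == total2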

Map and filter are composable in a way that foreach loops are not, but they're also weirdly less "composable" in one very specific way? Consider something of the form

map f (filter p arr)

-->

out = []
for x in arr:
  if p(x):
    out.append(f(x))

The map/filter iterates over the list twice, while the foreach loop iterates only once. I know Haskell can fuse maps together as an optimization but I don't think you can safely fuse arbitrary map/filters? I dunno.

You can fuse map and filter with a reduce:

reduce ((out, x) =>
        if p(x)
        then out ++ [f(x)]
        else out
       ), [], arr

This gets really awkward though.

I wonder if "reduce is equivalent to for loops" plus "reduce is awkward for fusion" somehow inspired clojure transducers.

I'll be writing this forever

Things I could have looked at in more detail but ran out of time and energy:

  • "Iterators" as a general concept.
  • Loop invariants and variants. This is really relevant to formal methods in particular.
  • Specific implementations of for loops in languages. ALGOL's loop expression-lists, Python's for else, Raku's phasers, javascript's for in/for of distinction, Ada and Chapel loop parallelization constructs... If I keep going I'll get something as long as this.

Following any of those leads would get us to different insights, as opposed to the main thing that ended up catching my interest with these notes, which is how the rule of least power applies to language constructs. I should probably write a blog post about it!

Anyway I think this kind of exercise, looking at something everybody takes for granted and trying to learn things from it, is really valuable for understanding software better. I don't know how interesting it is to read about, though. Lemme know if you think you'd like to see this for other stuff, or just share the essays that are downstream of the practice (like writing about RLP).


Podcast appearance: are we engineers?

I was on a podcast, talking about the crossover project and what software engineers can learn from traditional engineering. Check it out here!

Why do regexes use `$` and `^` as line anchors?

25 March 2024 at 16:40

Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you've got a blog come join us! You don't need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.

But I am writing a 3000-word hell essay, so I'll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:

b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?

I'll sideline ^ for now and focus on $. The original QED editor didn't have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:

Two minor devices offer additional convenience. The character "." refers to the current line and the character "$" to the last line in the buffer.

So $ already meant "the end of the buffer", and Ken adapted it to mean "the end of the line" in regexes.

Okay, but then why did Deutsch and Lampson use $ for "end of buffer"?

Things get tenuous

The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references2) that the SDS-930 used a Teletype Model 35 as input devices. The only information I can find about the model 35 is this sales brochure, which has a blurry picture of the keyboard:

Blurry picture of a keyboard

I squinted at it really hard and saw that it's missing the []{}\|^_@~ symbols. Of the remaining symbols, $ is by far the most "useless": up until programming it exclusively meant "dollars", whereas even something like # meant three different things. But also, $ is so important in business that every typewriter has one. So it's a natural pick as the "spare symbol".

Yes this is really tenuous and I'm not happy with it, but it's the best answer I got.

If we're willing to stick with it, we can also use it to explain why Ken chose ^ to mean "beginning of line". ^ isn't used in American English, and the only reason QED wasn't using it was because it wasn't on the Teletype Model 35. But Ken's keyboard did have ^, even when it wasn't standardized at the time, so he was able to use it.

(Why did it have ^? My best guess is that's because ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says "it follows ASCII" but didn't include many of the symbols, just uses the encoding format.)

So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!


  1. I had this question as part of a larger rabbit hole on the history of regular expressions, which might become a full blog post at some point in May. 

  2. I have a pathological hatred of secondary sources. 

What if the spec doesn't match the code?

19 March 2024 at 16:23

Whenever I talk about formal methods, I get the same question:

Can I use the spec to generate my code?

People are worried about two things. One, that they'll make a mistake implementing the specification and have bugs. Two, that over time the implementation will "drift" and people won't update the spec, leading to unexpected design issues.

These are valid fears, and both are things that happen when writing specifications! Which is why it extra sucks that my answer is usually "you can't". This newsletter will be a quick explanation as to why, plus why it's still a net win to write a spec, and an alternate trick you can do instead to keep things in sync.

Synchronization is too expensive

There's two ways we can keep code in sync with a spec. The first way is program extraction, where you take a spec and generate a complex program from it. Lots of theorem provers, like Coq and Agda, have this functionality. The second way is refinement, where you write code independently and then prove it somehow matches the spec. This is what the seL4 people did with their C code.

Extractable/refinable specs are much, much harder to write than an "unsynced" spec because you lose a level of abstraction. Specifications can exist at a higher level of abstraction than a program can, which is why they can be so much more concise and focused. For example, here's a quick PlusCal snippet1 for a reader pulling a value off a queue and adding it to that reader's total:

await Nonempty(queue);
with r \in Readers do
     total[r] := total[r] + Head(queue);
     queue := Tail(queue);
end with;

What would that look like in a program? You'd probably have code for setting up the readers, code for reading from the queue, code for persisting the value of total somewhere. If it's in another system like Postgres or redis, you'd have a client library for that, too. And around all of that you'd have logging, heartbeats, event hooks, a whole bunch of orthogonal concerns.

You might not even want to implement the whole spec! This is the machine vs world dichotomy: the spec can encode both the system you're designing and also the environment around that system. Let's say you're using AWS SQS for the queue, so there's a chance you can't connect to it, and a chance it duplicates a message. You can't implement either of those things in your system, only the code that detects and reacts to them. But adding those is par for the course in specification. It might look like this:

await Nonempty(queue);
with r \in Readers do
    either
        \* unreachable!
        status[r] := "error";
    or
        \* duplicate!
        total[r] := total[r] + Head(queue);
    or
        \* happy path
        total[r] := total[r] + Head(queue);
        queue := Tail(queue);
    end either;
end with;

To synchronize spec and code, you need to bring the spec closer to the level of code, which means losing a lot of abstraction power. It's definitely possible, but considerably more expensive. Synchronization is not affordable for most companies.

But specs are still worth it

Even if you can't keep the spec in sync, it's still a net benefit. Consider what happens if you have a bug in production:

  • With a spec: either you didn't implement the design correctly, or an assumption in the design ("the queue is perfectly reliable") was invalid
  • Without a spec: you didn't implement the design correctly, an assumption in the design was invalid, or the design itself has a flaw.

That last one is the most dangerous and the most expensive to fix — this talk estimated that one design flaw cost their team a year of rework. They're also really hard to correctly diagnose without a spec. I've often seen people think a design bug was actually an implementation bug. They'd "fix" the bug and it'd go away for just long enough for everybody to forget about it before suddenly resurfacing.

And then you gotta make sure that the new design isn't buggy, either.

That's why formal methods are a net win, even without syncing with code. With a verified design you can build systems faster and you can fix bugs faster, because you're confident that the bugs you got are "only" implementation errors.2

We're working on it

A confession: the "this is impossible" section was glazing over a lot of interesting work.

First of all, cheaply generating code from specs might not be viable in general, but it's possible for specific subsets of specification. This is mostly the realm of research projects like PGo, which converts (a variant of) PlusCal to Go code. You can see some examples here. I don't think it's going to be production ready for a long time, but at least it's being worked on.

A more exciting approach is generating tests from specs. I know a few companies who've done this, with the most public example being MongoDB. They discuss their approach in eXtreme modeling in practice:

We wrote a new TLA+ specification for this purpose and used the state space explored by the TLC model checker to generate C++ and Golang test cases for aspects of the OT algorithm. These test cases can be automatically generated and run after any change to the model, and ensure that both the C++ and Golang implementations make appropriate changes to maintain equivalence.

More specifically, they took a TLA+ spec and dumped the state graph, then decomposed it into a set of linear behaviors and converted those into about 5,000 tests. This works better than code generation because tests are, in a strange way, "more abstract" than code. Lots of implementations pass the same test suite. So it's less of a jump.
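I don't know the details of MongoDB's tooling, but the shape of the idea is simple enough to sketch. Suppose each behavior comes out of the model checker as a list of abstract steps; turning one into a test is then mostly string templating. Everything here (the step format, SystemUnderTest) is invented for illustration; a real pipeline would parse TLC's trace output instead:

# Hypothetical: a behavior is a list of abstract steps from the model checker.
def behavior_to_test(name, behavior):
    lines = [f"def test_{name}():", "    sut = SystemUnderTest()"]
    for step in behavior:
        lines.append(f"    sut.apply({step!r})")
    lines.append("    assert sut.is_consistent()")
    return "\n".join(lines)

print(behavior_to_test("insert_then_ack", [{"op": "insert", "doc": 1}, {"op": "ack"}]))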

The TLA+ model checker has a bunch of features to make test generation easier: simulation mode lets you grab random traces and ALIAS can fine-tune the trace outputs. There's also an IO extension for calling subprocesses from model checking, which can be used to drive a test harness.

Two important caveats. One, the MongoDB team wrote a bunch of custom code to make this work; there's no off-the-shelf solution for turning specs into tests. Second, they were specifying a low-level, deterministic program, so were able to keep the TLA+ relatively close to the code. Writing a test harness for a distributed system is a lot harder. It's still a promising avenue!


Comments Section

I got a lot of responses to my graph types essay. I compiled some of the most interesting ones into a "Comments Section", here. There's some cool stuff about datalog and Rubik's Cubes there!


  1. PlusCal is a DSL that compiles to pure TLA+. I'm using it here because the equivalent TLA+ constructs are a little harder for outsiders to read. 

  2. One weak point in this is the "An assumption in the design was wrong" problem, but generally it's much easier to adapt a working design that fails under one broken assumption than an ad hoc design that might not have worked in the first place. 

Be Suspicious of Success

16 October 2024 at 15:08

From Leslie Lamport's Specifying Systems:

You should be suspicious if [the model checker] does not find a violation of a liveness property... you should also be suspicious if [it] finds no errors when checking safety properties.

This is specifically in the context of model-checking a formal specification, but it's a widely applicable software principle. It's not enough for a program to work, it has to work for the right reasons. Code working for the wrong reasons is code that's going to break when you least expect it. And since "correct for right reasons" is a much narrower target than "correct for any possible reason", we can't assume our first success is actually our intended success.

Hence, BSOS: Be Suspicious of Success.

Some useful BSOS practices

The standard way of dealing with BSOS is verification. Tests, static checks, model checking, etc. We get more confident in our code if our verifications succeed. But then we also have to be suspicious of that success, too! How do I know whether my tests are passing because they're properly testing correct code or because they're failing to test incorrect code?

This is why test-driven development gurus tell people to write a failing test first. Then at least we know the tests are doing something (even if they still might not be testing what they want).

The other limit of verification is that it can't tell us why something succeeds. Mainstream verification methods are good at explaining why things fail— expected vs actual test output, type mismatches, specification error traces. Success isn't as "information-rich" as failure. How do you distinguish a faithful implementation of is_collatz_counterexample from return false?

A broader technique I follow is make it work, make it break. If code is working for the right reasons, I should be able to predict how to break it. This can be either a change in the runtime (this will livelock if we 10x the number of connections), or a change to the code itself (commenting out this line will cause property X to fail). 1 If the code still works even after the change, my model of the code is wrong and it was succeeding for the wrong reasons.

Happy and Sad Paths

A related topic (possibly subset?) is "happy and sad paths". The happy path of your code is the behavior when everything's going right: correct inputs, preconditions are satisfied, the data sources are present, etc. The sad path is all of the code that handles things going wrong. Retry mechanisms, insufficient user authority, database constraint violation, etc. In most software, the code supporting the sad paths dwarfs the code in the happy path.

BSOS says that I can't just show code works in the happy path, I also need to check it works in the sad path.

BSOS also says that I have to be suspicious when the sad path works properly, too.

Say I add a retry mechanism to my code to handle the failure mode of timeouts. I test the code and it works. Did the retry code actually run? Did it run regardless of the original response? Is it really doing exponential backoff? Will it stop after the maximum retry limit? Is the sad path code after the maximum retry limit working properly?
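Here's a sketch of what answering those questions can look like: inject a client that always times out and assert on the sad path itself, not just the final outcome. The retry helper and the fake client are invented for illustration.

import pytest

def fetch_with_retry(client, url, max_retries=3):
    for _ in range(max_retries):
        try:
            return client.get(url)
        except TimeoutError:
            pass  # a real version would also sleep with exponential backoff
    raise TimeoutError(f"gave up after {max_retries} attempts")

class AlwaysTimesOut:
    def __init__(self):
        self.calls = 0
    def get(self, url):
        self.calls += 1
        raise TimeoutError

def test_sad_path_gives_up_after_retrying():
    client = AlwaysTimesOut()
    with pytest.raises(TimeoutError):
        fetch_with_retry(client, "https://example.com")
    # Suspicious-of-success checks: did the retries actually happen, and did they stop?
    assert client.calls == 3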

One paper found that 35% of catastrophic distributed system failures were caused by "trivial mistakes in error handlers" (pg 9). These were in mature, battle-hardened programs. Be suspicious of success. Be more suspicious of sad path success.


Blog Rec

This week's blog rec is Red Blob Games!2 While primarily about computer game programming, the meat of the content is beautiful, interactive guides to general CS algorithms. Some highlights:

(I don't think his rss feed covers new interactive articles, only the blog specifically.)


  1. Jimmy Koppel once proposed that just as code has properties, code variations have superproperties. For example, "no modification to the codebase causes us to use a greater number of deprecated APIs." 

  2. Okay, it's more an article site, because there's also a Red Blob blog (which covers a lot of neat stuff, too). Maybe I should just rename this section to "site rec". 

How to convince engineers that formal methods is cool

8 October 2024 at 16:18

Sorry there was no newsletter last week! I got COVID. Still got it, which is why this one's also short.

Logic for Programmers v0.4

Now available! This version adds a chapter on TLA+, significantly expands the constraint solver chapter, and adds a "planner programming" section to the Logic Programming chapter. You can see the full release notes on the book page.

How to convince engineers that formal methods is cool

I have an open email for answering questions about formal methods,1 and one of the most common questions I get is "how do I convince my coworkers that this is worth doing?" Usually the context is that the reader is really into the idea of FM but their coworkers don't know it exists. The asker wants to both introduce FM and persuade their coworkers that it's useful.

In my experience as a consultant and advocate, I've found that there's only two consistently-effective ways to successfully pitch FM:

  1. Use FM to find an existing bug in a work system
  2. Show how FM finds a historical bug that's already been fixed.

Why this works

There's two main objections to FM that we need to address. The first is that FM is too academic and doesn't provide a tangible, practical benefit. The second is that FM is too hard; only PhDs and rocket scientists can economically use it. (Showing use cases from AWS et al isn't broadly persuasive because skeptics don't have any insight into how AWS functions.) Finding an existing bug hits both: it helped the team with a real problem, and it was done by a mere mortal.

Demonstrating FM on a historical bug isn't as effective: it only shows that formal methods could have helped, not that it actually does help. But people will usually remember the misery of debugging that problem. Bug war stories are popular for a reason!

Making historical bugs persuasive

So "live bug" is a stronger rec, but "historical bug" tends to be easier to show. This is because you know what you're looking for. It's easier to write a high-level spec on a system you already know, and show it finds a bug you already know about.

The trick to make it look convincing is to make the spec and bug as "natural" as possible. You can't make it seem like FM only found the bug because you had foreknowledge of what it was— then the whole exercise is too contrived. People will already know you had foreknowledge, of course, and are factoring that into their observations. You want to make the case that the spec you're writing is clear and obvious enough that an "ignorant" person could have written it. That means nothing contrived or suspicious.

This is a bit of a fuzzy definition, more a vibe than anything. Ask yourself "does this spec look like something that was tailor-made around this bug, or does it find the bug as a byproduct of being a regular spec?"

A good example of a "natural" spec is the bounded queue problem. It's a straight translation of some Java code with no properties besides deadlock checking. Usually you'll be at a higher level of abstraction, though.


Blog rec: arg min

This is a new section I want to try for a bit: recommending tech(/-adjacent) blogs that I like. This first one is going to be a bit niche: arg min is writing up lecture notes on "convex optimization". It's a cool look into the theory behind constraint solving. I don't understand most of the math but the prose is pretty approachable. Couple of highlights:

The blogger also talks about some other topics but I haven't read those posts much.


  1. As always, talking primarily about formal specification of systems (TLA+/Alloy/Spin), not formal verification of code (Dafny/SPARK/Agda). I talk about the differences a bit here (but I really need to write a more focused piece). 

Refactoring Invariants

24 September 2024 at 20:06

(Feeling a little sick so this one will be short.)

I'm often asked by clients to review their (usually TLA+) formal specifications. These specs are generally slower and more convoluted than an expert would write. I want to fix them up without changing the overall behavior of the spec or introducing subtle bugs.

To do this, I use a rather lovely feature of TLA+. Say I see a 100-line Foo action that I think I can refactor down to 20 lines. I'll first write a refactored version as a separate action NewFoo, then I run the model checker with the property

RefactorProp ==
    [][Foo <=> NewFoo]_vars

That's an intimidating nest of symbols but all it's saying is that every Foo step must also be a NewFoo step and vice versa. If the refactor ever does something different from the original action, the model-checker will report the exact behavior and transition it fails for. Conversely, if the model checker passes, I can safely assume they have identical behaviors.

This is a refactoring invariant:1 the old and new versions of functions have identical behavior. Refactoring invariants are superbly useful in formal specification. Software devs spend enough time refactoring that they'd be useful for coding, too.

Alas, refactoring invariants are a little harder to express in code. In TLA+ we're working with bounded state spaces, so the model checker can check the invariant for every possible state. Even a simple program can have an unbounded state space via an infinite number of possible function inputs.

(Also formal specifications are "pure" simulations while programs have side effects.)

The "normal" way to verify a program refactoring is to start out with a huge suite of oracle tests. This should catch a bad refactor via failing tests. The downside is that you might not have the test suite in the first place, or not one that covers your particular refactoring. Second, even if the test suite does, it only indirectly tests the invariant. It catches the refactoring error as a consequence of testing other stuff. What if we want to directly test the refactoring invariant?

Two ways of doing this

One: by pulling in formal methods. Ray Myers has a neat video on formally proving a refactoring is correct. That one's in the niche language ACL2, but he's also got one on refactoring C. You might not even need to prove the refactoring correct; you could probably get away with using an SMT solver to find counterexamples.

Two: by using property-based testing. Generate random inputs, pass them to both functions, and check that the outputs are identical. Using the python Hypothesis library:

from hypothesis import given
import hypothesis.strategies as st

# from the `gilded rose kata`
def update_quality(items: list[Item]):
    ...

def update_quality_new(items: list[Item]):
    ...

@given(st.lists(st.builds(Item)))
def test_refactoring(l):
    assert update_quality(l) == update_quality_new(l)

One tricky bit is if the function is part of a long call chain A -> B -> C, and you want to test that refactoring C' doesn't change the behavior of A. You'd have to add a B' that uses C' and then an A' that uses B'. Maybe you could instead create a branch, commit the change to C' in that branch, and then run a cross-branch test against each branch's A.

Impure functions are harder. The test now makes some side effect twice, which could spuriously break the refactoring invariant. You could instead test that the changes are the same, or have each function affect a different entity and then compare the two entities afterward. There's no general solution here though, and there might be No Good Way for a particular effectful refactoring.
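As one sketch of the "different entities" approach (FakeStore and archive_user are made up for this example), you can point the old and new versions at separate in-memory fakes and compare the writes they made rather than their return values:

class FakeStore:
    """In-memory stand-in that just records every write."""
    def __init__(self):
        self.writes = []

    def save(self, key, value):
        self.writes.append((key, value))

def archive_user_old(store, user):
    store.save(user["id"], {"archived": True, "name": user["name"]})

def archive_user_new(store, user):
    record = {"archived": True, "name": user["name"]}
    store.save(user["id"], record)

def test_same_side_effects():
    user = {"id": 7, "name": "ada"}
    old_store, new_store = FakeStore(), FakeStore()
    archive_user_old(old_store, user)
    archive_user_new(new_store, user)
    assert old_store.writes == new_store.writes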

Behavior-changing rewrites

We can apply similar ideas for rewrites that change behavior. Say we have an API, and v1 returns a list of user names while v2 returns a {version, userids} dict. Then we can find some transformation of v2 into v1, and run the refactoring invariant on that:

def v2_to_v1(v2_resp):
    return [User(user_id).name for user_id in v2_resp["userids"]]

@given(some_query_generator)
def test_refactoring(q):
    assert v1(q) == v2_to_v1(v2(q))

Fun fact: v2_to_v1 is a software homomorphism!


  1. Well technically it's an action property since it's on the transitions of states, not the states, but "refactor invariant" gets the idea across better. 

Goodhart's Law in Software Engineering

17 September 2024 at 16:33

Blog Hiatus

You might have noticed I haven't been updating my website. I haven't even looked at any of my drafts for the past three months. All that time is instead going into Logic for Programmers. I'll get back to the site when that's done or in 2025, whichever comes first. Newsletter and Patreon will still get regular updates.

(As a comparison, the book is now 22k words. That's like 11 blog posts!)

Goodhart's Law in Software Engineering

I recently got into an argument with some people about whether small functions were mostly a good idea or always 100% a good idea, and it reminded me a lot about Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

The weak version of this is that people have perverse incentives to game the metrics. If your metric is "number of bugs in the bug tracker", people will start spuriously closing bugs just to get the number down.

The strong version of the law is that even 100% honest pursuit of a metric, taken far enough, is harmful to your goals, and this is an inescapable consequence of the difference between metrics and values. We have metrics in the first place because what we actually care about is nonquantifiable. There's some thing we want more of, but we have no way of directly measuring that thing. We can measure something that looks like a rough approximation for our goal. But it's not our goal, and if we replace the metric with the goal, we start taking actions that favor the metric over the goal.

Say we want more reliable software. How do you measure "reliability"? You can't. But you can measure the number of bugs in the bug tracker, because fewer open bugs roughly means more reliability. This is not the same thing. I've seen bugs fixed in ways that made the system less reliable, but not in ways that translated into tracked bugs.

I am a firm believer in the strong version of Goodhart's law. Mostly because of this:

A peacock with its feathers out. The peacock is scremming

What does a peahen look for in a mate? A male with maximum fitness. What's a metric that approximates fitness? How nice the plumage is, because nicer plumage = more spare calories to waste on plumage.1 But that only approximates fitness, and over generations the plumage itself becomes the point at the cost of overall bird fitness. Sexual selection is Goodhart's law in action.

If the blind watchmaker can fall for Goodhart, people can too.

Examples in Engineering

Goodhart's law is a warning for pointy-haired bosses who come up with terrible metrics: lines added, feature points done, etc. I'm more interested in how it affects the metrics we set for ourselves that our bosses might never know about.

  • "Test coverage" is a proxy for how thoroughly we've tested our software. It diverges when we need to test lots of properties of the same lines of code, or when our worst bugs are emergent at the integration level.
  • "Cyclomatic complexity" and "function size" are proxies for code legibility. They diverges when we think about global module legibility, not local function legibility. Then too many functions can obscure the code and data flow.
  • Benchmarks are proxies for performant programs, and diverge when improving benchmarks slows down unbenchmarked operations.
  • Amount of time spent pairing/code reviewing/debugging/whatever proxies "being productive".
  • The DORA report is an interesting case, because it claims four metrics2 are proxies to ineffable goals like "elite performance" and employee satisfaction. It also argues that you should minimize commit size to improve the DORA metrics. A proxy of a proxy of a goal!

What can we do about this?

No, I do not know how to avoid a law that can hijack the process of evolution.

The 2023 DORA report suggests readers should avoid Goodhart's law and "assess a team's strength across a wide range of people, processes, and technical capabilities" (pg 10), which is kind of like saying the fix to production bugs is "don't write bugs". It's a guiding principle but not actionable advice that gets to that principle.

They also say "to use a combination of metrics to drive deeper understanding" (ibid), which makes more sense at first. If you have metrics X and Y to approximate goal G, then overoptimizing X might hurt Y, indicating you're getting further from G. In practice I've seen it turn into "we can't improve X because it'll hurt Y and we can't improve Y because it'll hurt X." This could mean we're at the best possible spot for G, but more often it means we're trapped very far from our goal. You could come up with a weighted combination of X and Y, like 0.7X + 0.3Y, but that too is a metric subject to Goodhart.

I guess the best I can do is say "use your best engineering judgement"? Evolution is mindless, people aren't. Again, not an actionable or scalable bit of advice, but as I grow older I keep finding "use your best judgement" is all we can do. Knowledge work is ineffable and irreducible.


  1. This sent me down a rabbit hole; turns out scientists are still debating what exactly the peacock's tail is used for! Is it sexual selection? Adverse signalling? Something else??? 

  2. How soon commits get to production, deployment frequency, percent of deployments that cause errors in production, and mean time to recovery. 

Why Not Comments

10 September 2024 at 19:40

Logic For Programmers v0.3

Now available! It's a light release as I learn more about formatting a nice-looking book. You can see some of the differences between v2 and v3 here.

Why Not Comments

Code is written in a structured machine language, comments are written in an expressive human language. The "human language" bit makes comments more expressive and communicative than code. Code has a limited amount of something like human language contained in identifiers. "Comment the why, not the what" means to push as much information as possible into identifiers. Not all "what" can be embedded like this, but a lot can.

In recent years I see more people arguing that whys do not belong in comments either, that they can be embedded into LongFunctionNames or the names of test cases. Virtually all "self-documenting" codebases add documentation through the addition of identifiers.1

So what's something in the range of human expression that cannot be represented with more code?

Negative information, drawing attention to what's not there. The "why nots" of the system.

A Recent Example

This one comes from Logic for Programmers. For convoluted technical reasons the epub build wasn't translating math notation (\forall) into symbols (∀). I wrote a script to manually go through and replace tokens in math strings with unicode equivalents. The easiest way to do this is to call string = string.replace(old, new) for each one of the 16 math symbols I need to replace (some math strings have multiple symbols).

This is incredibly inefficient and I could instead do all 16 replacements in a single pass. But that would be a more complicated solution. So I did the simple way with a comment:

# Does 16 passes over each string
# BUT there are only 25 math strings in the book so far and most are <5 characters.
# So it's still fast enough.
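For concreteness, the approach looks roughly like this (the symbol table is illustrative, not the actual list of 16):

REPLACEMENTS = {
    r"\forall": "∀",
    r"\exists": "∃",
    r"\land": "∧",
    r"\lor": "∨",
}

def fix_math_string(s: str) -> str:
    # one full pass over s per symbol -- see the comment above for why
    # that's fine here
    for token, symbol in REPLACEMENTS.items():
        s = s.replace(token, symbol)
    return s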

You can think of this as a "why I'm using slow code", but you can also think of it as "why not fast code". It's calling attention to something that's not there.

Why the comment

If the slow code isn't causing any problems, why have a comment at all?

Well first of all the code might be a problem later. If a future version of LfP has hundreds of math strings instead of a couple dozen then this build step will bottleneck the whole build. Good to lay a signpost now so I know exactly what to fix later.

But even if the code is fine forever, the comment still does something important: it shows I'm aware of the tradeoff. Say I come back to my project two years from now, open epub_math_fixer.py and see my terrible slow code. I ask "why did I write something so terrible?" Was it inexperience, time crunch, or just a random mistake?

The negative comment tells me that I knew this was slow code, looked into the alternatives, and decided against optimizing. I don't have to spend a bunch of time reinvestigating only to come to the same conclusion.

Why this can't be self-documented

When I was first playing with this idea, someone told me that my negative comment isn't necessary, just name the function RunFewerTimesSlowerAndSimplerAlgorithmAfterConsideringTradeOffs. Aside from the issues of being long, not explaining the tradeoffs, and that I'd have to change it everywhere if I ever optimize the code... This would make the code less self-documenting. It doesn't tell you what the function actually does.

The core problem is that function and variable identifiers can only contain one clause of information. I can't store "what the function does" and "what tradeoffs it makes" in the same identifier.

What about replacing the comment with a test? I guess you could make a test that greps for math blocks in the book and fails if there's more than 80? But that's not testing EpubMathFixer directly. There's nothing in the function itself you can hook into.

That's the fundamental problem with self-documenting negative information. "Self-documentation" rides along with written code, and so describes what the code is doing. Negative information is about what the code is not doing.

End of newsletter speculation

I wonder if you can think of "why not" comments as a case of counterfactuals. If so, are "abstractions of human communication" impossible to self-document in general? Can you self-document an analogy? Uncertainty? An ethical claim?


  1. One interesting exception someone told me: they make code "more self-documenting" by turning comments into logging. I encouraged them to write it up as a blog post but so far they haven't. If they ever do I will link it here. 

Thoughts on "The Future of TLA+"

4 September 2024 at 17:00

Last week Leslie Lamport posted The Future of TLA+, saying "the future of TLA+ is in the hands of the TLA+ foundation". Lamport released TLA+ in 1999 and shepherded its development for the past 25 years. Transferring ownership of TLA+ to the official foundation has been in the works for a while now, and now it's time.

In the document, Lamport also talks about some of his hopes for how the foundation will think about evolving TLA+. He gives three possible costs of any change:

  1. Breaking existing specifications
  2. Complicating the language. "Simplicity is a major goal of TLA+".
  3. Complicating the tooling. "There are few people contributing to the maintenance of the tools."

With this, Lamport proposes removing several features, like octal numbers, submodules and tuple quantification.1

I would be sad if we lost tuple quantification, less so if we lost octal numbers. But what could be added to make TLA+ better? Features that don't just "save a few keystrokes" but significantly improve the expressiveness of our specifications.

Some disclaimers. First, where I'm coming from: most of my contracts are either "get beginners started with TLA+" or "help a company write an exotic spec at the limit of what TLA+ can do." So I'm always thinking about how to teach the language better and how to abuse its most obscure features. Did you know about labeled subexpressions? You can do magic with labeled subexpressions.

So I both want the skill floor lowered and the skill ceiling raised. This is often a contradictory goal.

Second, these aren't official proposals or anything. I'm just sharing what I'd like to see, given infinite resources and no concerns about tradeoffs.

Finally, most of these aren't changes to TLA+. They're for TLC.

TLA+ vs TLC

TLA+ is the specification language. TLC is the model checker used to check that specs written in that language satisfy certain properties. It's not the only tool; you could use TLAPS or Apalache instead. But TLC is the most mature tool and the only one bundled with the standard distribution.

This is a critical distinction for two reasons. One, TLC can only model-check a subset of TLA+. There's a set of four "forbidden operators" that cannot be used in any capacity (though the lead TLC developer [Markus Kuppe] is working on adding one). Lamport lists them in section 3.3 of his essay.

Two, TLC makes some decisions about how to resolve some unspecified TLA+ operations. For example, CHOOSE x \in set: P(x) is undefined if no value in set satisfies P (Specifying Systems pg 73). In TLC, it's an error:

Attempted to compute the value of an expression of form
CHOOSE x in S: P, but no element of S satisfied P.

This catches a lot of design bugs where you expected something to exist but it doesn't. But it also makes specifying defaults awkward. What if I want to get the record with a certain id, but if there isn't one, return a default value NoRecord? Then I have to write this:

RetrieveRecord(id) ==
    IF \E r \in records: r.id = id
    THEN CHOOSE r \in records: r.id = id
    ELSE NoRecord

If the condition is longer, this gets more cumbersome, and there's always a risk of them getting out of sync.

The upside of TLC being separate from TLA+ is that we can add features to TLC without changing the overall semantics of TLA+. That's what Markus is doing in the TLC module: things like TLCGet and Assert add new features to model checking TLA+ without adding new syntax to TLA+. It should be possible to make a new operator Try(Op, Fail) that evaluates Op and, if it fails, returns Fail instead. Then we can just do

RetrieveRecord(id) ==
    Try(CHOOSE r \in records: r.id = id, NoRecord)

With that in mind, most of the things I want in TLA+ are actually things I want in TLC, with one exception.

Things I wished were in TLA+/TLC

Try(Op, Fail)

It'd also be nice for equality checking. Right now "2" = 2 raises an error, even if I want it to be false.

(Could nested Trys lead to ambiguous behavior? Something to think about.)

A more obvious set filtering

This is the only change here that would affect TLA+ itself and not just TLC. This is map and filter in TLA+:

\* Map
{M(x): x \in set}
\* Filter
{x \in set: F(x)}

Beginners can never remember which is which. Even experienced people often forget which is which! I'd like a different keyword for each, so we could instead write

\* Filter
{x \in set WHERE F(x)}
\* Mapfilter
{M(x): x \in set WHERE F(x)}

Actual keyword may vary.

A linter

In the past few years TLA+ tooling has gotten a lot better. Among other things, there's now a debugger, a treesitter plugin, and an interactive spec explorer. I hope we soon see a linter, something that could detect TLA+ snippets with valid-but-unexpected behavior. In particular, something that could detect this:

/\ A
/\ B
 => C

That extra space before => means it's parsed as A /\ (B => C), not as the expected (A /\ B) => C. I once lost a day to this. I know at least three other people who also lost a day to this. It is 100% valid TLA+ and we desperately need a tool to tell us when it happens because my gosh is it a morale-killer.
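As a toy example of the kind of check I mean (a made-up heuristic, not a real tool), even something this crude would catch the trap above:

def lint_dangling_implication(spec_text: str) -> list[int]:
    """Flag lines that start with => right after a /\ or \/ junction item,
    since that usually means the implication silently bound to the last one."""
    warnings = []
    prev = ""
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("=>") and prev.strip().startswith(("/\\", "\\/")):
            warnings.append(lineno)
        if stripped:
            prev = line
    return warnings

assert lint_dangling_implication("/\\ A\n/\\ B\n => C") == [3]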

Support for the forbidden operators

There are four operators that are unsupported by TLC and virtually unused in any specs: \cdot, \EE, \AA, and -+->. I have no idea how to use \cdot but Markus is steadily adding support for it. \AA exists for completeness and even Lamport can't think of a good use for it. -+-> could be useful (It's the LTL weak release), but nobody's paid it any attention.

\EE is the interesting one, in that we know exactly how useful it would be. \EE x: P(x) is the temporal quantifier and is defined as "there exists a sequence of xs that make the temporal formula P(x) true for every step of the behavior." Critically, x can be a different value in each step, whereas \E x: P(x) would require the same value in each step.

Why is this so important? One word: refinement.

Say we have an abstract server model and we want to make a more detailed version of just the server. We can refine it by instantiating the abstract spec like this:

Abstract == INSTANCE AbstractServer 
    WITH abstract_var1 <- formula1,
         abstract_var2 <- formula2,
         \* ...

Refinement == Abstract!Spec

See the above link for more details. The trouble is that we need to apply substitutions for every variable. If the abstract spec models both the server and the client, our refinement also needs to refine the client state, too! And more frustratingly, if the abstract spec has some kind of bookkeeping variable, say a state-sweeper or auxiliary variable, you have to refine that, too.

But TLA+ provides an out here. Using \EE, we can write the refinement like this:

Abstract(aux_var, client_var) == INSTANCE AbstractServer 
    WITH abstract_var1 <- formula1,
         abstract_var2 <- formula2,
         \* ...

Refinement == \EE a, c: Abstract(a, c)!Spec

Lamport calls this variable hiding (SS pg 41).

Now I don't know much about the internals of TLC, but my best guess is that \EE would be extraordinarily expensive to check. I think a regular refinement you can check as you go, since you can generate the full abstract transition at the same time you generate the main spec's transition. But with variable hiding, you only have information to create part of the abstract transition. I think the only way to do it would be to generate the abstract state graph first and then look for a transition that matches your current transition.

That is at the very least an NP-Complete problem, one you have to compute for every edge in your refinement graph. But maybe there are some special cases (like aux variables that do not affect behavior) which would be cheaper to check?

\E-stealing

This one's both the most niche and probably the hardest to implement, so I don't expect to see it anytime soon. But this is my newsletter and it bothers me specifically.

In Composing TLA+ Specifications with State Machines I used this trick to compose two specs:

Action1 ==
    /\ state = "start"
    /\ DoThing
    /\ state' = "wait"

Sync ==
    LET i == << state, state' >> IN
    CASE i == << "start", "wait" >> ->
        SomeSyncRules
      [] << "wait, start" >> ->
        DifferentSyncRules

So far, so good. But what if I take a nondeterministic action?

Action2 ==
    /\ state = "wait"
    /\ \E x \in set:
        DoThing(x)
    /\ state' = "start"


Sync ==
    LET i == << state, state' >> IN
    CASE i == << "start", "wait" >> ->
        SomeSyncRules
      [] << "wait, start" >> ->
        ???

There's no way for the body of Sync to know which value Action2 picked for x. I've found Sync actions are a great way to compose specifications, and this "opaqueness" limits how strong it can be. It'd be great to have some way of side-channeling out what x was picked, without changing the essential semantics of Action2. Preferably without even needing to modify it.

This might also be useful for test case generation, since you can augment the output traces with more information.

I have no idea what this would look like or how it would behave. It's just a pipe dream of mine.

Other pipe dreams

A better function update syntax. Some way to partially refine one spec into multiple, seamlessly upgrading their individual variables into function variables (without too much boilerplate). Being able to inline a config file in a spec so you don't need two separate files (see the note below). Being able to call tlc without a config file and passing in flags. Support for -+->, just for giggles. Clock variables. Non-latex ASCII syntax. UNCHANGED (all vars except <>). Parallelized liveness checking. A tree that grants all my wishes.

(Note on inlining config files: this is now provisionally possible! You can see an example here, run it with tlc -config Github866.tla Github866.tla)

Comment Section

Buttondown now has comments! I'm experimentally enabling it for now, so you should be able to comment on the email's archive page. Again, this is experimental. If I end up having to spend all my time community moderating, comments are going down again. Be nice to each other.


  1. CHOOSE << x, y >> \in set \X set: x < y

State and time are the same thing

27 August 2024 at 18:13

Time is state

Imagine I put an ordinary ticking quartz clock in an empty room. I walk in, and ten minutes later I walk out with two photograph prints.1 In the 1st one, the second hand is pointing at the top of the clock, in the 2nd it's pointing at the bottom. Are these two copies of the same photo, or are they two different pictures?

A quartz clock with minute and second hands

There's no trick here, the answer is "different photos". Since the clock looks different, time must have passed between the two. More formally, we can represent the clock as a state vector of (hours, minutes, seconds). Since the two pictures have different state vectors, they must represent different photos.

Now I repeat the process and come out with two prints, both showing the same time. Are these two copies of the same photo or two different photos?

It's possible for the pictures to be different but there's no way to be certain. I could have taken the two photos half a second apart so that time passed but the clock didn't tick yet. There's no observational difference between "time didn't pass" and "time passed but the state vector is the same". We can model time as only passing in one-second increments, since any shorter passage of time is not reflected in the state vector.

Things would be different if you had access to the clock internals. The clock is powered by a quartz crystal that oscillates at approximately 2^15 hz, and a digital circuit inside the clock counts 2^15 oscillations as "one second". If you could read the memory inside the clock, then you could distinguish "00:00:15" and "00:00:15 + 2^14 oscillations".

But in our current system that state is internal to the clock. Until the circuits turn the internal value into an observable value, we cannot recognize the passage of time for the clock.

The only way we can see the passage of time is by measuring changes in observable state.

State is time

Pseudocode snippet:

void f(int &c) {
    c += 2;
}

I'm trying to say that we pass a reference of a variable into f, so int x = 1; f(x); print(x) will output 3. Calling f(x) permanently splits our program into two eras. Before f(x), all calculations with x will get one thing. Anno Effexi, all calculations with x will get another thing. The update to x advanced time.

Now a more complicated case:

void f(int &c) {
    c++;
    c++;
}

Does this advance time one step or two? Depends on whether the program is single-threaded or concurrent. If it's single threaded, when f executes there's nothing else that can read x in between the two updates, so the first mutation is "internal". Externally it looks like there was only one observable mutation x += 2. On the other hand, if the program is concurrent, it's possible for some other thread or w/e to read x in between the two statements of f. That makes the internal mutation observable, so time is advanced twice. There are now three eras of our program with different possible behaviors.
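Here's a tiny (and deliberately timing-dependent) Python sketch of the concurrent case, written just for illustration: a second thread sampling x can observe the state between the two increments, so the program passes through three eras instead of two.

import threading
import time

x = 0
observations = []

def f():
    global x
    x += 1           # internal step, observable only if someone reads now
    time.sleep(0.05)
    x += 1

def watcher():
    for _ in range(10):
        observations.append(x)
        time.sleep(0.01)

t_f = threading.Thread(target=f)
t_w = threading.Thread(target=watcher)
t_w.start()
t_f.start()
t_f.join()
t_w.join()
print(sorted(set(observations)))   # likely [0, 1, 2]: three observable eras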

Changes in state matter in that they create new times.

The consequences

Some functional programmers say "shared mutable state is the enemy". I think it's more like "time is the enemy", and time represents itself as mutable state. If a state update is purely internal and cannot affect the observable state (such as a statement in an uncommitted transaction), then it does not advance time.

I find this a good model to reason about abstract systems. It also motivates formal methods concepts like bisimulation and refinement, which I really need to do an updated post on.


  1. I've got a GendankenWorks instant digital camera that takes digital pictures but lets me print them too. I'm sure someone sells something like this. 

An idea for teaching formal methods better

21 August 2024 at 16:05

I was recently commissioned by a company to make a bespoke TLA+ workshop with a strong emphasis on reading specifications. I normally emphasize writing specs, so this one will need a different approach.

While working on it, I had an idea that might make teaching TLA+— and other formal methods— a little easier.

Pseudospecs

There are two problems to reading a spec:

  1. Learning the mental model of what TLA+ is actually doing
  2. Learning how to actually read TLA+.

The second problem is way, way bigger than it seems, because TLA+ has immense expressive power packed in a scary-looking syntax. Like take this spec:

---- MODULE foo ----
CONSTANT Servers
VARIABLE counters

Init ==
    counters = [s \in Servers |-> 0]

Zero(s) ==
    /\ counters[s] > 0
    /\ counters' = [counters EXCEPT ![s] = 0]

Inc(s) ==
    counters' = [counters EXCEPT ![s] = @ + 1]

Next ==
  \/ \E s \in Servers:
      \/ Zero(s)
      \/ Inc(s)

Spec == Init /\ [][Next]_counters
====

What's /\? What's ![s]? What's [][Next]_counters? To understand this spec, I have to understand how to read TLA+. This isn't an insurmountable obstacle, because otherwise nobody would know TLA+, but it could be the difference between "10 people learn TLA+" and "7 people learn TLA+".

My idea is to provide the spec along with a pseudospec, which could look like this:

Params {servers}

Initialize with {
   counters: Map(s in Servers: 0)
}

action Zero(s) {
    requires {
      counters[s] > 0
    }
    counters[s] := 0
}

action Inc(s) {
    counters[s] +:= 1
}

action Next {
  pick some s in Servers and { 
    either do Zero(s)
    or     do Inc(s)
  }
}

(Still working on what to do about the Spec operator.)

Couple things to notice. One: it's not a consistent translation between two languages, but between language and meaning. There's no formal grammar for the pseudospec. I need inconsistency for reasons I'll talk about later.

Two: the translation is not purely local. Things get moved around a little bit. Zero being an action affects how I translate \E.

Three: the translation is kind of metaphrastic. It's close to, but not exactly, a line-by-line translation. A person who has both files open in a split can correlate the spec and pseudospec.1

Most directly, this will help people understand what the spec is doing, which gives me a basis to teach model-checking, safety properties, etc. But I think this will also make learning the TLA+ syntax easier, by acting like an answer sheet. The student can read the TLA+ and, if they get stuck, look at the pseudospec for meaning.

My moonshot hope is that this also helps with writing specs, by giving people a clearer sense of the standard TLA+ idioms. I can translate the same syntax slightly differently depending on the way it's used, and whether it's a general construct or one for just the spec. That might be doing too much with this concept, though.

Problems I see

First is that there's lots of nuances that would be lost in the translation. In writing

Init ==
   /\ x \in 1..5

-->

Initialize with {
  x: pick some in 1..5
}

I lose the idea that Init isn't special, it's just a boolean operator like anything else. The student will have to unlearn that misconception if they ever need to do tricks like FastInit.

This is an essential tradeoff in pedagogy: is it better to teach the full picture now or the easy thing first and fix the misconception later? The teacher side of me knows that the white lie is better in the long run, but the expert side of me hates this.

Second, not every TLA+ symbol has a 1-1 mapping with English. I did the translation

\E s \in Servers:
  \/ Zero(s)
  \/ Inc(s)

-->

pick some s in Servers and { 
    either do Zero(s)
    or     do Inc(s)
}

That's because Zero and Inc are actions, things that update state. If they were regular operators P and Q, I'd translate it like this:

exists an s in S {
   P(s) || Q(s)
}

Depending on the context, I'll translate \E in two different ways.

This ties back to problem #1. In TLA+, these are the same thing. Performing an action is literally saying that it's a true description of the next-state relation. This has the consequence of looking like it updates the state relation, but that's only in certain (exceptionally common) circumstances. This is all incredibly elegant and intuitive to a TLA+ expert, but not for a beginner. So I have to introduce some inconsistencies.

Problem #3 is just how much syntax I'll need to translate. What do I do for [A -> B], WF_vars, or \AA? I can try to avoid them for the early specs but can't do that forever.

Overall, though, I think the potential benefit of easier learning will outweigh the drawbacks.

Will this work?

We'll find out!


  1. Experienced TLA+ users might notice that this is inspired by both PlusCal and Quint. I can't just teach those because the goal is to get people use TLA+ proper. (Also we really need a word for "experienced TLA+ user", like "Pythonista" or "Rustacean") 

Texttools dot py

14 August 2024 at 17:19

I make a lot of personal software tools. One of these is "texttools.py", which is easiest to explain with an image:

A GUI with the first three lines of this newsletter in the top, and the bottom has the three lines with markdown quote prefixes

Paste text in the top box, choose a transform, output appears in the bottom box. I can already do most of these transformations in vim, or with one of the many online tools out there, but I prefer my script for two reasons:

  1. There's no context switching. I don't have to leave my current tab or vim buffer and worry about cleanup later. One hotkey opens the GUI, <esc> closes it, with no break in my workflow.
  2. It loads in <10 ms.

It took me like an hour to make and I use it all the time. And it's small enough that I can just share the whole script here.

How it works

Texttools is a python script running a tkinter GUI. I used tkinter because it's a builtin; I would generally not recommend it if you have any better options. On the plus side, being a builtin means you don't need to install a package to use this yourself.

import tkinter as tk
from tkinter import N, S, E, W, ttk


# Complex transforms go here,
# Simple transforms are just lambdas
def _wordcount(s: str):
    "Returns a tuple of linecount, wordcount, charcount"
    return (len(s.splitlines()), len(s.split()), len(s))

transforms = [
        # Transforms go here, for example
        {"name": "One line", "transform": lambda x: " ".join(x.splitlines())}
        ,{"name": "Line/Word/Char", "transform": _wordcount}
        ]

class GUI():
    def __init__(self) -> None:
        self.root = tk.Tk()
        self.active_transform = lambda x: x # start with no transform
        self.layout_gui()


    def layout_gui(self) -> None:
        self.mainframe = ttk.Frame(self.root, padding="3 3 12 12")

        self.root.title("Text Tools")
        self.mainframe.grid(column=0, row=0, sticky=(N, W, E, S))
        self.root.columnconfigure(0, weight=1)
        self.root.rowconfigure(0, weight=1)
        self.content_box = tk.Text(self.mainframe, undo=True)
        self.content_box.grid(column=2, row=2)

        self.output_box = tk.Text(self.mainframe, undo=True)
        self.output_box.grid(column=2, row=3)

        self.transform_box = tk.Listbox(self.mainframe)
        self.transform_box.grid(column=1, row=2, rowspan=2)
        for t in transforms:
            self.transform_box.insert(tk.END, t["name"])

        # Keyboard bindings
        self.root.bind("<Escape>", lambda _: self.root.quit())
        self.content_box.bind("<Tab>", lambda _: self.transform_box.focus())

        # vvv makes clicking or pressing enter change the transform
        self.transform_box.bind("<Button-1>", self.select_transform)
        self.transform_box.bind("<Return>", self.select_transform)

        # vvv makes anything typed in update the output
        self.content_box.bind("<Key>", 
                              lambda _: self.root.after(1, self.update))

        self.content_box.focus()

    def update(self):
        content = self.content_box.get('1.0', tk.END)
        content = self.active_transform(content) 
        self.output_box.delete('1.0', tk.END)
        self.output_box.insert('1.0', content)

    def select_transform(self, _):
        try:
            selection = self.transform_box.curselection()[0]
            self.active_transform = transforms[selection]["transform"]
            self.update()
        except (ValueError, IndexError):
            pass

def main(): # Entry point for pyproject.toml
    gui = GUI()
    gui.root.mainloop()

if __name__ == "__main__":
    main()

Man I forget how much I dislike tkinter until I have to look at it again. If you want to add your own text tools, just put a new item in the global transforms array.
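For example, a "quote for markdown" transform like the one in the screenshot is just one more entry in that list (this particular entry is mine, not part of the script above):

        # prefix every line with a markdown quote marker
        ,{"name": "Markdown quote",
          "transform": lambda x: "\n".join("> " + line for line in x.splitlines())}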

To make it easier to run the script, I put it in a "toolkit" repo with this pyproject.toml:

[project]
name="toolkit"
version="0.0.1"
requires-python=">=3.8"

[project.gui-scripts]
toolkit-texttools = "toolkit.texttools:main"

Then running pip install -e . creates a toolkit-texttools binary (or in my case a .exe).

Finally, I wrote an AutoHotKey script so I could load it with a keyboard shortcut:

; & numpagedown + t both pressed at same time
NumpadPgDn & t:: toggle_app("Text Tools", "toolkit-texttools.exe")

I like mapping stuff to the numpad because it's guaranteed to not interfere with any OS or program-specific hotkeys.

Short newsletter this week because I'm still recovering from jetlag. See you all next week!

Why I prefer rST to markdown

31 July 2024 at 15:34

I just published a new version of Logic for Programmers! v0.2 has epub support, content on constraint solving and formal specification, and more! Get it here.

This is my second book written with Sphinx, after the new Learn TLA+. Sphinx uses a peculiar markup called reStructured Text (rST), which has a steeper learning curve than markdown. I only switched to it after writing a couple of books in markdown and deciding I needed something better. So I want to talk about why rst was that something.1

Why rst is better

The most important difference between rst and markdown is that markdown is a lightweight representation of html, while rst is a midweight representation of an abstract documentation tree.

It's easiest to see this with a comparison. Here's how to make an image in markdown:

![alttext](example.jpg)

Technically, you don't even need a parser for this. You just need a regex to transform it into <img alt="alttext" src="example.jpg"/>. Most modern markdown engines do parse this into an intermediate representation, but the essence of markdown is that it's a lightweight html notation.
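For instance, a single re.sub (my sketch, handling only this basic case) gets you the whole feature:

import re

md = "![alttext](example.jpg)"
html = re.sub(r"!\[(.*?)\]\((.*?)\)", r'<img alt="\1" src="\2"/>', md)
assert html == '<img alt="alttext" src="example.jpg"/>'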

Now here's how to make an image in rst:

.. image:: example.jpg
  :alt: alttext

.. image:: defines the image "directive". When Sphinx reads it, it looks up the registered handler for the directive, finds ImageDirective, invokes ImageDirective.run, which returns an image_node, which is an object with an alt field containing "alttext". Once Sphinx has processed all nodes, it passes the whole doctree to the HTML Writer, which looks up the rendering function for image_node, which tells it to output an <img> tag.

Whew that's a mouthful. And for all that implementation complexity, we get… an interface that has 3x the boilerplate as markdown.

On the other hand, the markdown image is hardcoded as a special case in the parser, while the rst image is not. It was added in the exact same way as every other directive in rst: register a handler for the directive, have the handler output a specific kind of node, and then register a renderer for that node for each builder you want.

This means you can extend Sphinx with new text objects! Say that instead of an <img>, you want a <figure> with a <figcaption>. In basic markdown you have to manually insert the html, with Sphinx you can just register a new figure directive. You can even make your FigureDirective subclass ImageDirective and have it do most of the heavy lifting.
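To give a flavor of what that looks like, here's a minimal sketch of a Sphinx extension registering such a directive (the directive name and node handling are illustrative; a real version would want options and error handling):

from docutils import nodes
from docutils.parsers.rst import Directive

class FigureWithCaption(Directive):
    required_arguments = 1   # the image path
    has_content = True       # the caption text

    def run(self):
        image = nodes.image(uri=self.arguments[0])
        caption = nodes.caption(text="\n".join(self.content))
        return [nodes.figure("", image, caption)]

def setup(app):
    app.add_directive("capfigure", FigureWithCaption)
    return {"parallel_read_safe": True}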

The second benefit is more subtle: you can transform the doctree before rendering it. This is how Sphinx handles cross-referencing: if I put a foo anchor in one document and :ref:`image <foo>` in another, Sphinx will insert the right URL during postprocessing. The transformation code is also first-class with the rest of the build process: I can configure a transform to only apply when I'm outputting html, have it trigger in a certain stage of building, or even remove a builtin transform I don't want to run.

Now, most people may not need this kind of power! Markdown is ubiquitous because it's lightweight and portable, and rst is anything but. But I need that power.

One use case

Logic for Programmers is a math-adjacent book, and all good math books need exercises for the reader. It's easier to write an exercise if I can put it and the solution right next to each other in the document. But for readers, I want the solutions to show up in the back of the book. I also want to link the two together, and since I might want to eventually print the book, the pdfs should also include page references. Plus they need to be rendered in different ways for latex (pdf) output and epub output. Overall lots of moving parts.

To handle this I wrote my own exercise extension.

.. in chapter.rst
.. exercise:: Fizzbuzz
  :name: ex-fizzbuzz

  An exercise

.. solution:: ex-fizzbuzz

  A solution

.. in answers.rst

.. solutionlist::

How these nodes are processed depends on my compilation target. I like to debug in HTML, so for HTML it just renders the exercise and solution inline.

When generating epub and latex, though, things work a little differently. After generating the whole doctree, I run a transform that moves every solution node from its original location to under solutionlist. Then it attaches a reference node to every exercise, linking it to the new solution location, and vice versa. So it starts like this (using Sphinx's "pseudoxml" format):

-- chapter.rst
<exercise_node ids="ex-fizzbuzz">
  <title>
    Fizzbuzz
  <paragraph>
    An exercise
<solution_node ids="ex-fizzbuzz-sol">
  <paragraph>
    A solution

-- answers.rst
<solutionlist_node>

And it becomes this:

-- chapter.rst
<exercise_node ids="ex-fizzbuzz">
  <title>
    Fizzbuzz
  <paragraph>
    An exercise
    <exsol_ref_node refuri="/path/to/answers#ex-fizzbuzz-sol">
      Solution

-- answers.rst
<solutionlist_node>
  <solution_node ids="ex-fizzbuzz-sol">
    <paragraph>
      A solution
      <exsol_ref_node refuri="/path/to/chapter#ex-fizzbuzz">
        (back)

The Latex builder renders this by wrapping each exercise and solution in an answers environment, while the epub builder renders the solution as a popup footnote.2 Making this work:

An example of solution popups on an epub reader

It's a complex dance of operations, but it works enormously well. It even helps with creating a "free sample" subset of the book: the back of the free sample only includes the solutions from the included subset, not the whole book!

"But I hate the syntax"

When I gush about rST to other programmers, this is the objection I hear the most: it's ugly.

To which I say, are you really going to avoid using a good tool just because it makes you puke? Because looking at it makes your stomach churn? Because it offends every fiber of your being?

...Okay yeah that's actually a pretty good reason not to use it. I can't get into lisps for the same reason. I'm not going to begrudge anybody who avoids a tool because it's ugly.

Maybe you'd find asciidoc more aesthetically pleasing? Or MyST? Or Typst? Or Pollen? Or even pandoc-extended markdown? There are lots of solid document builders out there! My point isn't that sphinx/rst is exceptionally good for largescale documentation, it's that simple markdown is exceptionally bad. It doesn't have a uniform extension syntax or native support for pre-render transforms.

This is why a lot of markdown-based documentation generators kind of hack on their own preprocessing step to support new use-cases, which works for the most part (unless you're trying to do something really crazy). But they have to work around the markdown, not in it, which limits how powerful they can be. It also means that most programmer tooling can't understand it well. There's LSP and treesitter for markdown and rst but not for gitbook-markdown or md-markdown or leanpub-markdown.3

But if you find a builder that uses markdown and satisfies your needs, more power to you! I just want to expose people to the idea that doc builders can be a lot more powerful than they might otherwise expect.


No newsletter next week

I'll be in Hong Kong.

Update 2024-07-31

Okay since this is blowing up online I'm going to throw in a quick explanation of Logic for Programmers for all of the non-regulars here. I'm working on a book about how formal logic is useful in day-to-day software engineering. It starts with a basic rundown of the math and then goes into eight different applications, such as property testing, database constraints, and decision tables. It's still in the alpha stages but already 20k words and has a lot of useful content. You can find it here. Reader feedback highly appreciated!


  1. rst is actually independent of Sphinx, but everybody I know who writes rst writes it because they're using Sphinx, so I'll use the two interchangeably. Also typing rST is annoying so I'm typing rst. 

  2. This is why I attach exsol_ref_nodes and not default reference_nodes. Sphinx's epub translator uses an attribute passlist I need to workaround in post-rendering. 

  3. This is also one place where rst's ugly syntax works in its favor. I've got a treesitter query that changes the body of todo directives and only todo directives, which is only possible because the rst syntax tree is much richer than the markdown syntax tree. 

Logic for Programmers Update

21 June 2024 at 18:02

I spent the early week recovering and the later week working on Logic for Programmers ([init] [update]) because I have a self-imposed deadline of mid-July, backed up by a $1000 toxx clause. Here's where I currently am:

1: The book is now 14k words. About 4k are "basics", covering propositional logic, predicate logic, and set theory. Another 1500 words or so are things like the introduction, appendixes, and preliminary exercises. The rest is "techniques". I'm hoping for 20k words for the pre-alpha first draft.

(I shared an early version of the "basics" on my patreon.)

2: Based on some feedback from Saul Pwanson, I've organized each "technique" as a three parter: first a common programming situation, then how logic applies, then something cool that knowing logic enables. For example, testing → specifications of functions → property testing. The other techniques are

  • Functional correctness → contracts
  • Data invariants → stronger database constraints
  • Case coverage → decision tables
  • Conditionals → simplifying conditionals

The last "technique" is an overview of tools that you need some knowledge of logic to start using in the first place: constraint solving, logic programming, formal verification, formal specification. I'm trying to decide if-and-how to cover Z3 and SAT solving, or if there's anything else I'm missing.

2.5: Two techniques I haven't written about but think could be good material: state invariants, modeling requirements. Not sure about the best approaches to either. I'm also looking for other possible applications of logic in general. The key is that it can't just be something that logic helps you appreciate, logic has to help you do it better.

3: Beyond adding those new sections, I want to have more details of technique, more examples, and some exercises. I don't know the right proportion of them yet.

I wrote a special exercise directive: for html, it puts the solution in a dropdown, while for pdfs it puts it in the back of the book. I really like how Sphinx supports these kinds of extensions, though I wish the ecosystem was a little more robust (or the docs better).

4: I think that when the alpha version is out in July, the book is about 40% done. This is because I suspect that reader feedback will lead to a lot of revisions and improvements to the content, as well as ideas for new technique chapters. After that's incorporated, it's 80% done, and that's when I'll figure out nice typesetting and a good cover image.

5: There's now a leanpub page. I wanted to figure out the logistics of distribution early. Please do not buy it yet.

Have a great weekend!

Nondeterminism in Formal Specification

11 June 2024 at 14:30

Just an unordered collections of thoughts on this.

In programming languages, nondeterminism tends to come from randomness, concurrency, or external forces (like user input or other systems). In specification languages, we also have nondeterminism as a means of abstraction. Say we're checking if a user submitted the correct password. A software implementation might look something like this:

# pythonish
def attempt_login(user, pw):
    submitted_hash = hash_password(pw)
    actual_hash = db.get_password(uid=user.id)
    if submitted_hash == actual_hash:
        login()
    else:
        reject()

There's lots of implementation details in confirming the correct password: how we hash it, how we determine the right hashing mechanism, where we retrieve the actual password from, what query we use to retrieve it, etc. At a higher level of design, though, we might not care about any of that: we only want to model what happens if the password is correct, not how we determine correctness. So we can use nondeterminism to handwave that all away:

\* TLA+
AttemptLogin ==
    \/ Login \* \/ means `either`
    \/ RejectLogin

When the model checker tries to evaluate AttemptLogin, it will branch and create separate checking executions for picking Login or RejectLogin. This gives us the consequences of "user submits the wrong password" without having to write a bunch of spec figuring out what the wrong password is.

Mastering abstraction-nondeterminism is key to writing good specifications. It's why you can model a complex, detailed system with a much smaller specification and find bugs that map back to actual possibilities.

Two challenges of nondeterminism

  1. Nondeterminism is incredibly capable, which makes it not very tractable (ibid). It's hard to express properties on subsets of the model-space, like "If a user never gets rejected, then property XYZ always holds."
  2. Abstraction-nondeterminism pushes the spec further from the real-world implementation, so it's harder to see that your implementation matches it.

These tend to be bigger problems if the nondeterminism is deep in your design, kinda like functional core, imperative shell. We want to push it closer to the edges:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    \E pwd \in {GOOD, BAD}:
        AttemptLogin(pwd)

It's still nondeterministic, but it happens earlier, so is easier to control. We could for example swap {GOOD, BAD} with {GOOD} to guarantee that login is always successful.

In the extreme case, each component of the machine is locally deterministic, even if the components are concurrent. This is closest to something implementable. The world (ibid) can continue to be nondeterministic.

Initial states and prophecy variables

Specification languages can also have multiple initial states. We can use this to push nondeterminism into the spec setup by adding variables that encode what events will happen. For example, we could initialize the variable attempts_needed as an integer in 0..3, then write AttemptLogin like this:

AttemptLogin(pwd) ==
    IF pwd = GOOD THEN
        Login
    ELSE
        RejectLogin

Next ==
    IF attempts_needed > 0 THEN
        /\ AttemptLogin(BAD)
        /\ attempts_needed' = attempts_needed - 1
    ELSE
        /\ AttemptLogin(GOOD)
        /\ UNCHANGED attempts_needed

Now, for each initial state, the spec is wholly deterministic, and thus tractable. We can easily distinguish properties that hold if the user never sends the wrong password (attempts_needed = 0) vs if they get it wrong a few times.1 We can also use this to check only a subset of the specification at a time, which can help with checking very large state spaces.

Leslie Lamport (the inventor of TLA+) calls these kinds of helpers prophecy variables, because they "predict" what nondeterministic action the system will take. You can't always use prophecy variables to completely remove nondeterminism from your specification, but it can help!

Refinement and nondeterminism

One of the main use cases of refinement— proving one specification implements another— is to remove abstraction nondeterminism from a system. I really need to add something about this to learntla.

An Exercise

I think this might be a good exercise to practice writing nondeterminism:

In the game Get Fifteen, players take turns picking numbers 1..9. Players can pick each number only once and cannot pick numbers their opponent picked. Winner is the first player to have three picked numbers that add to 15. This is identical to playing TicTacToe, but is easier to formally specify (no grid).2

Model a player playing tictactoe against an AI. Either can start first. The player needs to follow a deterministic strategy you implement. The AI just nondeterministically picks any available number. The property to check is "the player never loses".

(I haven't actually modeled this, but if people give it an honest shot and run into problems I'd be happy to give the specs a look!)
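If the magic-square connection in the footnote isn't clicking, here's a quick sanity check you can run (plain Python, nothing to do with the spec itself): there are exactly eight ways to pick three numbers from 1..9 that sum to 15, one per line of a tictactoe board.

# Python
from itertools import combinations

triples = [c for c in combinations(range(1, 10), 3) if sum(c) == 15]
print(len(triples))   # 8, one per tictactoe line
print(triples)        # (1,5,9), (1,6,8), (2,4,9), (2,5,8), (2,6,7), (3,4,8), (3,5,7), (4,5,6)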


Plan for June: Logic for Programmers

I've decided to make a serious attempt to write an acceptable first draft of Logic for Programmers by July 15. If I don't have something I can readily turn into a Leanpub early-access by then, I will give $1k to a random newsletter subscriber (or donate it to a charity of their choice). I reserve the right to push the deadline if I get a new consulting contract.


  1. One thing we've lost is that the user can no longer fail infinitely many times; they're always guaranteed to log in eventually. This is fixable in a few ways. 

  2. If you're not seeing it, try overlaying a tictactoe board on top of a magic square

I've been thinking about tradeoffs all wrong

5 June 2024 at 15:14

Sup nerds, I'm back from DDD Europe! Still processing the conference and adjusting to the horrible jetlag but also itching to get writing again, so let's go.

The opening keynote to the conference was called "Modern Tradeoff Analysis for Software Architecture". It was mostly about the factors that went into whether you should split up services into microservices and glue them together. Unsurprisingly "tradeoff" came up a whole lot: between throughput and latency, between simple and distributed transactions, between stability and volatility, etc. At the end he gave the example of comparing a queue-based architecture (SQS) to a pub-sub topic-based architecture (SNS).1 He said that queues make it harder to add new services, while topics make it harder to do access-control.2

That's when I realized I've been thinking about tradeoffs all wrong!

See, normally I think of tradeoffs as framing two positive things, like saying "SQS has better control while SNS is easier to add new services". Instead the speaker framed it as two wholly negative things, and the tradeoff is which negative thing you want less of. We can do this with any tradeoff, because any positive quality about X can be turned into a negative quality of Y. Instead of "X is faster, Y uses less space", we can say "X uses more space, Y is slower."

I don't know if this framing was intentional on the speaker's part, but I like it so much more. What's the difference? Even if you pick the least-bothersome drawback, you still have a drawback. Going back to the SQS vs SNS example, let's say you hate how hard adding new services to SQS is, so you switch to SNS. Then you realize it's hard to do access-control, and this is a worse problem than the SQS problem was. So the right tradeoff is to go back to using SQS.

But now it's hard to add new services again. And that sucks! You don't get to stop thinking about the pros and cons just because you made a decision. There's still a problem you have to deal with.

Two rules of thumb

Rule one: after I make a decision about a tradeoff, I need a plan to address the negative consequences of my choice, in a way that doesn't also add in the negative consequences of the other choice.

Rule two: this explains a lot of the movement (or more cynically, churn) in software tech. Even when we make the right choice overall, in the moment we're still under a lot of pain, and so we look for new tools/technologies/practices that address the pain we currently feel.

Bonus rule: part of the reason we do Software Mimicry is to reduce the pain of a choice we made.


The conference talk was my last big project for spring. My consulting calendar is free so I'll try to devote some time to my personal projects in "progress" (learntla updates, alloy workshop, logic book). It's a little tough without a hard deadline!

No "things I'm curious about" because I'm still v. tired


  1. He used SQS and SNS because they're the names of the AWS services. I'm going to use them as shorthands for the architectures because SQS is easier to write than "queue-based architecture" 

  2. I don't know if this is actually true, but I'll go with it because that's how he framed it. 

NoCode Will Not Bring Computing to the Masses

21 May 2024 at 17:54

No Newsletter next week

I'll be giving my conference talk at DDD Europe.

NoCode Will Not Bring Computing to the Masses

I don't have a whole lot of time this week so here's something that's been on my mind a little. I haven't researched any of the factual claims; consider this more some incoherent early-stage theory.

Define a "NoCode" tool as "any tool made to help people without any programming experience do things that normally require programming." Most software engineers have a very dim view of NoCode, mostly because the primary way SEs interact with it is to clean up the messes left by overly ambitious nocoders. I don't think it'd be controversial to say there's a wide variety of problems nonprogrammers should not try to solve with NoCode tooling.

On the other side of the spectrum, there are a lot of problems that are trivial to do if you can program, and impossible without it. The canonical example I use is "upcasing a string"; others are "find strings of digits in a document", "figure out each student's average score across several tests", etc. If you don't know how to program, you either have to find someone else's program that does this thing or do it manually. This is where I think NoCode has real potential to help the average computer user: it can let them do simple programming tasks without having to learn all of programming first.
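For a sense of scale, here's roughly what those canonical examples look like in Python. The file name and data are made up; the point is just how little code each task needs.

# Python
import re

text = open("report.txt").read()          # hypothetical input file

upcased = text.upper()                    # upcase a string
digit_runs = re.findall(r"\d+", text)     # find strings of digits in a document

# figure out each student's average score across several tests
scores = {"ayla": [90, 85, 72], "brin": [60, 88, 95]}
averages = {name: sum(s) / len(s) for name, s in scores.items()}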

The goal, then, is to figure out what defines the space of "trivial for programmers, impossible for regulars". I think a good first metric would be "equivalent lines of python". If something can be done in L=10 lines of not-very-clever Python then it should hypothetically be nocodable. If it takes H=50 lines, it definitely should not be. In between is a grey area. These numbers are just off the top of my head; finding the real values of L and H would require a lot of research and user examples.1

NoCode adapted to this space probably wouldn't be that profitable, but it would do the most good for the most number of people.

"The most good" will do the least good

I wanted to talk about some of the consequences for NoCode design, but couldn't figure out where to start, so did the wise thing and took a walk. On the walk I realized that no matter how much a tool empowers the user, almost nobody will use it.

Proof: how many programmers have iPhones? How many of them use Shortcuts? I mean, I have Tasker on Android, I know how to use it and have used it in the past, and still I regularly don't use it.

It's not enough for a tool to solve your problem for you to use that tool. You have to also A) recognize the problem you have is solvable, B) recognize the tool can solve it, and C) be reasonably confident that you personally can solve the problem with the tool. You need knowledge, skill, and the right mentality. Even programmers, who already have that mentality, don't automatically generalize it to all applicable domains. Otherwise we'd be doing a lot more with our phones.

It's unreasonable to expect the average person will use any NoCode to solve their problems, regardless of how easy the tools are. This leaves the people for whom it provides economic value, which is why all modern NoCode tools are oriented towards business.

Do people want grey markets?

There's two exceptions to this rule I can think of.2 The first is Excel/GSheets, which we do see regular people pick up. I'd venture this is because "very simple data organization" and "semi-complicated math stuff" are common enough things that enough people need them. Maybe also they have gentle on-ramps: you can solve basic problems while knowing very little. Finally, there could be a social aspect; everybody already knows about spreadsheets, so it's not as big a jump to start using them.

The other major exception is that I've seen a lot of people who do use tools like Shortcuts and Tasker… as consumers of other people's published automations. These tend to look more like "scripts" than "applications", suggesting that our official marketplaces are designed to deliver applications, but people also need scripts.3 The NoCode gives them a "grey market" they can use to install scripts without using the official heavyweight channels.

One negative consequence of this is that any widely-used NoCode also becomes a potent vector for malware, because the people who are running scripts are ill-equipped to audit them.

Ok, I checked the code and it said (and please bare with me here):

-If flibbity is frabjous, zibble wobble at 42% and blippity blop playlist; otherwise, schnoofle Grobber Mode.

SOMEBODY HELP — source


Trying to pull all of my thoughts into something coherent:

  1. People need scripts.
  2. NoCode can make scripting much easier.
  3. Even if NoCode makes scripting easier, people won't actually write scripts.
  4. What people want is to be easily able to find and run someone else's script like they do an application.
  5. NoCode could make finding and running other people's scripts easier, too.
  6. Doing so is potentially very dangerous, as people aren't able to evaluate scripts for safety.

  1. Have you ever noticed that High > Low, but H in vim moves you backwards and L moves you forward? Deep, man. 

  2. IFTTT/Zapier might be a third exception: people connecting disparate services with a NoCode tool. But I don't have enough firsthand knowledge to be able to say anything about them, and I don't know how popular they actually are. 

  3. The difference between "scripts" and "applications" seems very important to all this. See Why bother with Scripting? for a good discussion on this. 

What I look for in empirical software papers

16 May 2024 at 16:58

Behind on the talk and still reading a lot of research papers on empirical software engineering (ESE). Evaluating a paper studying software engineers is a slightly different skill than evaluating other kinds of CS papers, which feel a little closer to hard sciences or mathematics. So even people used to grinding through type theory or AI papers can find ESE studies off-putting.

So here are some of my thoughts on how I approach it. This is mostly unedited and unorganized, just getting my thoughts down.

The high-level view

The goal of ESE is to understand how we develop software better in the real world. Compared to "regular" computer science, ESE has two unique problems:

  1. As with all kinds of knowledge work, software development is highly varied and lots of factors affect the final outcome.
  2. Software engineers develop new tools/processes/paradigms incredibly quickly and there's maybe a 1000:1 ratio of developers to researchers.

This makes it hard to say anything for certain with a high level of confidence. Instead of thinking of whether papers are correct or incorrect, I look for things that make me trust it more or less. I'll use this interchangeably with "like/dislike": "I like this study" is the same as "I trust it more".

Quantitative Metrics

I like metrics that are clear, unambiguous, and sensible to practitioners. "Quality" and "defects" are too ambiguous, "coupling" and "cohesion" aren't sensible. "Number of support tickets filed" and "distribution of model code lengths" are better.

In practice, clear and unambiguous metrics end up being more narrowly-scoped. I trust research on narrow metrics more than research on broad metrics, even if the former are less applicable to software at large.

Clinical Trials

I don't like "clinical trials", where developers are split into control and experiment groups and asked to complete a task. The problem is sample size: actual clinical trials in medicine are run on thousands of people at a time and cost millions of dollars. Most software clinical trials study less than 50 engineers at a time. With such a small sample size, there's just too much room for confounding factors.

It doesn't help that most studies are done on university students, which may not generalize to more experienced professionals.

There's a couple of exceptions, though. I'm more inclined to trust clinical trials on how to teach CS, as students are the target population anyway. And I pay more attention to clinical trials that find enormous effects. For example, Need for Sleep finds that a night of sleep deprivation reduced the trial group's coding quality by fifty percent.1 That's not the kind of effect you can write off as a confounding factor!

(I'd still like to see it replicated, though. Also, does mild sleep deprivation affect coding skill?)

Other Quantitative Research

Scraping GitHub for data is fine, as long as the researchers are careful to audit the data. Don't publish a groundbreaking paper claiming a correlation between language and program quality but accidentally count each bitcoin bug as 17 C++ bugs.

Developer surveys are good for measuring things like developer perceptions, emotional state, how widespread a technology is, etc. I don't think they're good at measuring the effectiveness of practices or determining if those opinions match reality. IE if a survey showed that 90% of developers think that their Java code is buggier than their Kotlin code, this tells us that developers perceive Kotlin as higher quality, not that it actually is!

Qualitative Studies

I like qualitative studies, things that aren't gathering hard metrics. If a quantitative study of correctness in imperative vs logic programming would be "which has fewer bugs" or "what kinds of bugs does each type have", qualitative would be more like "how do imperative and logic programmers debug stuff differently" or "let's dissect this one unusual bug and see if we learn anything interesting."

IME science fields undervalue the impact of qualitative research. I find a lot less of it than quantitative papers, but I tend to trust the qualitative papers more. Maybe they're just overall higher quality; more likely it's because they're more narrowly scoped. And I believe (without evidence) that a field needs a strong foundation of qualitative knowledge to do stellar quantitative work.

Some papers do both qualitative and quantitative work. The researchers in Abbreviated vs. Full-word Identifier Names started with a clinical trial, and then shadowed individual developers to see how they navigated a code file. I trust the quantitative sections of hybrid papers a little more than just-quantitative papers, but it's not an enormous factor.

"Quantitative" and "qualitative" aren't words anymore.

Reading the Paper

I use the same basic approach with all CS papers: read the abstract and intro, read the conclusion and post-results discussion, then decide if I care enough to look further. If I like the results, I read the methodology more carefully. I skim background info and "related work" mostly to see if they make any claims I'd want to read up on.

Sometimes I'll use an AI to generate a paper summary. I only do this if I'm going to read the full paper; the summary is just to help me digest it.

I trust papers that share their data.

Threats to validity

Most software papers have a "threats to validity" section, where they talk about possible confounding factors. In practice it's mostly a way for the writers to show they covered their bases, by explaining why said confounding factors aren't a problem. I like papers which include TtVs they can't dismiss, or discussing unusual-but-important TtVs. I don't like papers which have bad explanations for why TtVs aren't a problem, or that miss really obvious TtVs.

Example of doing it wrong: Quantifying Detectable Bugs in JavaScript claims that typescript catches 15% of "public bugs for public projects on GitHub". Their threats to validity were "we only looked at public bugs" and "we may have been too inexperienced in typescript to catch even more bugs." One TtV they do not list is that they uniformly sampled all of GitHub for bugs and didn't filter the projects at all. When I checked some of their projects, a bunch were simple lesson repos by novices in code bootcamps.2 Why didn't they mention this TtV?

Example of doing it right: oh God it's already Thursday and I haven't gotten this newsletter out yet

References

For many papers, the most useful part is the reference section, because odds are I'll find a more interesting paper there. Even if none of the references look interesting, I'll still check a few to make sure the original paper is using them correctly. I don't like it when a paper says reference R makes claim C about topic T, and then I read R and it never even mentions T.

I wish journals let papers have two reference sections, one for the really essential references and one for the incidental ones, like background information or related work.

...Okay, that's all I got this week. Back to the talk mines.


Things I'm curious about

  • DORA Report: What's the state of research about the DORA report? Has anybody been able to replicate their findings? Has anyone evaluated them critically?
  • Puzzle Games: Puzzle games like flow free have thousands of levels. Are these programmatically generated? If so, how is it done, and how do they control the difficulty?

  1. Measured as "number of tests passed", from a researcher-provided test suite for the task. 

  2. I checked this back in 2019 or 2020, but it seems like the data is now offline. Props for making the data available in the first place! 

Paradigms succeed when you can strip them for parts

7 May 2024 at 16:48

I'm speaking at DDD Europe about Empirical Software Engineering!1 I have complicated thoughts about ESE and foolishly decided to update my talk to cover studies on DDD, so I'm going to be spending a lot of time doing research. Newsletters for the next few weeks may be light.

The other day I was catching myself up on the recent ABC conjecture drama (if you know you know) and got reminded of this old comment:

The methods in the papers in question can be used relatively quickly to obtain new non-trivial results of interest (or even a new proof of an existing non-trivial result) in an existing field. In the case of Perelman’s work, already by the fifth page of the first paper Perelman had a novel interpretation of Ricci flow as a gradient flow which looked very promising, and by the seventh page he had used this interpretation to establish a “no breathers” theorem for the Ricci flow that, while being far short of what was needed to finish off the Poincare conjecture, was already a new and interesting result, and I think was one of the reasons why experts in the field were immediately convinced that there was lots of good stuff in these papers. — Terence Tao

"Perelman's work" was proving the Poincaré conjecture. His sketch was 68 pages of extremely dense math. Tao's heuristic told him it was worth working through is that it had interesting ideas that were useful to other mathematicians, even if the final paper was potentially wrong.

I use this idea a lot in thinking about software: the most interesting paradigms are ones you can "scavenge" ideas from. Even if the whole paradigm isn't useful to you, you can pull out some isolated ideas, throw away the rest, and still benefit from it. These paradigms are the ones that are most likely to spread, as opposed to paradigms that require total buyin.

Let's take as an example functional programming (FP). ML/Haskell-style FP has a lot of interconnected ideas: higher-order functions, pattern matching, algebraic data types, immutable data structures, etc. But if you throw all of that away and stick to writing everything in idiomatic Java/Python/C, and the only change you make is writing more pure functions, then you've already won.
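As a minimal sketch of what "just write more pure functions" buys you, in ordinary Python and with made-up names:

# Python
# Before: computation and I/O are tangled together, so testing means touching files.
def report_average():
    scores = [float(line) for line in open("scores.txt")]
    print(sum(scores) / len(scores))

# After: the computation is a pure function you can test or reuse anywhere;
# only a thin wrapper touches the filesystem.
def average(scores):
    return sum(scores) / len(scores)

def report_average_v2(path="scores.txt"):
    print(average([float(line) for line in open(path)]))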

This has several positive effects on FP adoption. It immediately demonstrates that FP has good ideas and is not just some weird cult practice. It also means people can "dip their toes in" and benefit without having to go all in. Then they can judge if they're satisfied or want to go further into learning FP. And if they do go further, they can learn new ideas gradually in stages.

By contrast, compare array programming languages (APLs). There's a lot of cool ideas in APL, but they're so interconnected it's hard to scavenge anything out. If I want to use multidimensional arrays in Python I can't just whip it up myself, I have to import an external library like numpy and learn a whole bunch of new ideas at once. So most of the ideas in APL stay in APLs.3

A related idea is that techniques are only adopted if you can "half-ass" them. TDD is useful even if you only use it sometimes and phone in the "green-refactor" steps, which helped adoption. Cleanroom depends on everybody doing it properly all the time, which hindered adoption.

Scavenging from formal methods

Part of the reason I'm thinking about this is that business has been real slow lately and I'm thinking about how I can make formal methods (FM) more popular. What ideas can we scavenge from writing specifications?

Obvious ones are predicate logic and the implication operator, but they're both poorly supported by existing programming languages, and I'm not sure how to best pitch them.2 Also, they're "too big", and it takes a lot of practice to wrap your mind around using quantifiers and implication.

Sets, definitely. Sets are unordered, unique collections of objects. Lots of things we store as lists could be sets instead. I use sets way more after learning some FM.
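A tiny Python example of the "that list was secretly a set" idea (the tags are made up):

# Python
tags = {"tlaplus", "testing", "alloy"}   # order is meaningless, duplicates are meaningless

tags.add("testing")                      # a no-op instead of a sneaky duplicate
print("alloy" in tags)                   # membership is what we actually care about
print(tags & {"testing", "pbt"})         # set intersection reads like the intent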

There's also a lot of high-level ideas in FM that usefully apply to all other code. System invariants are things that should always hold in every possible state. More broadly, we can categorize properties: "safety" is roughly "bad things don't happen", "reachability" is "good things always could happen", and "liveness" is "good things always do happen". Fairness, maybe. Refinement is definitely too in-the-weeds to be scavengable.

I'd also call out decision tables as a particular technique that's extractable. You can teach decision tables in five minutes to a completely nontechnical person.
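Decision tables also scavenge nicely into plain code. Here's a rough sketch (the conditions and outcomes are invented): the table is just data, and you can even assert it's exhaustive.

# Python
from itertools import product

# Conditions: (item in stock?, order paid?)
table = {
    (True,  True):  "ship",
    (True,  False): "await payment",
    (False, True):  "backorder",
    (False, False): "cancel",
}

# Exhaustiveness check: every combination of conditions has exactly one row.
assert set(table) == set(product([True, False], repeat=2))

def handle_order(in_stock, paid):
    return table[(in_stock, paid)]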

As I said, short newsletters for the rest of this month. Have a great week!


Things I'm curious about

  • Logic Programming: What ideas are scavengable from LP? I know that pattern matching originated in LP and found its way to FP from there, but that's associated with FP more now.
  • Nontechnical People: What's a good word for them? "Nondev" seems like it leaves out data scientists, DBAs, etc. "Normie" and "Muggle" are too offensive. "Layfolk" just sounds weird.
  • TLA+ workshops: Would people be interested in an "intermediate" TLA+ workshop, for people who've already written production specs but want to get better? I'm thinking half a day, 6-8 people, with people bringing up what they want to learn about in advance. If there's enough people who'd want this I can plan something out.

  1. Domain-Driven Design. 

  2. This is one big reason my attempt to write a book on predicate logic has been in a long stall. 

  3. One idea that mathematicians scavenged from APLs is the Iverson bracket: [P] = P ? 1 : 0

"Integration tests" are just vibes

1 May 2024 at 15:52

New blog post! Software Friction is about how all the small issues we run into developing software cause us to miss deadlines, and how we can address some of them. Patreon here.

"Integration tests" are just vibes

You should write more unit tests than integration tests. You should write more integration tests than unit tests. "Integrated tests" are a scam.

But what exactly is an integration test? Is it tests that cross module boundaries? Is it tests that depend on several pieces of behavior, even in one module? Is it anything that tests "emergence"1? Anything with side effects? If get_active_users() runs a database query, are all tests against it integration tests?

No wonder both Twitter and Google have (at one point) claimed that "integration test" wasn't a useful term. There are just too many different things people mean by "integration".2

Even if we can't define "integration test", we can still recognize them. There are tests that are bigger than our "unit tests" and smaller than an end-to-end test, and people have strong opinions about writing more or less of them. We can have a sense of something being an "integration test" even if we can't strictly define it.

I think it's more useful to think of integration tests in terms of family resemblances. There are certain measurable properties of tests, and integration tests tend to measure differently on those properties than unit tests do. One quality is test speed: unit tests are "fast", integration tests are "slow". This does not mean all integration tests are slower than any unit test (unit property tests can be very slow). But slow tests are more likely to be integration tests than unit tests.

"Integration tests" are a vibe we get based on how we perceive a test's properties. Some of these properties are:

  1. The number of lines of production code executed as part of the test.

  2. Number of boundaries crossed, module/object or otherwise.

  3. Number of unit tests you can decompose the test into.

  4. Number of external services required to run the test.

  5. Amount of engineering effort to write/run the tests. Does it need separate files, a separate language, setting up lots of docker instances?

  6. Nonlocality of discovered bugs: if the test fails, how many lines you have to look at to find the problem.

  7. Likelihood a test failure is due to "emergence": things are correct in isolation but interact in a problematic way.

The more of these have high values, the more likely a test is to be an integration test.

Miscellaneous Thoughts

A test that hits the database (external service) can still be a unit test if it runs quickly and covers a small amount of code. Tests that make HTTP calls are less likely to be fast, so more likely to be integration tests.
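For instance, something like this touches a real database but still feels like a unit test to me: it uses an in-memory SQLite connection, covers a handful of lines, and runs in milliseconds. (get_active_users is a stand-in, echoing the earlier hypothetical.)

# Python
import sqlite3

def get_active_users(conn):
    return [name for (name,) in
            conn.execute("SELECT name FROM users WHERE active = 1")]

def test_only_active_users_returned():
    conn = sqlite3.connect(":memory:")   # real SQL, no external service
    conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("ayla", 1), ("brin", 0)])
    assert get_active_users(conn) == ["ayla"]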

Mocks and pact-style testing push some of these values down, by reducing boundaries crossed and lines of code executed at the cost of increased test infrastructure. Contracts and assertions increase bug locality, since the test fails the moment a pre/postcondition is broken instead of when the test completes.

If you push the values high enough, do you get end-to-end tests instead of integration tests? Or are E2E tests fundamentally different from integration tests in some way?

Is it possible for a unit test to uncover emergence, or is emergence a sufficient condition to say that something is definitely an integration test?


Things I'm curious about

Gonna make this a weekly thing maybe (or whenever I have enough new stuff to ask about).

  • Sokoban. Sokoban is "easier" than Chess or Go (PSPACE vs EXPTIME), but Sokoban AIs regularly fail at simple puzzles that humans can do (ref ex1 ex2). What makes Sokoban "so much harder" for AIs? I don't think it's just a lack of interest because I've found Google and Deepmind papers on solving sokoban (and not making much progress).
  • NFCs: I impulse-bought a bunch of NFC chips. I'm already using them to open train trackers, what else can I do with them? I'm not a mobile dev but can whip up Taskers and server code pretty quickly.

If you know something about either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Geeze I can't believe I tried to make "automanual" a thing 

  2. "Unit test" is also fuzzy but we at least agree that add(2, 3) == 5 is a unit test. 

"Testing can show the presence of bugs but not the absence"

23 April 2024 at 20:19

Program testing can be used to show the presence of bugs, but never to show their absence! — Edsger Dijkstra, Notes on Structured Programming

Dijkstra was famous for his spicy quips; he'd feel right at home on tech social media. He said things he knew weren't absolutely true but would get people listening to him. In the same essay he discusses how testing could, in specific cases, show the absence of bugs:

  • If a function has a finite set of inputs, like floor(float), you could prove correctness by simply testing all possible inputs (see the sketch after this list).
  • Even if a function has infinite inputs, tests prove the absence of bugs for tested inputs. You could say "date_of_easter works for any date within 10,000 years of today" and have "proven the absence of bugs" for the subset of dates people care about.1
  • You can use testing to complete a proof. For example, you can divide inputs into "equivalence classes", prove that if the function is correct for one value it will also be correct for all values in that class, and then test one value from each class.
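Here's a sketch of the finite-input case from the first bullet, with a toy function standing in for floor: exhaustively checking every input really is a proof, for this function and this property.

# Python
def clamp_to_byte(x):
    return min(x, 255)

# The whole input space is 16-bit unsigned ints, so just test all of it.
for x in range(2**16):
    assert clamp_to_byte(x) == (x if x < 256 else 255)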

In the majority case, however, testing cannot show the absence of bugs. Testing f(16) does not give you any guarantees about all x. f(x).

There are two possible takeaways. The reasonable one is that we shouldn't be blindly confident in testing, should use a variety of approaches as part of defense in depth, and always be careful about what we claim. The unreasonable takeaway is that testing is an inferior verification method, and we should use more "rigorous" approaches instead of testing.

It's unreasonable because methods that show the existence of bugs are more robust than methods that show the absence of them, and this makes them more useful more of the time.

Comparing tests and proof

Let's say you have a pseudocode function like this:

fun elem_before_x(s: int[], x: int) returns (r: int | NULL) {
    let i: int = index(s, x);
    if (i == NULL) {
        return NULL;
    } else return s[i-1];
}

Property:
 if r == NULL then x notin s
 else some i in 0..<len(s): s[i] == r && s[i+1] == x

The bug here is that we're out-of-bounds if x is the first element of the array. Testing will find this bug if you remember to test that case. Otherwise, we'd have a bug and a passing test suite. Tests can't verify the absence of bugs.
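Here's a rough Python rendering of the function and a test suite that happens to miss that case. (Python makes it sneakier: s[i-1] with i == 0 doesn't crash, it silently wraps around to the last element.)

# Python
def elem_before_x(s, x):
    if x not in s:
        return None
    i = s.index(x)
    return s[i - 1]   # bug: when x is first, this wraps around to s[-1]

# A green test suite that never exercises "x is the first element":
assert elem_before_x([1, 2, 3], 2) == 1
assert elem_before_x([1, 2, 3], 3) == 2
assert elem_before_x([1, 2, 3], 9) is None
# elem_before_x([1, 2, 3], 1) returns 3, violating the property, but no test says so.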

Now imagine we do find it and fix it:

else if (i == 0) {
        return s.last;
    } 

Modifying the property is left as an exercise to the reader. Once again, the passing test suite means we haven't verified the absence of the bug. Now, how does formal verification do on the same problem? I wrote a Dafny version of the same problem, here, both with and without the fix. Looking at v1, we see it finds the s[0] error:

>   var v := index.Extract();
>   return Some(s[v-1]);

(19,16): Error: index out of range    
   |
19 |     return Some(s[v-1]);

This error also goes away in v2: the bug is proved to be absent. But v2 adds a new error:

> ensures if r.None? then x !in s
>           else if s[|s|-1] == r.value then s[0] == x
>           else s[IndexOf(s, r.value)+1] == x

(7,20): Error: index out of range
  |
7 |             else if s[|s|-1] == r.value then s[0] == x
  |                      ^^^^^^^

When would s[|s|-1] be an index out of range? When the list is empty. But we know that's not possible: if we return a non-null value then the list must not be empty, because we found x in it! But the prover doesn't know that, it can't infer the relationship between the returned value and what we know about the list.2 We have to annotate our code with proof hints for it to recognize that if we return a number, the list is nonempty.

In short: the prover definitely finds the out-of-bounds bug, but also raises a spurious "empty list" bug. The test might not find the former but also won't raise the latter. Tests can have false negatives but not false positives, while provers can have false positives but not false negatives.

(I'd say "formal verification can prove the absence of a bug but never the presence", but that's not true. Dafny can sometimes give you a concrete counterexample to a property, and type systems detect true positives like "you passed an array into an Int -> Int function" all the time.)

Knowing there "might" be a bug is not as useful as knowing there is one. Empirically speaking people mostly ignore compiler warnings and plenty of retrospectives find people ignore "might possibly be a bug" in even mission-critical software. Formal verification often becomes an "all or nothing" deal, either you totally prove the absence of bugs in programs or you get no benefits at all.

And that sucks! It takes a lot of effort to prove something correct, magnitudes more than running some tests. And that's why I consider testing "more robust": even a little bit of testing may find a true bug, while a little bit of verification doesn't get you much.

Correctness is not a thing

I'll end this by getting on my soapbox. Here's something that's really important to understand about "proving the absence of bugs":

You do not prove something's correct. You prove conformance to specification.

It's possible for code to conform to the specification and still have bugs. Two common cases are:

  1. The spec is wrong or incomplete, like v2.dfy is in the above example. Spoiler in the footnote.3
  2. The spec is based on an assumption that's not true in practice, like "no other program is writing to the same file."

This is also important because it puts the specification front-and-center. You can write tests without having a specification in mind, but you need a spec to do any kind of formal method. Doing FM trains you to think about specs, which in turn helps you write better tests.


Things I am interested in

  1. Crossovers from fields besides classical engineering, who worked professionally as something else before switching to software. What ideas from your old job as a journalist/librarian/social worker/acrobat could be useful in software, and vice versa?
  2. Wargaming in software. Have you designed or participated in a devops/security/system design/whatever simulated game? Was it helpful?

If you have insights into either of these, feel free to share! Or write about them publicly and share the link with me. You can just reply directly to this email.


  1. Though you could argue this isn't sufficient for security. If easter(-1e8) causes a memory overflow then that's a bug even if a normal user won't ever call it. 

  2. We're lucky that the Dafny standard library defines a postcondition for index functions saying the output is less than the length of the list. Otherwise the prover would raise a potential out-of-bounds error on s[IndexOf(s, x)]

  3. s[|s|-1] == r.value then s[0] == x isn't true for the input [a, r, x, r]. The good news is that Dafny refuses to verify the code because it didn't match the spec. I only realized this when I was error-checking this piece. (This doesn't refute the point I made above about false positives: the false positive was that s[|s|-1] was out of bounds!) 

What makes concurrency so hard?

16 April 2024 at 20:28

A lot of my formal specification projects involve concurrent or distributed systems. That's in the sweet spot of "difficult to get right" and "severe costs to getting it wrong" that leads to people spending time and money on writing specifications. Given its relevance to my job, I spend an awful lot of time thinking about the nature of concurrency.

As the old joke goes, concurrency is one of the two hardest things in computer science. There are lots of "accidental" reasons why: it's hard to test, it's not composable, bugs can stay latent for a long time, etc. Is there anything that makes it essentially hard? Something that makes concurrent software, by its very nature, more difficult to write than synchronous software?

The reason I hear most often is that humans think linearly, not concurrently, so are ill-equipped to reason about race conditions. I disagree: in my experience, humans are very good at concurrent reasoning. We do concurrent reasoning every time we drive a car!

More generally, some studies find that if you frame concurrent systems in human terms ("meatspace modeling"), people get quite good at finding the race conditions. So while concurrency might be difficult to reason about, I don't think it's because of a fault in our brains.

In my opinion, a better basis is state space explosion. Concurrency is hard because concurrent systems can be in a lot of different possible states, and the number of states grows much faster than anyone is prepared for.

Behaviors, Interleavings, and Nondeterminism

Take agents1 {1, 2, … n}, which each executes a linear sequence of steps. Different agents may have different algorithms. Think writing or reading from a queue, incrementing a counter, anything like that. The first agent takes p1 atomic steps to complete, the second p2, etc. Agents complete their programs strictly linearly, but another agent can interleave after every atomic step. If we have algorithms A1A2 and B1B2, they could execute as A1B1A2B2 or A1B1B2A2, but not A1B2B1A2.

Under those conditions, here's an equation for how many possible orderings of execution (behaviors) can happen:

(p1+p2+...)!
------------
p1!*p2!*...

If we have three agents all executing the same 2-step algorithm, that's 6!/8 = 90 distinct behaviors. Each step in the sequence can potentially lead to a distinct state, so there's 6*90=540 maximum distinct states (MDS). Any one of those can potentially be the "buggy" one.

In most cases, the actual state space will be significantly smaller than the MDS, as different behaviors will cross the same states. Counterbalancing this is just how fast this formula grows. Three 3-step agents give us 1680 possible behaviors (~15K MDS), four 2-step agents instead have 2520 (~20K MDS). And this is all without any kind of nondeterminism! If one step in one of the agents can nondeterministically do one of three things (send message M₁, send message M₂, crash), that triples the number of behaviors and the MDS.
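If you want to play with the numbers yourself, the behavior count is just a multinomial coefficient:

# Python
from math import factorial, prod

def behaviors(steps):
    """Interleavings of agents that take `steps[i]` atomic steps each."""
    return factorial(sum(steps)) // prod(factorial(p) for p in steps)

print(behaviors([2, 2, 2]))     # 90
print(behaviors([3, 3, 3]))     # 1680
print(behaviors([2, 2, 2, 2]))  # 2520
print(behaviors([4, 4, 4]))     # 34650, the "35k" mentioned below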

It's pretty common for complex concurrent systems to have millions or tens of millions of states. Even the toy model I use in this article on model optimization has about 8 million distinct states (though with some work you can get it much lower). I mostly think about state spaces in terms of performance because large state spaces take a lot longer to model-check. But it's also why concurrency is essentially hard to reason about. If my theoretical MDS is two million states, my practical state space is just 1% that size, and my human brain can reason through 99.9% of the remaining states… that still leaves 20 edge cases I missed.

Shrinking the State Space

Here's a heuristic I use a lot:

All means of making concurrency 'easier' are concerned first and foremost with managing the state space.

It's not 100% true (no heuristic is) but it's like 60% true, and that's good enough. State spaces grow quickly and bigger state spaces cause problems. If we want to make maintainable concurrent systems, we need to start by shrinking the space.

Like look at threads. Threads share memory, and the thread scheduler has a lot of freedom to suspend threads. So you have lots of steps (potentially one per line of code)2 and any interleaving can lead to a distinct state. I can use programming constructs like mutexes and barriers to "prune" the state space and give me the behaviors I want, but given how big the state space can be, I have to do a lot of pruning to get the right behaviors. I can make mistakes in implementation, "misshape" the space (like by adding a deadlock), or not notice a buggy state I need to remove. Threads are very error prone.

I could instead switch to memory-isolated processes. The scheduler can still schedule whatever interleavings it wants but the processes can't muck with each other's internal memory. Internally a process is still executing A1A2A3A4, but if the only steps that involve external resources (or interprocess communication) are A2 and A4, then we only need to think about how A2A4 affects the state space. Three four-step agents have 35k interleavings, three two-step agents have only 90. That's a big improvement!

What else can we do? Low-level atomic instructions do more in a single step, so there's no room for interleaving. Database transactions take a lot of physical time but represent only one logical time step. Data mutations create new steps, which immutable data structures avoid by definition.

Languages have constructs to better prune the resulting state space: go's channels, promises/futures/async-await, nurseries, etc. I think you can also treat promises as a way of forcing "noninterleaving": wait until a future is ready, then execute it in full (or up to the next yield point) without anything else interleaving. Please don't quote me on this.

I think CRDTs reduce state space by permitting interleavings, but arranging things so that external changes are commutative: A1B1 and B1A1 lead to the same final result, so they don't lead to distinct states.

Again, this is all a very rough heuristic.

Limits

To a first-order approximation, smaller state space == good. But this doesn't account for the topology of the space: some spaces are gnarlier than others. One that has lots of different cycles will be harder to work with than one that's acyclic.3 Different forms of nondeterminism also matter: "the agent continues or restarts" leads to more complex behavior than "the writer sends message M₁ or M₂." These are both places where "humans are bad at reasoning about concurrency" appears again. We often work through concurrency by reducing groups of states to "state equivalence classes" that all have the same properties. Complicated state spaces have more equivalence classes to work through.

(Another way "humans are bad at reasoning about concurrency" can be a real thing: we might not notice that something is nonatomic.)

Some high-level paradigms can lead to particular state space topologies that have fewer interleavings or ones that have more equivalent states. I've heard people claim that fork-join and pipe-filter are especially easy for humans to reason about, which I take to mean "doesn't lead to a gnarly state space". Maybe also event-loops? Where does the actor model fit into all of this?

Another limit: there's a difference between buggy states and buggy behaviors. Some behaviors can go entirely through safe states but still cause a bug like "never reaches consistency". This is called a "liveness bug", and I talk more about them here. Liveness bugs are much harder to reason about.

Okay, that's enough philosophical rambling about concurrency. Concurrency is hard, don't feel bad if you struggle with it, it's not you, it's combinatorics.


Video Appearance

I was on David Giard's Technology and Friends talking about TLA+. Check it out here!


  1. An agent is any concurrent actor, like a thread, a process, a future, a human, etc. Concurrency literature uses "processes" a lot but that clashes with OS processes. 

  2. Or more! If writes are nonatomic, you can get data races, where two threads write to different bits in a byte at the same time. Then a variable assignment can be multiple steps in one line. 

  3. Is there any research correlating unusual subgraphs in state spaces with probable bugs? 

Some notes on for loops

10 April 2024 at 17:08

New Blogpost

Don't let Alloy facts make your specs a fiction, about formal methods practices (and a lot of Alloy). Patreon link here.

Some notes on for loops

I sometimes like to sharpen the axe by looking at a basic programming concept and seeing what I can pull out. In this case, for loops.

There are two basic forms of for loops:

for(i = 0; i < 10; i++)

for val in ["a", "b", "c"]:

The second is usually called a foreach loop, and I'll call the former "forstep", after the ALGOL-60 construct. Almost all languages have one of the two, some have both. While forstep loops show up early, foreach loops only appear in PLs ~20 years later.

Looking at this, here's some things I see:

Construct reduction

Given a foreach loop, I can syntactically translate it into a forstep loop:

for val in arr:
    # ...

-->

for(i = 0; i < arr.len; i++) {
    val = arr[i];
    // ...
}

There's a lot of terms we use to express this property. We can say that foreach is homomorphic or reducible to forstep, or that foreach can be constructed out of forstep, or others I can't think of rn.

If we could also construct forstep from foreach, we could say the two are equivalent or isomorphic. In theory we could do this if we had a range operator:

for(i = 0; i < 10; i += 2) 

--> 

for i in range(0, 10, step=2);

Most forstep implementations, though, can do things like modify the iterator in the middle of the loop:

for(i = 0; i < 10; i++) {
    // stuff
    if(cond) i = 0;
    // stuff
}

You can't convert this to a foreach loop: it's an "irreducible" feature.

Most people consider modifying the index bad practice, though. 99% of the use cases of forstep loops are constructible with foreach loops, so the two are morally equivalent. I'll treat them as roughly interchangeable going forward.

Primitives and Expressivity

For loops are reducible to while loops:

for(i = 0; i < 10; i += 1) 

-->

i = 0;
while i < 10 {
  // ...
  i++;
}

But while loops are not reducible to for loops:

while(user_cmd != "quit")

-->

???

Again, lots of terms to cover this: while loops are more "primitive", more "powerful", more "low-level", more "expressive", etc. If you have while loops, you don't need for loops, and if you have goto, you don't need either.

But you shouldn't use a goto when a while loop works or a while loop where a for loop works. This is a generalization of the rule of least power (RLP): given a problem, use the tool that is most specific to the problem. If you can iterate through a list with a foreach loop, don't use a while instead!

There's two reasons for this. One is that it's more communicative to other programmers. If they see a foreach loop, they know it's going to be used for some kind of iteration, while that's not clear with the while loop. Two is the capability/tractability tradeoff: the more limited a construct is, the easier it is to understand and work with. A foreach loop can do less than a while loop, but it's much easier to guarantee that it processes each item only once.

Programming languages evolve to replace the all-powerful "low-level" constructs with more limited constructs that reduce to the low-level ones. It's why you get things like interfaces over inheritance.

The corollary of this is that if you use a powerful construct, it should be because the limited one can't do what you want. Don't force an infinite for-loop!

Extensions to for loops

For loops reduce to while loops, while loops reduce to goto, everything reduces to goto. So there are some kinds of looping goto can do that for and while cannot.

They get a little closer to feature-parity by adding flow-control inside the loop: continue for advancing the loop early, break for terminating it. I didn't look into when or how these were added, but I'd guess they come from looking at what goto can do that loops could not, and adding features as needed.

"Imperative programming" is just for loops

I think one of the most basic functions of a programming language is to iterate through an array. So much so that you can distinguish different programming paradigms by how they iterate.

  • Imperative programming: for loops
  • OOP: a foreach method on iterable objects
  • Functional programming: recursion, reduce (we'll talk about this later)
  • Logic programming: I'm less familiar with this, but I think it's recursion + pattern matching (which ML and Haskell adopted)
  • Array programming: all operations take arrays by default.

I think this explains why arrays are universal while dictionaries are relatively new: there's no obvious basic operation that covers what you do with dicts to the degree that for covers what arrays do.

(Question: do list comprehensions lead to a new paradigm? How do you describe "iteration" in spreadsheet languages?)

Functional replacements

When people talk about replacements for foreach loops in FP, they usually bring up map and filter. But these are not equivalent. Rather map and filter are "preferred" to loops because they are more limited. It's another instance of RLP.

As far as I can tell, reduce is equivalent to foreach. Morally, anyway. This is interesting because reduce is considered part of the "big three" (along with map and filter), but it's also the one that's the "odd one out". It's the hardest for people to understand, and Haskell even has three versions of it. Is it confusing because it's so powerful?
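Here's the sense in which I mean "equivalent", sketched in Python: the reduce's accumulator carries whatever state the foreach body was threading through the loop.

# Python
from functools import reduce

arr = [3, 1, 4, 1, 5]

# foreach: the body updates two pieces of state
total, biggest = 0, None
for x in arr:
    total += x
    if biggest is None or x > biggest:
        biggest = x

# the same loop as a reduce: the accumulator is the loop state, the function is the body
def body(state, x):
    t, b = state
    return (t + x, x if b is None or x > b else b)

assert reduce(body, arr, (0, None)) == (total, biggest)   # (14, 5)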

Map and filter are composable in a way that foreach loops are not, but they're also weirdly less "composable" in one very specific way? Consider something of the form

map f (filter p arr))

-->

out = []
for x in arr:
  if p(x):
    out.append(f(x))

The map/filter iterates over the list twice, while the foreach loop iterates only once. I know Haskell can fuse maps together as an optimization but I don't think you can safely fuse arbitrary map/filters? I dunno.

You can fuse map and filter with a reduce:

reduce ((out, x) => 
        if p(x) 
        then out ++ f(x)
        else out
       ), [], arr

This gets really awkward though.

I wonder if "reduce is equivalent to for loops" plus "reduce is awkward for fusion" somehow inspired clojure transducers.

I'll be writing this forever

Things I could have looked at in more detail but ran out of time and energy:

  • "Iterators" as a general concept.
  • Loop invariants and variants. This is really relevant to formal methods in particular.
  • Specific implementations of for loops in languages. ALGOL's loop expression-lists, Python's for else, Raku's phasers, javascript's for in/for of distinction, Ada and Chapel loop parallelization constructs... If I keep going I'll get something as long as this.

Following any of those leads would get us to different insights, as opposed to the main thing that ended up catching my interest with these notes, which is how the rule of least power applies to language constructs. I should probably write a blog post about it!

Anyway I think this kind of exercise, looking at something everybody takes for granted and trying to learn things from it, is really valuable for understanding software better. I don't know how interesting it is to read about, though. Lemme know if you think you'd like to see this for other stuff, or just share the essays that are downstream of the practice (like writing about RLP).


Podcast appearance: are we engineers?

I was on a podcast, talking about the crossover project and what software engineers can learn from traditional engineering. Check it out here!

Why do regexes use `$` and `^` as line anchors?

25 March 2024 at 16:40

Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you've got a blog come join us! You don't need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.

But I am writing a 3000-word hell essay, so I'll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use $ and ^ as line anchors?1

This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:

b) "^" is a regular expression which matches character at the beginning of a line.

c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)

QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.

Okay, but then why did Ken Thompson choose those characters?

I'll sideline ^ for now and focus on $. The original QED editor didn't have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:

Two minor devices offer additional convenience. The character "." refers to the current line and the character "$" to the last line in the buffer.

So $ already meant "the end of the buffer", and Ken adapted it to mean "the end of the line" in regexes.

Okay, but then why did Deutsch and Lampson use $ for "end of buffer"?

Things get tenuous

The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references2) that the SDS-930 used a Teletype Model 35 as input devices. The only information I can find about the model 35 is this sales brochure, which has a blurry picture of the keyboard:

Blurry picture of a keyboard

I squinted at it really hard and saw that it's missing the []{}\|^_@~ symbols. Of the remaining symbols, $ is by far the most "useless": up until programming it exclusively meant "dollars", whereas even something like # meant three different things. But also, $ is so important in business that every typewriter has one. So it's a natural pick as the "spare symbol".

Yes this is really tenuous and I'm not happy with it, but it's the best answer I got.

If we're willing to stick with it, we can also use it to explain why Ken chose ^ to mean "beginning of line". ^ isn't used in American English, and the only reason QED wasn't using it was because it wasn't on the Teletype Model 35. But Ken's keyboard did have ^, even when it wasn't standardized at the time, so he was able to use it.

(Why did it have ^? My best guess is that it's because ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says "it follows ASCII", but the machine didn't include many of the symbols, just used the encoding format.)

So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!


  1. I had this question as part of a larger rabbit hole on the history of regular expressions, which might become a full blog post at some point in May. 

  2. I have a pathological hatred of secondary sources. 

What if the spec doesn't match the code?

19 March 2024 at 16:23

Whenever I talk about formal methods, I get the same question:

Can I use the spec to generate my code?

People are worried about two things. One, that they'll make a mistake implementing the specification and have bugs. Two, that over time the implementation will "drift" and people won't update the spec, leading to unexpected design issues.

These are valid fears, and both are things that happen when writing specifications! Which is why it extra sucks that my answer is usually "you can't". This newsletter will be a quick explanation as to why, plus why it's still a net win to write a spec, and an alternate trick you can do instead to keep things in sync.

Synchronization is too expensive

There's two ways we can keep code in sync with a spec. The first way is program extraction, where you take a spec and generate a complex program from it. Lots of theorem provers, like Coq and Agda, have this functionality. The second way is refinement, where you write code independently and then prove it somehow matches the spec. This is what the seL4 people did with their C code.

Extractable/refineable specs are much, much harder to write than an "unsynced" spec because you lose a level of abstraction. Specifications can exist at a higher level of abstraction than a program can, which is why they can be so much more concise and focused. For example, here's a quick PlusCal snippet1 for a reader pulling a value off a queue and adding it to that reader's total:

await Nonempty(queue);
with r \in Readers do
     total[r] := total[r] + Head(queue);
     queue := Tail(queue);
end with;

What would that look like in a program? You'd probably have code for setting up the readers, code for reading from the queue, code for persisting the value of total somewhere. If it's in another system like Postgres or redis, you'd have a client library for that, too. And around all of that you'd have logging, heartbeats, event hooks, a whole bunch of orthogonal concerns.

You might not even want to implement the whole spec! This is the machine vs world dichotomy: the spec can encode both the system you're designing and also the environment around that system. Let's say you're using AWS SQS for the queue, so there's a chance you can't connect to it, and a chance it duplicates a message. You can't implement either of those things in your system, only the code that detects and reacts to them. But adding those are par for the course in specification. It might look like this:

await Nonempty(queue);
with r \in Readers do
    either
        \* unreachable!
        status[r] := "error";
    or
        \* duplicate!
        total[r] := total[r] + Head(queue);
    or
        \* happy path
        total[r] := total[r] + Head(queue);
        queue := Tail(queue);
    end either;
end with;

To synchronize spec and code, you need to bring the spec closer to the level of code, which means losing a lot of abstraction power. It's definitely possible, but considerably more expensive. Synchronization is not affordable for most companies.

But specs are still worth it

Even if you can't keep the spec in sync, it's still a net benefit. Consider what happens if you have a bug in production:

  • With a spec: either you didn't implement the design correctly, or an assumption in the design ("the queue is perfectly reliable") was invalid
  • Without a spec: you didn't implement the design correctly, an assumption in the design was invalid, or the design itself has a flaw.

That last one is the most dangerous and the most expensive to fix — this talk estimated that one design flaw cost their team a year of rework. They're also really hard to correctly diagnose without a spec. I've often seen people think a design bug was actually an implementation bug. They'd "fix" the bug and it'd go away for just long enough for everybody to forget about it before suddenly resurfacing.

And then you gotta make sure that the new design isn't buggy, either.

That's why formal methods are a net win, even without syncing with code. With a verified design you can build systems faster and you can fix bugs faster, because you're confident that the bugs you got are "only" implementation errors.2

We're working on it

A confession: the "this is impossible" section was glazing over a lot of interesting work.

First of all, cheaply generating code from specs might not be viable in general, but it's possible for specific subsets of specification. This is mostly the realm of research projects like PGo, which converts (a variant of) PlusCal to Go code. You can see some examples here. I don't think it's going to be production ready for a long time, but at least it's being worked on.

A more exciting approach is generating tests from specs. I know a few companies who've done this, with the most public example being MongoDB. They discuss their approach in eXtreme modeling in practice:

We wrote a new TLA+ specification for this purpose and used the state space explored by the TLC model checker to generate C++ and Golang test cases for aspects of the OT algorithm. These test cases can be automatically generated and run after any change to the model, and ensure that both the C++ and Golang implementations make appropriate changes to maintain equivalence.

More specifically, they took a TLA+ spec and dumped the state graph, then decomposed it into a set of linear behaviors and converted those into about 5,000 tests. This works better than code generation because tests are, in a strange way, "more abstract" than code. Lots of implementations pass the same test suite. So it's less of a jump.
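
I haven't seen MongoDB's internal tooling, but the general shape is easy to sketch. Suppose each behavior is dumped as a JSON list of steps; a little Python can turn every behavior into a replay test against the implementation. Everything below (the SimpleQueue class, the step fields, behaviors.json) is a made-up stand-in for whatever their real setup looks like:

import json, unittest

class SimpleQueue:                      # stand-in for the implementation under test
    def __init__(self): self.items = []
    def push(self, x): self.items.append(x)
    def pop(self): return self.items.pop(0)

def make_test(behavior):
    """One test per model-checker behavior: replay the actions, check every state."""
    def test(self):
        q = SimpleQueue()
        for step in behavior:
            if step["action"] == "push":
                q.push(step["value"])
            elif step["action"] == "pop":
                self.assertEqual(q.pop(), step["value"])
            self.assertEqual(q.items, step["queue"])  # implementation state matches the spec's
    return test

class GeneratedTests(unittest.TestCase):
    pass

with open("behaviors.json") as f:       # hypothetical dump: one JSON behavior per line
    for i, line in enumerate(f):
        setattr(GeneratedTests, f"test_behavior_{i}", make_test(json.loads(line)))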

The TLA+ model checker has a bunch of features to make test generation easier: simulation mode lets you grab random traces and ALIAS can fine-tune the trace outputs. There's also an IO extension for calling subprocesses from model checking, which can be used to drive a test harness.

Two important caveats. One, the MongoDB team wrote a bunch of custom code to make this work; there's no off-the-shelf solution for turning specs into tests. Second, they were specifying a low-level, deterministic program, so were able to keep the TLA+ relatively close to the code. Writing a test harness for a distributed system is a lot harder. It's still a promising avenue!


Comments Section

I got a lot of responses to my graph types essay. I compiled some of the most interesting ones into a "Comments Section", here. There's some cool stuff about datalog and Rubik's Cubes there!


  1. PlusCal is a DSL that compiles to pure TLA+. I'm using it here because the equivalent TLA+ constructs are a little harder for outsiders to read. 

  2. One weak point in this is the "An assumption in the design was wrong" problem, but generally it's much easier to adapt a working design that fails under one broken assumption than an ad hoc design that might not have worked in the first place. 

What Mob Programming is Bad At

14 March 2024 at 14:45

Pairing is two people working together to write code, while mobbing is three or more. Pairing has been part of the programming milieu since at least the 90's (with extreme programming), while mobbing is more of a 10's thing. I'm going to use them interchangeably from here on out even though they're not interchangeable, because I think everything I'll say applies to both.

Proponents of pairing/mobbing attribute all sorts of benefits to it: higher quality code, more shared ownership, no more need for code review, even (in the most extreme claims) eliminating bugs entirely. Naysayers claim it slows team velocity down, is emotionally exhausting, and only works with very particular personalities. To which the advocates respond that if mobbing doesn't work, it's a problem with your company culture.

And things kind of go in circles from there, because you can't go anywhere from "it's a problem with the culture." I've made this same mistake in the past too! "Company culture" is the "wizard did it" of software techniques: you can explain the success or failure of anything with "culture". If you think the only possible problem with your $THING is "bad culture", then you're saying $THING doesn't have any problems, which means it's snake oil.1

Now overall I'm mildly pro-pairing. The last couple of places I worked at paired regularly, and when I consult with clients, I "mob specify" as much of the project as possible. This is because pairing is really, really good at knowledge sharing. It's a great way to get people up to speed with a new language, or domain, or even a section of a codebase. I even did a day of mobbing with Hunter Industries to understand it better.2 So while I think the loudest people online make overblown claims, I still mentally treat pairing/mobbing as "underleveraged techniques" along with stuff like formal methods and metamorphic testing.

So here's one thing I just cannot make work with pairing: optimization.

I am always better off optimizing code on my own than as part of a mob. I noticed this a long time ago, but I think I finally figured out how to explain the reasons why.

Bulldozers and lightning bolts

In practice [mobbing] usually feels like a bulldozer rather than a racecar - unstoppable and thorough. (source)

Bulldozers are "unstoppable" because they have high momentum, which is a function of both how heavy something is and how fast it's going. So if mobbing is like a bulldozer, then it's low-velocity, high-momentum. The mob writes less code than a bunch of independent developers but there's a lot of weight behind the code. It's more likely to be "good" code.

Something with a lot of momentum isn't just hard to stop, though. It's also hard to change direction. And in optimization I don't know the right direction. We can't easily predict how an attempted optimization will affect the overall program performance. If I see twelve ways of making something faster, six might have no effect, two could have negative effects, two could lead to minor improvements, and two could lead to order-of-magnitude improvements as long as you didn't do the minor optimizations. That's why benchmarking impact is so important!

The best way out would be to try all 4096 possible combinations of optimizations and then throw away 4095 of them. Lots of code written, lots of code thrown away. I want high-velocity, low-momentum. Instead of feeling like a bulldozer, my mind needs to feel like this:

Image of a lightning bolt

The path to fast code is a stepped leader, constantly changing direction, with dead ends branching off like streamers. Sometimes the dead ends can be just as long as the final path but still go nowhere. You write a lot of code, you write a lot of deep code, and then you throw it all away.

That start-stop motion doesn't work in group programming. It can if I know exactly what optimizations we need to make and it's just a matter of making them. That'd be something like fixing an N+1 query or replacing one algorithm with another. Then we have a direction and momentum is helpful.

Otherwise I much prefer to figure out optimizations on my own and then take my conclusions back to the pair to explain. Mobbing is good for knowledge sharing!

Generalizing this

Now you might argue "mobs throw away code all of the time." The difference is that it's "short-generation". They might throw away a few minutes of code but they're worse at throwing away a couple hours of code, which is sometimes necessary in optimization. The pattern I see is that mobbing is at its worst when code has

  1. A low chance of "surviving" to prod
  2. A long lifespan, even if it never survives

What are some other places like this? Exploratory programming. Some kinds of deep bugfixing and refactoring. Local refactors ("pull this into its own function/class") are easy and common in the mob. Complex refactors ("redraw the separation of concerns for these four modules") are harder. Anything where progress happens in sudden jumps and lurches rather than through a continuous motion.

(You can adapt some pairing practices to handle these better, but IMO that's sensible for people who already prefer working in a group. If you don't pair or only pair part-time, this is something that you should keep doing solo.)

If you disagree come fite me

I did some searching for articles on how to pair/mob on optimizations and couldn't find anything. So if you think that, no, mobbing actually is great for writing performant code, don't email me: put it up somewhere online and then email me.

Also, if you tried pairing or mobbing and it didn't work, you should also put that somewhere online. Right now all of the people who are writing publicly about it are evangelists.


  1. Etymology time! The term "snake oil" doesn't come from people selling snake oil as a miracle cure, but from people selling snake oil that was actually mineral oil. "Snake oil doesn't cure anything" comes way later than the idiom. 

  2. Hunter Industries is usually wheeled out in these discussions as the model mob programming culture. They're pretty cool people, I'd def manage .

How to argue for something without any scientific evidence

6 March 2024 at 21:27

Last week I got this interesting question:

I want to write a book about automated testing. Much of the book would be me explaining the best practices I’ve learned. I know these are good practices; I’ve seen them work over and over again. But have no [scientific] data at all to support my conclusions.

My question: lacking such data, how can I talk about something important like testing in an authoritative way? — Jake

This is near and dear to my heart. On one hand, I think empirical software engineering (ESE) is really important and have given several talks on it. On the other, I think ESE's often a trainwreck and find most of the research flawed. And my career is based on formal specification, which hasn't been studied at all.

Thing is, software still needs to get built regardless of how much we know about it. So how do you get people to believe you when there's no science backing you up?

I've written before that you need two things to be considered trustworthy:

  1. Trustworthy expertise: your judgement is correct.
  2. Trustworthy opinions: you aren't deceptive about your judgement.

Empirical evidence is a shortcut to both of these: we tend to think the scientific process is perfectly objective (it's not) and consider it unbiased (we shouldn't). At the very least, it's on average more objective and less biased than most other forms of gathering information. Without it, we need to work harder to earn people's trust.

Trustworthiness isn't the same as authority, but they're closely linked. You can't be an authority without being trustworthy.

ACOUP plug

One of my favorite blogs is A Collection of Unmitigated Pedantry, about the intersection of history and pop culture. I've written more about it here. I want to call out one section of this essay:1

Soldiers going into battle often look to their leaders to perform leadership. After all, the infantryman on the line has little sense if the battleplan is good or not ..., but he’s likely to be reassured if his leaders look like how he imagines good leadership to look. … seeing their king performing that role with excellence could be confidence building – “I may not understand the plan, but we have a good general and thus a good plan.”

Authority is performative. So if you want to talk about something like an authority, you need to perform being an authority. Do things that people expect an authority is able to do. I can't tell you the best way to perform authority, but I can tell you how I try to perform it, and you can judge if that'd work for you too.

How I try to perform authority

  1. Be up-front about my biases.
  2. Demonstrate deep familiarity with the field.
  3. Talk about the benefits and drawbacks.
  4. Talk about alternatives in a deep, honest way. Say what they do worse, but also what they do better, and in what circumstances.
  5. Discuss variants that are still $THING but not how I do $THING.
  6. Discuss common arguments against it: reject some as invalid, acknowledge some as valid, and explain why it's good despite the valid arguments.
  7. Argue why, on the whole, I still take the position I do.

Let's try applying these principles to a testing topic like Test-Driven Development. I've written about my complicated feelings before but not written something trying to convince people to use it.2 If I had to, this would be my first-draft outline:

  1. Define test-driven development, both the general principle (write tests before code) and the specific form (strict red-green-refactor cycles).
  2. Give a persuasive example of a case where TDD performs best: adding regression tests to fix a bug in code.
  3. A more involved example of "greenfield development" with TDD. This would probably look like writing two modules, one with TDD and one where I write all of the tests at the end, and compare the two. Show where the two approaches lead to the same tests and where they lead to different tests.
  4. Claimed merits of TDD: leads to better code, low coupling/high cohesion, "design pressure", etc. Mention which ones I believe, which ones I don't believe, which I think require more nuance, etc.
  5. Common arguments against TDD: leads to "TDD code", slows you down, plenty of great software is written without it, etc. Respond to these in ways that show I respect the people making these claims even if I disagree with them.
     1. Go deeper into the class of "types/integration tests/formal methods are better" arguments. Use this to talk about defense-in-depth.
  6. Alternatives: writing tests at the end, having a separate QA team, cleanroom, not testing at all, etc. Show cases where TDD is better than these and cases where it's worse. Argue that the former are more common.
     1. Talk about where I personally don't do TDD.
  7. Common variants still in the TDD umbrella: behavior-driven development, acceptance-test driven development, test && commit || revert
  8. Further reading. I would not recommend the original book because I haven't read it. That might undermine my expertise but it supports my honesty. If I really wanted to come across as an authority, I'd read that book first before writing my spiel.

(This looks like an awful lot of work, but that's mostly because TDD is such a huge and controversial topic, so I have to do a lot of work to perform authority. If I focused on a specific thing, like "mocks or not" or "should you test private methods", it'd all be a lot easier.)

The goal here is to "show my work" and demonstrate that 1) I'm well-versed in TDD, 2) I understand the flaws of TDD, 3) I genuinely value TDD anyway, 4) that my opinions are motivated by deeper engineering principles. Even if someone disagrees with my opinions, they still gain knowledge and don't leave thinking that I'm trying to hoodwink them.

There are other ways to perform authority

One thing I do a lot is showcase advanced knowledge on something that most practitioners will never need, but can appreciate the expertise behind it. Like, it's one thing to show I know the basics of TLA+, another thing to show I know about hyperproperties and refinement. I don't know what the equivalent for TDD would be, maybe testing statistical properties?

In writing this I notice that this is specifically about performing trustworthy expertise. What would performing trustworthy opinions look like?

If someone comes in demanding empirical evidence

First ask yourself: are you making the kinds of radical claims that require lots of scientific backing? Not "TDD/Formal methods is good", but "TDD is equivalent to hand-washing" or "if you don't use formal methods you're unprofessional." If so, then you're not performing authority well and people are picking up on that. Dial it down.

Second: try to find out if they genuinely care about empirical software engineering. Are they interrogating their own practices under the same lens? If not, then they're just using ESE as a cheap gotcha and wouldn't change their mind even if you have studies.

Otherwise, talk about the paucity of ESE knowledge in general, raise whatever few papers you do know about, say they're suggestive and limited and more research is needed, etc etc. Show that you've at least thought about the scientific side of things.


Consulting stuff

My current gigs are winding down so I'm looking for new consulting work in formal methods (TLA+ and Alloy). Details here.


  1. He covers this in more detail in Commanding Pre-Modern Armies but that's less quotable 

  2. At least not since 2017, and I like to think my writing and software skills have improved in the past six years. 

Know (of) the right tool for the job

27 February 2024 at 18:26

Last week's Picat essay did pretty well online! One of the common responses was "This looks really interesting but I don't know where I'd use it."

For Picat there was a gap of five years between me hearing about it and actually learning it. The same is true for most of the weird stuff I encounter: I keep it in the back of my head for years until I finally run into the problem which justifies using it. I think this is an important programming skill: being able to recognize when a tool you know of but don't know solves a problem you have.

It's the opposite of Paul Graham's Blub paradox:

Blub falls right in the middle of the abstractness continuum. … when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

When we switch to the point of view of a programmer using any of the languages higher up the power continuum, however, we find that he in turn looks down upon Blub. How can you get anything done in Blub? It doesn't even have y.

I always thought it was funny that Paul ends this with "and so Lisp is the most powerful language", as if he's not himself a Blub programmer. Regardless, this is definitely a real paradox I've experienced. A programmer once told me that Chapel was just a Julia knockoff, when he had never seen Chapel before and only studied Julia for a week. It's just really hard to think of how things you don't already know might help you. But if you can do this, you're more likely to learn the right tool for the job instead of sticking with the wrong one.1

Some Examples

I categorized Picat as "logic + imperative + constraint solving". I figured it'd be useful if I needed to do one of those things with a sprinkle of another. In the end I was doing a constraint problem and needed "a more expressive constraint solver", which means constraints + a sprinkling of LP. So I learned Picat and it solved the problem. As part of exploring the language I discovered planning, which is now another tool on the backburner for when I hit the right problem.

Sometimes it doesn't work out. Either the unknown tool doesn't actually solve my specific problem or it works in theory but has a problem in practice. At one point I wanted to analyze a state space graph and remembered that Gephi was a graph analysis software suite. But when I actually learned it, I found it was too limited in capabilities to be useful.

That's one thing that worked and one thing that didn't. Something that I haven't tried yet is lisps, in particular macros. I regularly run into problems which seem solvable with macros:

  1. A case statement that automatically raises an error if nothing matches
  2. A special if statement that treats domain-specific values as true or false, which my language doesn't support
  3. Defining multiple functions that all share a specific set of keyword parameters

Do any of these justify learning a lisp? Nah. They might even be terrible ideas in practice! But at least it keeps me thinking about the uses of tools I don't know.

How to develop the skill

Well the first step is to read about lots of different software topics. That's not enough, because you can go on a grand sneer, so you have to be open to thinking about how those topics could augment what you already know.

Past that I have no idea. I think my experience comes from being a teacher. Whenever I learn of a new topic I think about what motivating examples I'd use to teach it, and whenever I come across something "interesting"2 in real life I pattern-match against a bunch of topics to see if it'd make a good example. I just find explaining things fun.


Bonus: LLMs for querying data

As part of the graph project I wanted to see if Neo4j was getting more or less popular. I figured one good proxy would be the number of StackOverflow questions asked per month. It's imperfect (maybe fewer questions = lowest fruit already picked) but good enough for my purposes. StackOverflow's data is public. My blocker is the same as with all sources of public data: I don't know the schema! But maybe GPT4 does:

Give me a query that returns the number of stack overflow questions tagged "neo4j", by month. Use the schema in the "stackexchange data explorer"

SELECT
    FORMAT(CreationDate, 'yyyy-MM') AS [Month],
    COUNT(*) AS [Number of Neo4j Questions]
FROM
    Posts
WHERE
    Tags LIKE '%<neo4j>%'
    AND PostTypeId = 1  -- 1 indicates a question
GROUP BY
    FORMAT(CreationDate, 'yyyy-MM')
ORDER BY
    FORMAT(CreationDate, 'yyyy-MM') ASC;

The result worked! This could be a great use of LLMs! Drunk with power, I tried it on the final boss of public datasets, Wikidata:

Give me a wikidata query for "number of dogs for each dog breed". No explanation, just the query.

SELECT ?breed ?breedLabel (COUNT(?dog) AS ?numberOfDogs) WHERE {
  ?dog wdt:P31 wd:Q144;
       wdt:P31/wdt:P279* ?breed.
  ?breed wdt:P279* wd:Q144.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?breed ?breedLabel

This worked... less well. It doesn't deduplicate dogs, and it gets all dogs of all types, not breeds. We actually want the line ?breed wdt:P31 wd:Q39367.3 Other experiments show that GPT often gets the wikidata tags wrong, sometimes giving stuff that doesn't correspond to anything.

That said, I was still able to kludge something mostly workable in ten minutes by modifying the LLM output, which is still a lot faster than I could have done on my own. This follows my general heuristic for applying AIs: use it on things that are hard to do, easy to check, and easy to fix.


  1. Though using the wrong tool can work just fine. And sometimes it's not worth the time to learn something you don't know if it'll only be slightly better than what you have now. 

  2. You know, flight layovers, self-checkout machines, rat-shaped potholes, bank teller queues... 

  3. This doesn't get all of the data either, but this is a problem with Wikidata and not the LLM. Many specific breeds are listed as instances of dog type (Q38672), which also includes things like "hunting dog" and "sighthound". 

Planning vs Model Checking

20 February 2024 at 23:16

New blogpost! Planner programming blows my mind, Patreon here. Next essay out should be the graph project.

The post is about Picat and more specifically, planner programming. Planning very roughly is:

  1. You provide a start state, a set of goals, and a set of state transitions (actions),
  2. The planner uses the actions to generate a state space,
  3. The planner reports a sequence of actions that reaches a goal, if any.

A lot of formal methods stuff like TLA+ is verified with model checking, which very roughly is:

  1. You provide initial states, a set of properties, and a set of state transitions
  2. The model checker uses the transitions to generate a state space,
  3. The checker reports a sequence of actions that violate a property, if any.

These look extremely similar, and in fact they are mathematically the same thing. Any model checker can find a plan that reaches goal P by writing !P as a property to check. Then the "violation" returns a trace that ends with P. That's how you can use TLA+ to solve the diehard puzzle.

Similarly, any planner can be turned into a model checker by making the goal !P, and then failure to reach that goal means P is a valid property. And in fact that's how the Z3 theorem prover proves properties: to determine if P is true, you check that !P has no satisfying solutions.1 I believe Alloy does the same thing, running as both a "model finder" and a "model checker".
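
Here's a toy Python illustration of the isomorphism: one breadth-first search over a state space, which acts as a planner when you hand it the goal and as a model checker when you hand it the negated invariant. The example is the diehard jug puzzle; the function and action names are mine:

from collections import deque

def search(init, actions, predicate):
    """Return a trace of (action, state) pairs ending in a state satisfying
    `predicate`, or None. Planner: predicate is the goal.
    Model checker: predicate is the negated invariant."""
    frontier, seen = deque([[("init", init)]]), {init}
    while frontier:
        trace = frontier.popleft()
        state = trace[-1][1]
        if predicate(state):
            return trace
        for name, nxt in actions(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(trace + [(name, nxt)])
    return None

# Die Hard: a 3-gallon and a 5-gallon jug, get exactly 4 gallons in the big one.
def actions(state):
    small, big = state
    yield "fill small",  (3, big)
    yield "fill big",    (small, 5)
    yield "empty small", (0, big)
    yield "empty big",   (small, 0)
    pour = min(small, 5 - big)
    yield "small->big",  (small - pour, big + pour)
    pour = min(big, 3 - small)
    yield "big->small",  (small + pour, big - pour)

# As a planner this finds the goal; read as a model checker, it's the
# counterexample to the "invariant" that the big jug is never 4.
print(search((0, 0), actions, lambda s: s[1] == 4))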

Planning and model checking are "isomorphic" to each other, a tool for one is identically a tool for the other. Any differences between the two are in the engineering of planners vs model checkers. But these differences are quite significant, because the approaches come from completely different fields. Planning comes from robotics and AI, and (to my understanding) is predominantly used in video games. Model checking comes from formal methods. These fields are making the same class of tools but for very different reasons, and this leads to a lot of practical differences in what these tools do.

Planning

(I have limited experience with planning and am basing most of this off this page, PDDL, and some github research.)

Planning is mostly done for finding a plan to be immediately executed by a robot or game AI. There's three complications to planning problems not seen in model checking:

  1. It's not enough to find any plan, you want to find the cheapest plan. So planners can assign a "cost" to each action and then find plans with minimum costs. Most model checkers, if they do minimization, only minimize the number of steps in a trace.
  2. The planning may need to be run for many different agents in realtime. There's more emphasis on being able to tune the planner's performance: swap in a different search strategy, fine tune cost heuristics, etc.
  3. Planning can be "dynamic", in that world-state can change while an agent is undergoing a plan, in which case it needs to make a new plan. I don't know how this is generally handled.

Model checking

Model checking is mostly done for finding a bug in software to be fixed. If a model check finds a bug, someone has to read the error trace and then try to make changes to the software to fix it. So you see more emphasis on making legible traces, such as by generating visualizations.

There's a lot of different kinds of bugs in a system, so model checkers offer a lot of flexibility in expressing properties. Most of the planners I found look for end states;2 these correspond to "invariants" in a model checker, properties that hold true at all states. By contrast, most model checkers also accept a wide range of temporal operators, giving properties over sequences of states. Examples are:

  • The counter can only increase, never decrease.
  • The boot sequence cannot start until the switch is flipped, and once it starts, it must complete within five seconds.
  • The database is eventually consistent.
  • Every thread always gets a chance to make progress on its task.

Many model checkers also have interactive or simulation modes, to help programmers better understand the systems without them having to check for specific properties.

Two thoughts on this

Commands vs Queries

Best explained here. Model checking is primarily query-centric (you use it to gather information about the system), planning is primarily command-centric (you use it to set off a bunch of state changes).3 This influences how the tools are engineered: simulation makes sense for gathering information but not effecting change; effecting change means running lots of plans at once, etc.

Engineering matters

Two things can have the same underlying theory but be engineered so differently that one is a poor substitute for the other. Not just in features but performance characteristics, debuggability, etc. Neither is making the wrong choice, both are just adapted to a context. This is a cause of software mimicry.


  1. Technically z3 isn't a model checker so the analogy doesn't fit, but it's too cool a fact not to include here. 

  2. PDDL 3 can express temporal preferences but adoption is limited. 

  3. Not 100% of the time, of course! In the blog post I use planning to query a constraint problem. 

Strings do too many things

8 February 2024 at 15:30

No Newsletter next week

TLA+ Workshop and moving places.

Strings do too many things

In the unusual basis types email1 I wrote this about strings:

We use strings for identifiers, human writing, structured data, and grammars. If you instead use symbols for identifiers then you can be more confident a given string isn't an identifier.

I think this is a useful concept and want to explore it a bit more. Consider the following pseudocode:

query = "INSERT INTO " + tablenames["comments"] + " VALUES ('f1rst p0st', 'john@example.com');"

In that line,

  • "comments" is an identifier. We don't care about anything besides that it's distinguishable from other identifiers. This category also includes enumerations, like when the states of a light are ("red", "yellow", "green").
  • "f1rst p0st" is human writing. It means nothing to the system and something to the people that interact with the system. Most text in the real world is human writing.
  • "john@example.com" is structured data. We store it because this database doesn't have a special "email" type. In a programming language we'd instead store it as Email(local="john", domain="example.com").2
  • The entire query is a grammar. It's a snippet of the SQL language. For that matter, the whole line is also a piece of grammar which we store in the program's text file.

(The line between "structured data" and "grammar" is really fuzzy; is a CSV data or grammar? Maybe making a distinction isn't useful, but they feel different to me)

When you see a string in code, you want to know what kind of string it is. We use these strings for different purposes and we want to do different things to them. We might want to upcase or downcase identifiers for normalization purposes, but we don't split or find substrings in them. But you can do those operations anyway because all operations are available to all strings. It's like how if you store user ids as integers, you can take the average of two ids. The burden of using strings properly is on the developer.

Almost all programming languages have an ASCII grammar but they need to support Unicode strings, because human writing needs Unicode. How do you lexicographically sort a list of records when one of them starts with "𐎄𐎐𐎛𐎍"?

Other problems

I think the conflation of identifiers and strings is "merely" "annoying" in that it adds more cognitive burden.3 The real problems are with data and grammars.

  1. There's no way to tell a string's nature without context. Is "John wears a hat" human writing, or a kind of data triple, or some kind of unusual source code? Is the triple "{wears} {a hat}" or "{wears a} {hat}"?
  2. Strings of one nature can contain strings of another nature. element=<pre>2024-02-07,hello world</pre> has structured data inside a grammar inside a grammar inside a grammar. And it's all inside human writing inside a markdown grammar that'll be parsed by buttondown's email generator. I will be astounded if it renders properly for everyone.
    • We often want to store human writing in data/grammars, which means bolting an extra grammar on top of writing. JSON can't contain multiline strings; is the value "Thank you,\nHillel" a string with a linebreak or a string with the text \n?
  3. Grammars and structured data can be written incorrectly. If the program misunderstands the nature of a string, valid data can be transformed into invalid data. Lots of simple programs break on Windows because they interpret the path separator (\) as an escape sequence.
  4. Security: SQL injection is possible because the fragment "WHERE id="+id expects id to contain an identifier, but someone gets it to contain SQL code instead.

The bigger problem: serialization

Given the problems with strings doing too much, the direct approach is to offload stuff from strings. Some languages have native types like symbols; otherwise, you can make special types for identifiers and each different kind of data. Grammars are trickier to offload; I usually see this done as composable functions or methods.
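
A minimal sketch of that offloading in Python, with a NewType for identifiers and a dataclass for one kind of structured data (the Email class is illustrative, not a real library):

from dataclasses import dataclass
from typing import NewType

TableName = NewType("TableName", str)   # identifier: compare it, look it up, nothing else

@dataclass(frozen=True)
class Email:                            # structured data gets parsed once, at the boundary
    local: str
    domain: str

    @classmethod
    def parse(cls, raw: str) -> "Email":
        local, _, domain = raw.partition("@")
        if not local or not domain:
            raise ValueError(f"not an email: {raw!r}")
        return cls(local, domain)

comments = TableName("comments")
sender = Email.parse("john@example.com")
# A type checker will now complain if you pass raw strings where these belong,
# and nobody's tempted to call .split() on an Email.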

Then you get the problem of getting information out of the program. Files are strings. HTTP messages are strings. Even things like booleans and numbers now are strings. Strings are the lowest common denominator because all languages understand strings. C doesn't know what an abstract base class is but it does know what a string is.

I guess for interprocess communication you've at least got stuff like protobuf and thrift. Otherwise, you could try throwing another grammar in the mix? Instead of storing the email as "john@example.com", store it as

{
  "type": "email",
  "value": "john@example.com"
}

See also: JSON Schema.

Anyway this is all just another example of "push errors to the boundaries of your code". The real world might be stringly typed, but your program internals don't have to be.


  1. Thanks to everybody who sent me other unusual basis types! And the people who pointed out that I forgot sum types as a (near) universal. Feel like an idiot for missing that one 

  2. This is just a demonstrative example, in the real world A B <@C :"D E"@F -G-H!> is a valid email. 

  3. And while writing this I thought of a really annoying case with identifiers as strings: I've worked with a few APIs that take options as a string. Some read "foo bar" as the two identifiers foo and bar, and some read "foo bar" as the single identifier foo bar

What does 'TLA+' mean, anyway

30 January 2024 at 18:08

TLA+ Workshop

Feb 12th. I've decided to reduce the class size from 20 to 15, so there's only a couple of slots left! I'll be making a little less money this way but it should lead to a better teaching experience for the attendees. Use the code NEWSLETTERDISCOUNT for $100 off.

What does "TLA+" mean, anyway

It's a running joke in the TLA+ community that we're afraid of saying what "TLA+" means, lest it scares people off. It's not on the front page of Lamport's Website, or on the github repo, or on learntla (though I put it in the FAQ). So here we go:

TLA+ stands for "Temporal Logic of Actions, Plus".

Okay but that doesn't explain anything. So let's talk about what that actually means. I'm going to oversimplify both the history and math to avoid spending two more days on fact checking.

Modal Logics

A "logic" is a mathematical framework for reasoning about truth statements, like how "arithmetic" is a framework for reasoning about numbers. There are lots of logic but the one we're all familiar with is propositional logic, which is just manipulating raw true-false values. Propositional logic is what gives us stuff like

!(x || !y) == !x && y

Which we use to simplify our conditionals. Then comes predicate logic, where you add in the ability to say "P(x) is true for all x" or "for some x." Predicate logic is used almost everywhere in math. I'm writing a short book/long pamphlet on predicate logic, I think it's underused in programming.

Less common are the various modal logics, which add "modes", or qualifiers. □P means that "P is necessarily true" and ◇P means that "P is possibly true". Note that ◇P is defined as !□!P: P is possibly true if it's not necessarily false.

What does "necessarily true" mean? Ah, there's an interesting question. "Necessity" means different things in different contexts:

  • When talking about "possible worlds", □P could mean that P is true in all possible circumstances, while ◇P is true in some reality. □(2+2=4), ◇(unicorns are real).
  • When talking about proof, □P could mean that we have evidence that P is true, while ◇P means we don't have evidence that P is false. □(unicorns don't exist), ◇(aliens exist).
  • When talking about ethics, □P could mean that P is something you must do, while ◇P is something we may do. □(don't punch babies), ◇(kick unicorns).

A cool one I ran into at SPLASH 2021 was belief programming for drones. Let's say you have an altimeter that's accurate to within 10 meters. If it reads that you're 100 meters off the ground, then □(height >= 90) while ◇(height >= 109).

In formal methods, the most interesting mode is time, because you can use that mode to represent algorithms.

Temporal Logics

Say that "necessarily P" means "P is always true" and "possibly P" means "P is sometimes true", and we get temporal logic. Or, more precisely, a temporal logic, because there's quite a few. One of the big questions comes from whether to model a coin flip as two totally separate timelines or a single timeline that branches in two. This matters because they lead to different interpretations of "sometimes". If you flip a coin once, is ◇heads true?

  1. Linear model: No, because there's two timelines, and in one of them, you flip tails.
  2. Branching model: Yes, because there's one timeline that branches, and in one branch you flip heads.

In 1980 Lamport wrote the influential paper "Sometime" is sometimes "no never" where he argued that linear models are better for modeling concurrent programs.1 The next year Clarke and Emerson would announce the first model checker, which checked a branching time logic. So both approaches are really important in the history of formal methods, but Lamport preferred the linear time, which decided the rest of this story.

Anyway, the dominant linear model of temporal logic is called... linear temporal logic.

Linear Temporal Logic

Time is broken down into a sequence of abstract steps, with three main temporal operators:

  1. G/□: true now and in all future steps.
  2. F/◇: true now or in at least one future step.
  3. N/X/◯: true in the next step.

(All of them have ASCII alternatives because computer scientists hate weird symbols as much as other mathematicians love them. Also there's some more operators like Until and Release that aren't strictly necessary so I'm leaving them out.)

Here's how we can model "x is true in this state and false in the next state":

x && N(!x) 

Here's the boolean flipping between true and false in the next step:

(x && N(!x)) || (!x && N(x))

Here's a boolean that starts true and flips every step:

Flip(x) = (x && N(!x)) || (!x && N(x))
Spec = bool && G(Flip(bool))

Here's a bicycle blinker which turns on when you click it once, blinks while it's on, and turns off when you click it again.

Flip(x) = (x && N(!x)) || (!x && N(x))
Init = !light && !active
TurnOn = !active && Flip(active) && N(light)
TurnOff = active && Flip(active) && N(!light)

Spec = Init && G(
    TurnOn
||  TurnOff
||  (active && N(active) && Flip(light))
)

I think I got that right! There's programs that convert LTL fragments if you feel like showing how I'm wrong.
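
If you'd rather check than take my word for it, here's a brute-force sketch in Python. Real LTL is defined over infinite traces, so this finite-trace version (with a "weak" next that's vacuously true at the end) is only an approximation, but it's enough to enumerate every length-5 trace and print the ones satisfying the blinker spec:

from itertools import product

# A formula is a function of (trace, position): "does this hold at position i?"
def var(name):  return lambda t, i: t[i][name]
def NOT(p):     return lambda t, i: not p(t, i)
def AND(*ps):   return lambda t, i: all(p(t, i) for p in ps)
def OR(*ps):    return lambda t, i: any(p(t, i) for p in ps)
def N(p):       return lambda t, i: i + 1 >= len(t) or p(t, i + 1)  # weak next: true at the end
def G(p):       return lambda t, i: all(p(t, j) for j in range(i, len(t)))

def Flip(x):    return OR(AND(x, N(NOT(x))), AND(NOT(x), N(x)))

light, active = var("light"), var("active")
Init    = AND(NOT(light), NOT(active))
TurnOn  = AND(NOT(active), Flip(active), N(light))
TurnOff = AND(active, Flip(active), N(NOT(light)))
Blink   = AND(active, N(active), Flip(light))
Spec    = AND(Init, G(OR(TurnOn, TurnOff, Blink)))

states = [{"light": l, "active": a} for l in (False, True) for a in (False, True)]
for trace in product(states, repeat=5):
    if Spec(trace, 0):
        print([(s["light"], s["active"]) for s in trace])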

Action Logics

LTL is arguably the most widespread temporal logic, and lots of formal methods languages use it for expressing properties. But it's not commonly used for expressing entire systems. There's lots of very specialized reasons for this and the most easily-explainable one is that updating variables sucks. Say x is an integer you want to increment. You'd want to write something like N(x) = x + 1, but you can't do that, N(x) returns a boolean. You could also try N(x = x + 1), but that doesn't work because both x's in the N refer to the next value of x. There's no way to get one side of the = to be the current value of x and other side to be the next value.

Instead you have to do something like this:

some z: Int | (z = x + 1) && N(x = z)

Oooooor we could just define x' to be the "the value of x in the next state"

x' = x + 1

You read x' as "x-prime". Using prime to mean "the next value" goes at least back to Turing, but mathematicians and physicists were using it in the same-ish way for far longer.

Boolean formulae with primes are called actions. Lamport claims to be the first person to make an action-based temporal logic and I haven't found any information that refutes him. Epistemically ◇(he was the first), though I can't assert □(he was the first).
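
If it helps to see it in code: an action is just a boolean function of the current state and the next state. A tiny Python rendering of x' = x + 1 (the dict representation is mine):

# An action is a predicate over (current state, next state).
def increment(s, s_next):
    return s_next["x"] == s["x"] + 1

print(increment({"x": 1}, {"x": 2}))   # True:  this pair of states is an increment step
print(increment({"x": 1}, {"x": 5}))   # False: this one isn't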

Anyway, Lamport published his new logic in 1990 and then more thoroughly in 1994. He called it the "Temporal Logic of Actions", or TLA.

PLUS

TLA is the formal logic: what model of time we're using, what □ and ◇ mean, what makes a valid formula, etc. TLA+ is the specification language built on top of the formal logic. It adds things like conditionals, sequences, integer arithmetic, etc. You could design a totally different specification logic using the same TLA core. Or you could embed just the TLA stuff in a theorem prover and leave out the + stuff.


  1. ¿Por qué no los dos? (Why not both?) You can have a model that contains both linear and branching timelines, like CTL*. The main downside is that it leads to much more difficult computational problems. Satisfiability, for example, is 2EXPTIME-complete 

GitHub Search for research and learning

24 January 2024 at 16:24

Hi everyone!

I have a new blog post out: An RNG that runs in your brain. It's a mix of cool tricks and math analysis done with an exotic gremlin language. Patreon is here. Also TLA+ workshop on Feb 12 etc etc use the code NEWSLETTERDISCOUNT for $100 off etc

Anyway I've been all over the place this week wrt contracts, writing projects, and errands where no matter what I'm working on it feels like there's something more important that I should be doing instead. One of the projects I'm jumping between is the "Why don't we have graph types" writeup. I'm trying (trying) to get the first draft done by the end of the month, and as part of that I'm researching how people use graph libraries in production codebases. And that means a lot of time with GitHub code search.

Code search is amazing. First of all, it covers 200 million public repositories. Second, it can filter on language and filepath. Third, it supports regular expressions. To find files that use Python's graphlib package, I just need to search

/(from|import) graphlib/ language:python

After that I can sift through the results to separate real-life use cases from false positives and toy problems.

I also used search for another project I'm jumping between, the logic book. One programming trick you get from logic is that you can rewrite not (all x: P) or (some x: Q) as (some x: Q or not P), which is clearer and faster to evaluate.1 I wanted a real-life example of code that can be rewritten this way. So I searched

/all\((.*)\) or any\(.*\):$/ language:python

And got like 300 code samples that could be simplified. I used a similar search to get examples of nested quantifiers for a different topic in the book. Now if only I could stop researching and start writing...
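
For the curious, here's the rewrite itself as runnable Python. The Order/orders names are made up, but the two expressions are equivalent, and the second makes a single pass and says what it means:

from dataclasses import dataclass

@dataclass
class Order:
    paid: bool
    comped: bool

orders = [Order(paid=True, comped=False), Order(paid=False, comped=False)]

# Before: matches the /all(...) or any(...)/ pattern
needs_review = not all(o.paid for o in orders) or any(o.comped for o in orders)

# After: "some order is comped or unpaid"
needs_review = any(o.comped or not o.paid for o in orders)

print(needs_review)  # True either way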

Code search is potentially even more useful as an education tool. You can learn a lot about how something works by reading code that uses it. For example, let's say you're using hypothesis for property-based testing, and you want to get better at composites. You can find examples by searching

/(from|import) hypothesis/ composite language:python path:test_*.py

Oh yeah you can include multiple regexes. Want to find files that have two composites? Replace composite with /composite(.|\s)*composite/. Just about the only things that don't seem to work are group matching and lookarounds like (?!).

So far I've just done searches from the website, but there's also a code search API. You could feasibly run a query, dump a hundred matches into files, and then do more complicated filtering and munging before spitting out a courseload's worth of case studies.

You can read more about the search syntax here. Try it out!


  1. It's even better when you can rewrite not P in a clear way! Like replacing not (x < 0) with x >= 0

Unusual basis types in programming languages

17 January 2024 at 16:00

TLA+ Workshop

TLA+ workshop on Feb 12! Learn how to find complex bugs in software systems before you start building them. I've been saying the code NEWSLETTERDISCOUNT gives $50 off, but that's wrong because I actually set it up for $100 off. Enjoy!

Unusual basis types in programming languages

Here are the five essential "basis" types in all modern programming languages:

  1. Booleans
  2. Numbers
  3. Strings
  4. Lists
  5. Structs

(I'm talking about these as general overarching concepts, so numbers includes doubles and integers, lists include arrays and tuples, etc. Languages often support several types in each category.)

These types are always supported with dedicated syntax, as opposed to being built out of the language abstractions. Some domains have additional basis types: systems languages have pointers, dynamic languages have key-value mappings, FP languages have first-class functions, etc. Outside of these, languages rarely add special syntax for other types, because syntax is precious and you can build everything out of structs anyway.

So it's pretty neat when it does happen. Here are some unusual basis types I've run into. This list isn't comprehensive or rigorous, just stuff I find interesting.

Sets

Seen in: Python, most formal methods languages

Sets are collections of unordered, unique elements. Sets are nice because you can add and subtract them from each other without worrying about duplicates. Many languages have them in the standard library but Python also gives them special syntax (shared with dictionaries).

# Python
>>> ["a", "b"] + ["a", "c"]
['a', 'b', 'a', 'c']
>>> {"a", "b"} | {"a", "c"}
{'c', 'b', 'a'}
>>> ["a", "b"] - ["b"]
TypeError: unsupported operand type(s) for -: 'list' and 'list'
>>> {"a", "b"} - {"b"}
{'a'}

Lots of things we use lists for are better represented with sets. Using a set tells the reader that ordering and duplicates don't matter to your program, which you can't always figure out from context. They're probably the thing I miss most in the average language, with symbols coming in a close second.

Symbols

Seen in: Lisps, Ruby, logic programming

Symbols are identifiers that are only equal to themselves. Most languages use strings for the same purpose, the difference being that strings can be manipulated while symbols can (in essence) only be compared.1

# Ruby
> puts "string1" + "string2"
string1string2
> puts :symbol1 + :symbol2
undefined method `+' for :symbol1:Symbol (NoMethodError)

So why use symbols if you already have strings? In Ruby it's because strings are mutable while symbols are not, meaning symbols are more space-efficient. This can really add up if you're making a lot of hash tables. The broader reason is that strings already do too many things. We use strings for identifiers, human writing, structured data, and grammars. If you instead use symbols for identifiers then you can be more confident a given string isn't an identifier.

Some logic programming languages— at least those that stem from Prolog— have symbols (called atoms) and not native strings. SWI-Prolog added proper strings in 2014.

Pairs

Seen in: Raku, Alloy kinda

Raku actually has a few different unusual primitives, but pairs are the one that's most generally useful.2 A pair is a single key-value mapping, such that a hash table is just a set of pairs.

> (1 => 2){1}
2 
> {"a" => 1, "b" => 2}.kv
(b 2 a 1)

I like pairs a lot. First of all, they satisfy the mathematician in me. You can pull the element out of a list, so why not pull the kv pair out of a mapping? Raku also does a lot of cool stuff with pairs. For example, :$x is shorthand for 'x' => $x and :foo is short for foo => True. These make it easy to add optional flags to a function. I think keyword arguments are implemented as pairs, too.

Alloy also has sets of pairs ("relations") and a lot of cool ways to manipulate them. ^rel returns the transitive closure of rel, while set <: rel returns only the pairs in rel that begin with elements of set.
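
If your language has sets of tuples you can fake a lot of this. Here's a rough Python version of those two operators (restrict and closure are my names, not Alloy's):

def restrict(s, rel):
    """set <: rel -- keep the pairs whose first element is in s."""
    return {(a, b) for (a, b) in rel if a in s}

def closure(rel):
    """^rel -- transitive closure by repeated joining."""
    result = set(rel)
    while True:
        step = {(a, c) for (a, b1) in result for (b2, c) in result if b1 == b2}
        if step <= result:
            return result
        result |= step

manages = {("alice", "bob"), ("bob", "carol")}
print(closure(manages))            # adds ("alice", "carol")
print(restrict({"bob"}, manages))  # {("bob", "carol")}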

N-arrays

Seen in: APLs (like Dyalog, J, BQN), maybe Matlab?

In most languages a "2D" array is just an array containing more arrays. In APLs, a 2D array is just an array with two dimensions. Unlike the list-of-lists model, there's no "preferred axis", and you can mess with columns just as easily as you mess with rows.

   NB. J
   ]x =: i. 4 4
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

   NB. first row
   0 { x 
0 1 2 3

   NB. first column
   0 {"1 x
0 4 8 12

   NB. grab the middle
   (1 1 ,: 2 2) [;.0 x
5  6
9 10

Other things you can do: sort the rows by a specific column. Sort the columns by a specific row. Transpose the array so the rows are now columns. Reshape it into a 2x8 array. Multiply every element by 6.

Arrays can have any number of dimensions, though I find more than 2 dimensions awkward. If I have 3 or more I'm usually better off with a list of 2D arrays (via boxing). But 2D arrays are quite useful! Which is why the most popular 2D array by far is the dataframe.

Dataframes

Seen in: lil, nushell, Powershell, R (I think?)

Sort of a mix of a 2D array, a mapping, and a database table. Allow querying on both indexes (rows) and column names.

Python does not have dataframes as a basis type but pandas is the most popular dataframe library in use, so I'll use that for the demo:

import pandas as pd
df = pd.DataFrame({
    'airport': ['ORD', 'LHR', 'MEL'],
    'country': ['USA', 'UK', 'Australia'],
    'year': [1944, 1946, 1970]
})

>>> df['airport']
0    ORD
1    LHR
2    MEL

>>> df.iloc[1]
airport     LHR
country      UK
year       1946

>>> df[df['year']>1960]     
  airport    country  year
2     MEL  Australia  1970

Dataframes are unsurprisingly extremely popular in data science, where you want to slice and dice data in all sorts of weird ways. But you also see them in some shell languages because being able to sort and filter ps/ls/grep is really really nice.

Things missing

  • I don't know any stack or assembly languages, and I didn't look at NoCode/LowCode environments or DSLs.
  • I bet there's a lot of unusual stuff in Mathematica, but I'm not paying $400 just to check.
  • There's a lot more basis types I've heard of in languages I don't know: Go channels, lua userdata, SNOBOL patterns, raw bytes and byte streams.
  • No syntax for graphs, which is unsurprising because there's no library support for graphs either. Working on the writeup of why now.

Feel free to email me with weird basis types you know about that I didn't list here. The criteria is pretty loose but the very broad qualifier is "is there syntax specifically for this type?"


  1. In essence. Ruby supports a bunch of methods like upcase on symbols, which first convert it to a string, then upcases it, then converts it back to a symbol. 

  2. The other really unusual type is the junction, which is a sort of superposition of boolean conditionals. They exist so you can do if ($x == 3|5) instead of if $x == 3 or $x == 5

Why all([]) is true, prod([]) is 1, etc

9 January 2024 at 18:05

New Blogpost

The World and the Machine. It's about how to model systems better, and while it's written in TLA+ the advice should apply regardless of your speclang. Patreon is here.

Speaking of TLA+, I'm running a TLA+ workshop on Feb 12th! I'm real excited about this one because I wrote a new lesson on the TLA+ collection types which should make the whole process a bit smoother. Use the code NEWSLETTERDISCOUNT for $50 off.

Why all([]) is true, prod([]) is 1, etc1

This is inspired by this article. Some Python list-ops do unusual things with empty lists:

>>> all([])
True
>>> any([])
False
>>> math.prod([])
1
>>> sum([])
0

You can find in-depth explanations for each of these by itself, like the empty product. But all of these can be explained as consequences of the same mathematical principle. Let's say we have two lists xs and ys, and let + be list concatenation. We expect the following to always hold:

prod(xs + ys) = prod(xs) * prod(ys)

IE prod([2,4,6,8]) = prod([2, 4]) * prod([6, 8]).2 You can see why we'd want this to be true, as it means that prod behaves in a reasonable way regardless of how big our list is. But we also have an edge case:

prod(xs) 
  = prod(xs + [])
  = prod(xs) * prod([])

So for any xs we must have prod(xs) = prod(xs) * prod([]), which is only possible if prod([]) = 1. We can apply similar reasoning for other list reductions:

sum(xs) = sum(xs) + sum([]) = sum(xs) + 0
all(xs) = all(xs) and all([]) = all(xs) and True
any(xs) = any(xs) or any([]) = any(xs) or False

Simple, right?

But what about max?

Of course there's a problem. max(xs + ys) is obviously max(max(xs), max(ys)). So by the same reasoning we should have

max(xs) = max(max(xs), max([]))

But max([]) can't be any number. Let's say we decide it's 0, then we have

max([-1]) = max(max([-1]), 0) = 0

We can't define the maximum of an empty list! Then what's special about prod that makes an empty product possible when an empty max is not?

What's "special" is that prod has an identity: some value a such that x * a = a * x = x (1, for prod). The mathematical term for "a set of elements with an associative binary operation that has an identity" is a monoid. For any monoid, we can take the operation and create a version that applies it to a list, and then applying that to the empty list should return the identity:

listop(xs)
  = listop(xs) op listop([])
  = listop(xs) op iden
  = listop(xs)

The set of booleans forms a monoid under or, which has the identity element False, so any([]) = False. max does not have an identity, so is not a monoid, so we can't meaningfully define it for an empty list.3

"A version that applies it to a list"...

...is just reduce. prod(xs) is reduce(*, xs, 1), any is reduce(or, xs, False), etc. Most languages require you to provide an initial value for the reduce (ie reduce(max, xs, xs[0])), but if you're reducing over a monoid then you shouldn't actually need it, as you can provide the identity as the default initial value.
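
Here's what that looks like with Python's functools.reduce: pass the identity as the initial value and the empty case falls out for free, while max has nothing obvious to pass:

from functools import reduce
import operator

def prod(xs):
    return reduce(operator.mul, xs, 1)      # 1 is the identity for *

def all_(xs):
    return reduce(operator.and_, xs, True)  # True is the identity for "and"

print(prod([]), all_([]))      # 1 True
print(reduce(max, [2, 7, 3]))  # 7, works because the list is non-empty
# print(reduce(max, []))       # TypeError without an initial value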

And in fact this is (sort of) what Haskell does. In Haskell, foldl/foldr over lists require a default value, but if you know you're working over a monoid, you can use mconcat without one, as it can use the identity as its default.

>>> mconcat [1,2,3] :: Product Int
Product {getProduct = 6}
>>> mconcat [] :: Product Int
Product {getProduct = 1}

Some other examples of monoids:

  • Functions form a monoid under function composition, with identity f(x) = x.
  • NxN matrices under matrix multiplication, with identity "the NxN identity matrix".
  • Strings under string concatenation, with identity "".
  • Dictionaries under dict-merge (left or right), with identity {}.

For all of these you shouldn't need to define a default value when reducing in principle, though getting a non-Haskell language to automatically determine that you're reducing over a monoid is more trouble than it's worth.

I think it's interesting that most of those examples are collection types and not primitives. Probably just because there are more collections than primitives.


  1. The original title was "why any([]) is true", which was a dumb typo on my part. As the body of the essay discusses, any([]) is false. I meant "all([]) is true" and fixed the title of the post. 

  2. The technical term for this is that product should be homomorphic. An example of a non-homomorphic function would be mean: mean(xs + ys) != mean(mean(xs), mean(ys)) 

  3. At least not over a normal set of numbers. There's an identity if you're doing modular arithmetic, in which case you're fine defining the empty max. Also no less than five readers pointed out that you can use -infinity as the identity for max. ♥ u guyz 

Advice for new software devs who've read all those other advice essays

19 December 2023 at 23:40

Someone recently asked me if I had advice for early-career programmers. At first I thought this was a silly question. I only entered the workforce ten years ago; many of my newsletter subscribers have been programming for longer than I've been alive!

Then I went and read some "advice for starting programmer" essays and thought of some things they missed. So here's thirteen bits of advice for early-career programmers. Some of it is contradictory.

  1. People don't listen to me because I'm a good programmer, they listen to me because I'm a good writer. The same is true of pretty much everybody you'll read. This doesn't mean you should automatically reject everything, but it means you should carefully think about it and evaluate how it applies to your situation. And take any argument about "objective truth" with a grain of salt: there is very little about software that's been scientifically studied, and most of the research is inconclusive.

  2. But also don't worry too hard about getting "tricked" or learning "the wrong thing". If you like someone's ideas, try them out! As long as you're not actively sabotaging your coworkers then it'll probably work out in the end, even if you look back and think "I should have done that different." That's what learning is all about!

  3. Read the book Debugging: The 9 Rules. Borrow it from the library or ask your company to buy a copy, whatever. It's a real easy read and teaches an important skill all the other "beginner programmer" books barely cover.

  4. At some point you will discover the Right Way to program, the thing which makes this all make sense, and you'll be convinced that the whole field would be so much better off if everybody else programmed the Right Way, too. For me the Right Way was test-driven development; for you it might be functional programming, lisp, formal methods or one of a million other things.

    I'm not going to tell you to not get swept up in the Right Way, because that's pretty much impossible. And honestly it feels really great to discover the Right Way and life's too short to not feel good. Just be mindful of the fact you're getting swept up and try not to make your identity the Right Way Guy. Eventually the honeymoon will end and you'll learn that programming is frustrating and messy regardless of which Right Way people use, and that you can also make great software without doing it the Right Way. Over time you'll learn fifty other Right Ways and learn to mix and match them to the problem at hand.

  5. When you first encounter the Right Way, it will likely be from someone who went full Right Way Guy. Try not to hold it against them later. And try not to conflate the actual technique with how the RWG pitches the technique. Most ideas need some modification from their purest form to integrate well with other ideas.

  6. Julia Evans once said "behind every best practice is a horror story." If you don't understand a Best Practice, look for the horror story that inspired it. It might make the best practice make sense. It might turn out to be something that's completely irrelevant to you, and then you can feel comfortable doing a different practice instead.

  7. Building on the last tip: a lot of best practices and conventions are "path dependent", arising from a mix of historical and cultural factors. There are things we do because our mentors do it, who do it because their mentors did it, who did it to address issues that aren't as relevant anymore. If something sounds like a just-so story, it very well might be. You can often retrace the whole path if you're willing to look.

  8. Take walks.

  9. Almost every tool you use has some form of hidden depth, from your programming language to git to JIRA. Don't feel like you have to become an expert in every single one, but do consider spending 5-10 minutes learning a bit more about what it can do.

  10. Talk to people in other parts of the company: support, business domain, sales, etc. Consider shadowing them if you have the time (and feel comfortable asking). You'll be surprised by what you learn!

  11. If possible, try to do a few different types of programming earlier in your career. This doesn't have to mean switching jobs: most companies are doing several different kinds of programming at once. So like if you're starting in a webdev company, try some frontend, some backend, some operations, some database stuff, and so forth. This helps you learn, but far more importantly increases your chances of finding a kind of software work that you really really like. My first job was as a frontend dev and I was miserable. Later I moved to backend and was much happier, as were the people who wanted to spend more time doing frontend!

  12. You've probably heard the advice that software as a field is changing all the time, and that you shouldn't get swept up in the framework treadmill, just focus on learning fundamental skills. This is true, but doesn't explain the "why". For structural reasons, information in software propagates really quickly. This is due to a lot of factors (internet, open source, conferences) but overall there's a lower barrier to sharing ideas in software. So it's really easy to get a lot of people to know about someone's pet project, even if only one person uses it.

    The upshot of this is that a lot of the technologies you hear about have very small userbases and will never get wide adoption, but it won't seem that way from how you hear about it. That's why it makes sense to be conservative. If you hear about something that gets you excited, go ahead and be an early adopter, otherwise it's okay to wait a couple years to see if it has legs.

  13. Ultimately none of us can predict the future, just as none of us could have predicted the present. Just try to do the best you can, live according to your values, and enjoy the ride.

That's all from me for the year; new workshop dates when I get back. See you in 2024!

I formally modeled Dreidel for no good reason

12 December 2023 at 20:14

New blog post

Notes on Every Strangeloop 2023 Talk I Attended. Would have been out earlier but I was busy and sick. Patreon is here.


I formally modeled Dreidel for no good reason

It's the sixth night of Channukah and I'm thinking about dreidels:

A dreidel

On Channukah we use dreidels to play the game dreidel:

  1. Every player starts with N pieces (usually chocolate coins). This is usually 10-15 pieces per player.
  2. At the beginning of the game, and whenever the pot is empty, every player antes one coin into the pot.
  3. Turns consist of spinning the dreidel. Outcomes are:

    • נ (Nun): nothing happens.
    • ה (Hey): player takes half the pot, rounded up.
    • ג (Gimmel): player takes the whole pot, everybody antes.
    • ש (Shin): player adds one of their coins to the pot.
  4. If a player ever has zero coins, they are eliminated. Play continues until only one player remains, or (much more likely) everybody gets bored.

Traditionally this game is played by kids, as it's an excellent way to convince them that gambling is boring as all hell. In 2015 Ben Blatt simulated the game and found that for P=4 and N=10, the average game will take 860 spins to finish.
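A simulation like that is easy to sketch in Python. Here's a rough version of the rules above; for simplicity it stops as soon as the first player is knocked out, which is also the stopping condition I'll use in the PRISM model below:

import random

def spins_until_first_knockout(players=4, coins=10):
    stacks = [coins] * players
    pot, spins, turn = 0, 0, 0
    while all(s > 0 for s in stacks):
        if pot == 0:                       # everyone antes when the pot is empty
            stacks = [s - 1 for s in stacks]
            pot += players
            continue
        spins += 1
        roll = random.choice(["nun", "hey", "gimmel", "shin"])
        if roll == "hey":                  # take half the pot, rounded up
            take = (pot + 1) // 2
            stacks[turn] += take
            pot -= take
        elif roll == "gimmel":             # take the whole pot
            stacks[turn] += pot
            pot = 0
        elif roll == "shin":               # pay one coin in
            stacks[turn] -= 1
            pot += 1
        turn = (turn + 1) % players
    return spins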

But this was based on a mere 50,000 simulated games! We can do better than a mere simulation. We can mathematically model the game and get exact numbers. So we're busting out PRISM.

Modeling Dreidel in PRISM

PRISM is a probabilistic model checker. You create a model of a probabilistic system, like queueing workers or a gambling game, and PRISM can give you exact probabilities of various events happening. I've written two blog posts on PRISM:

If you read those blog posts, you'll quickly learn that I really do not like PRISM. The problem is that it is an incredibly limited language. It doesn't have strings, or arrays, or functions. This makes generalizing or abstracting things virtually impossible, and you're going to be hardcoding a lot of stuff. Despite all that, I keep finding excuses to use PRISM anyway. I've got an enormous number of small PRISM tricks I will never share with anyone because nobody else will ever use PRISM. Maybe there's a grad student who needs them, I dunno.

I'm not going to go through the entire model, you can see it here. Just some highlights:

formula Players = 4; //Players
formula maxval = M*Players;
formula done = (p1=0) | (p2=0) | (p3=0) | (p4=0);

Since I can't abstract over anything, I can't make a general model that works for P players, so instead I hardcode in 4 players. Also, properly modeling "playing until one player is left" would have been a nightmare, so instead I'm just modeling until one player is eliminated.

    [spin] ((pot != 0) & !done & (turn = 1)) ->
      0.25: (p1' = p1-1) & (pot' = min(pot+1, maxval)) & (turn' = 2) //shin
    + 0.25: (turn' = 2) //nun
    + 0.25: (p1' = min(p1+halfpot,maxval)) & (pot' = pot-halfpot) & (turn' = 2) // hay
    + 0.25: (p1' = min(p1+pot,maxval)) & (pot' = 0) & (turn' = 2); //gimmel

This is the core of a PRISM spec: a set of conditions for this to happen, a set of weighted outcomes, and the updates for each outcome. Though it's mathematically impossible for a shin to make the pot have more than maxval coins, whatever model checking algorithm PRISM uses couldn't tell that, so I had to manually cap it with (pot' = min(pot+1, maxval)). I spent a lot of time on this spec coddling the model checker.

Also notice that I had to hardcode that this is all for player one (p1). This means that I have to write this action three more times:

    [spin] ((pot != 0) & !done & (turn = 2)) ->
      0.25: (p2' = p2-1) & (pot' = min(pot+1, maxval)) & (turn' = 3)
    + 0.25: (turn' = 3)
    // etc etc etc

All of the actions are guarded on pot!=0. When it's empty, we instead have an ante action:

    [ante] (pot = 0) & !done -> (pot'=pot+4) & (p1' = p1-1) & (p2' = p2-1) & (p3' = p3-1) & (p4' = p4-1);

Since this means not all actions will be spins, I added a reward to track just the spins:

rewards "num_spins"
    [spin] true : 1;
endrewards

Querying the system

So why bother with all this when the Python would be like ten lines? The reason is that once you've gone through all the work of modeling something in PRISM, you get a powerful querying system to learn properties of the model. First, let's just find the expected value (weighted average) for the number of spins to knock out one player:

$ ./prism dreidel.prism -const M=10 -pf 'R=? [C]'    

Value in the initial state: 63.71149441710945
Time for model checking: 174.563 seconds.

So if we have four players, and each player starts with 10 coins, it'll take on average 64 spins to knock a single player out. I stopwatched a dreidel and it takes about 8 seconds for a spin, so you're looking at about nine minutes of watching a top spin before someone gets to leave the table.

Of course that's only the likeliest outcome. I can also get the cumulative probability like this:

const C0;
P=? [F<=C0 done]

That returns the probability that within C0 steps of starting, done is Finally true. Note this is total model checking steps and includes antes too. While I can run this from the command line, it's nicer to see in the GUI.

A graph of the gui showing that 90% of the time, you'll knock out one player within 150 steps

Those are still pretty easy to find from a regular simulation, so let's try a more complex property. What's the probability that the game ends because a player rolled a Shin, as opposed to an ante?

prism dreidel.prism 
  -pf "P=? [F (pot != 0 & !done & (X done))]" 
  -const M=6:4:10 
  -exportresults stdout

M       Result
6       0.2944666679018678
10      0.2930102532933088

(Don't forget to check the probability it ends because of an ante!)

What's the probability that the first player will have 75% of all pieces in the game, and still lose?

-pf "P=? [F (p1=maxval-M & (F p1=0))]"
-const M=5:5:15 
-exportresults stdout
M       Result
6       1.1876357695852015E-10
10      1.0185162507865054E-8
15      4.902270837317093E-8

It's extraordinarily unlikely but becomes less extraordinarily likely when you start with more coins. And again we're just modeling until one player runs out of coins, so the odds are slightly (slightly) higher than this.

Finally, the chance the pot maxes out at C0 coins:

-pf 'P=? [(F pot=C0) & (G pot <= C0)]'
-const M=10,C0=0:1:20

A graph showing the maximum pot size probability peaks at 6 coins and almost never goes above 9

Was this worth it?

Ugh no I hate PRISM

But also I think I figured out a different approach to modeling this that would better handle adding players and running until just one player is left. Maybe that'll be a blog post for next Channukah!


End of Year Administrative Stuff

Next week will be the last newsletter of the year. Quick updates on this year's projects:

  • Logic for Programmers: Haven't made any progress on it for a while, but a friend has kindly offered to help edit and keep me to a schedule, so I'm more optimistic for the next year.
  • Why programming languages don't have graph types: Done with interviews, writing an outline!

I want to write a new formal methods talk for next year, which would focus on the general history and idea of formal specification instead of just pitching TLA+ as a product. If you know any conference people who'd be interested, put me in touch!

When would you ever want bubblesort?

4 December 2023 at 20:23

There are very few universal rules in software engineering, but there are a lot of near-universal principles. Things like "prefer composition to inheritance" are near-universal. I love finding the rare situations where these principles don't hold, like where you do want inheritance over composition. A similar near-universal principle is "don't use bubblesort". Some would even say it's a universal rule, with Donald Knuth writing "bubble sort seems to have nothing to recommend it, except a catchy name and the fact that it leads to some interesting theoretical problems".1 But Knuth's been wrong before, so let's see if this universal rule is only near-universal.

Theoretically, bubblesort is faster than quick or mergesort for small arrays. This makes it useful as part of a larger sorting strategy: most of the fast-in-principle sorting algorithms work by recursively sorting subpartitions of an array, ie if you apply quicksort to 2^20 random integers, at some point you're sorting 2^17 8-integer subpartitions. Switching over to bubblesort for those subpartitions would be a nice optimization.

Many production sorting algorithms do use a hybrid approach, but they overwhelmingly use insertion sort instead. Insertion sort is very fast for small arrays and it's also better at using the hardware. On some very particular hardware bubblesort still ends up better, like in this NVIDIA study, but you probably don't have that hardware.

So that's one use-case, albeit one still dominated by a different algorithm. It's interesting that NVIDIA used it here because gamedev has a situation where bubblesort is uniquely useful, based on two of its properties:

  1. While the algorithm is very slow overall, each individual step is very fast and easily suspendable.
  2. Each swap leaves the array more ordered than it was before. Other sorts can move values away from their final positions in intermediate stages.

This makes it really good when you want to do a fixed amount of sorting work per frame. Say you have a bunch of objects on a screen, where some objects can occlude others. You want to render the objects closest to the camera first because then you can determine which objects they hide, and then save time rendering those objects. There's no correctness cost for rendering objects out of order, just a potential performance cost. So while your array doesn't need to be ordered, the more ordered it is the happier you are. But you also can't spend too much time running a sorting algorithm, because you have a pretty strict realtime constraint. Bubble sort works pretty well here. You can run it a little bit at a time each frame and get a better ordering than when you started.
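As a rough sketch of the idea (the scene_objects and distance_to_camera names are made up for illustration):

def bubble_pass(items, key):
    # One bubblesort pass: each swap removes an inversion, so every call
    # leaves the list at least as ordered as it was before.
    swapped = False
    for i in range(len(items) - 1):
        if key(items[i]) > key(items[i + 1]):
            items[i], items[i + 1] = items[i + 1], items[i]
            swapped = True
    return swapped  # False means the list is now fully sorted

# Each frame, spend one bounded pass nudging objects toward front-to-back order:
# bubble_pass(scene_objects, key=lambda obj: obj.distance_to_camera)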

That reminds me of one last use-case I've heard, apocryphally. Let's say you have a random collection of randomly-colored particles, and you want to animate them sorting into a rainbow spectrum. If you make each frame of the animation one pass of bubblesort, the particles will all move smoothly into the right positions. I couldn't find any examples in the wild, so with the help of GPT4 I hammered out a crappy visualization. Code is here, put it here.

(After doing that I suspect this isn't actually done in practice, in favor of running a better sort to calculate each particle's final displacement and then animating each particle moving directly, instead of waiting to move for each bubblesort pass. I haven't mocked out an example but I think that'd look a lot smoother.)

So there you go, three niche use cases for bubblesort. You'll probably never need it.


New Quanta Article!

Okay so I didn't actually write this one, but I played a role in it happening! A while back a friend visited, and we were chatting about his job at Quanta. At the time he was working on this mammoth article on metacomplexity theory, so naturally the topic of problems harder than NP-complete came up and I recommended he check out Petri net reachability. So he did, and then he wrote An Easy-Sounding Problem Yields Numbers Too Big for Our Universe. Gosh this is so exciting!


  1. Taken from here

How do you take the average of two timestamps?

10 November 2023 at 00:22

New Blog Post

A better explanation of the Liskov Substitution Principle. First new post since August! Patreon notes are here.

How do you take the average of two timestamps?

Let's send off the 6x era with something silly. Some physical units of measure, like timestamps, can be subtracted but not added. What does it mean to add 2 PM to 6 PM? That makes no sense. But you can subtract 2 PM from 6 PM and get 4 hours. The output we get is a different type entirely, a "timedelta", that we can add to either timestamp.

I've seen these kinds of units variously referred to as interval measures or affine measures. Other affine measures are things like map coordinates, test scores, and temperatures. All things where finding the sum isn't sensible but finding the difference is. You can think of the "point" and the "delta" as being different types, where operations with two points or two deltas produce deltas, while operations with a point and delta produce a point.
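Python's datetime library already encodes this split, which makes for a quick demo:

from datetime import datetime

two_pm = datetime(2023, 11, 9, 14, 0)
six_pm = datetime(2023, 11, 9, 18, 0)

six_pm - two_pm               # timedelta(seconds=14400), ie 4 hours
two_pm + (six_pm - two_pm)    # datetime(2023, 11, 9, 18, 0)
# two_pm + six_pm raises TypeError: two "points" can't be added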

About that averaging

Today the sun rose at 6:32 AM and set at 4:35 PM. When was high noon?

The easiest way to find a midpoint between two numbers is to just do (a + b)/2, but we can't just add the two timestamps. Instead we have to do this:

from datetime import datetime
sunrise = datetime(2023, 11, 9, 6, 32, 0)
sunset = datetime(2023, 11, 9, 16, 35, 0)
>>> sunrise + (sunset - sunrise) / 2
datetime.datetime(2023, 11, 9, 11, 33, 30)

This way we're not adding timestamps to timestamps, we're adding timedeltas to timestamps. All the math checks out.

By contrast, here's what happens if we do things wrong:

>>> h, m = 16+6, 32+35
>>> h, m
(22, 67)
>>> h/2, m/2
(11.0, 33.5)

wait that's the same thing

Okay so here's why we have affine measures in the first place: we use them when we're not measuring from an objective zero point. For some measures, like dates, there is no objective zero, while for others there's a true zero that's too inconvenient for general use. In this context, addition is sensitive to our choice of zero, but subtraction is not— nor, curiously, is averaging.

This is easier to see with temperature measurements. Absolute zero in Kelvin is 0 K, while for Celsius it's -273° C. So Celsius is an affine measure: we get totally different numbers if we add two temperatures in °C than if we add them in K.

>>> K = lambda x: x + 273
>>> K(30 + 20)
323
>>> K(30) + K(20)
596

But we can meaningfully subtract two temperatures, and get the same results regardless of which measure we use.

>>> 30 - 20
10
>>> K(30) - K(20)
10

And in fact it doesn't matter which offset we use, either: if K = C - 123.46223 we'd still get the same result for K(30) - K(20) because the offsets cancel out.1

The offset explains why addition is weird and subtraction makes sense: if f(x) = mx + b, then f(x) + f(y) = m(x + y) + 2b, while f(x) - f(y) is just m(x - y). This also explains why averages work:

mean(f(x), f(y), f(z))
  = (mx+b + my+b + mz+b)/3
  = m(x + y + z)/3 + b
  = f(mean(x, y, z))

And again, it doesn't matter what m and b are: averaging is well-behaved for affine measures regardless of their "affinity".

But what about our poor computers

Just because we can mathematically average affine units doesn't mean computers can do it. I don't know of any languages where you can say "these units can't be added, unless it's part of taking the mean, then it's fine." So what if we avoided adding the units? Here's a different algorithm to find the average of N timestamps:

  1. Pick some arbitrary origin, like 1970-01-01 00:00:00 UTC.
  2. Subtract the origin from every timestamp, giving us a set of N timedeltas.
  3. Take the average of the timedeltas, giving us the mean timedelta x.
  4. Add x to the arbitrary origin.

The cool thing here is when I say an arbitrary origin, I mean arbitrary. It literally does not matter what you choose. And it just so happens that if I pick "12:00 AM today" as my origin, then the timestamp 6:32 becomes the timedelta "6 hours and 32 minutes". Then we're effectively subtracting 0 from every timestamp, taking the mean, and adding the mean to 0. Which is just adding the timestamps and dividing by N.

>>> origin = datetime(2023, 11, 9, 0, 0, 0)
>>> str(sunrise.time()), str(sunset.time())
('06:32:00', '16:35:00')
>>> str(sunrise - origin), str(sunset - origin)
('6:32:00', '16:35:00')

>>> noon = (sunset - origin + sunrise - origin) / 2
>>> str(noon)
'11:33:30'
>>> str(origin + noon)
'2023-11-09 11:33:30'

As for how we originally found the midpoint? That's just the special case where we pick origin = sunrise:

sunrise + ((sunset - sunrise) + (sunrise - sunrise)) / 2
= sunrise + (sunset - sunrise) / 2 

I don't have any programming insight or language implementation advice based on all of this. I just think it's neat how adding affine measures is mathematically invalid but there's a loophole that makes averaging them valid. Computers might not be smart enough to use that loophole, but we can!


  1. Things are a bit different when converting to Fahrenheit, since that has a scaling factor in addition to the offset. 

Some thoughts on software expertise

7 November 2023 at 16:54

(Note, there's some important administrative stuff after the newsletter.)

I think a lot about the nature of expertise.

While I write about all sorts of software topics, I'm actually paid to do formal specification work. Historically that's meant running workshops for companies or reviewing their specs. This means that I experience "being an expert" in relation to other people who are complete beginners, and that's influenced my thinking of what "expert" means. Often beginners get stuck on small things and pitfalls like using the wrong data type or not knowing the standard way to solve a problem. So I think of an "expert" as being someone who's internalized all the standard moves and can just get on with solving the problem.

In other words, the main difference between a beginner and an expert is the expert has "seen it all before". They can quickly solve standard problems and can pattern-match new problems to old ones. But once they've built their mental models, the beginner will be able to do anything the expert can... just a lot slower.

This year I was also involved in two extremely large specification projects, one in TLA+ and one in Alloy. The projects were for very large and very complicated systems, and both took a lot more work to specify than the average production system. I joined the projects as a trainer and coach, but in both cases I was much more useful as a sort of "technique inventor". Formal specification is still a surprisingly young field and there's a lot that people don't know how to do. I had to figure out how to do it.

For example, it's generally very hard to compose two specifications together. People want to have a library of prefabbed specs they can glue together in the same way you can do that in programming, but conventional wisdom is you shouldn't do this, you should write them all together as one big spec. That's usually the better approach.

But in the TLA+ project we needed to model complex interactions between several distinct components, and each component had complex internal behavior that we also needed to specify. The only way to do this is to make separate large specifications and combine them. Except specifications don't easily compose. My role, then, was figuring out how to compose them anyway.

I solved it with something like this: first define each component's abstract state machine in a very particular style, where the state transitions were encoded as data. Then refine that into an actual component that uses action conjunction to tie implementation state updates to transitions. With that, it's possible to "compose" specs by linking their transitions with "synchronization points", which ultimately made the whole thing work.

Don't worry if that didn't make any sense— it took a while to explain to competent TLA+ practitioners too, and they got to see diagrams and actual code samples. Once we were on the same page it worked incredibly well, though. I should do a blog writeup of the technique, I think it'd be useful to a lot of people.

(Not every technique I came up with was this complicated, in fact a lot were much easier. Like in this article on optimization, I share a trick to rapidly profile optimizations: adding TLCGet("level") < N as a state constraint. It was really easy to show people but I don't think it's something a non-expert practitioner would come up with on their own.)

Inventing these techniques felt qualitatively different from the "pattern matching on problems" I think of as expertise. It's like if I learned English purely as a technical language, and then realized that "gear" sounds the same as "fear", and from that self-invented poetry. It sounds a little pretentious saying it that way, but at the same time I'd guess it's something you've all experienced, because I'm sure you're all experts in something or another.

Besides that TLA+ project, I'm also building a large system in Alloy. We're still working on it but even now it might be the largest Alloy production model, ever. Certainly the largest one to use the new Alloy 6 features! We had to figure out how to actually use all of those features at scale.

Instead of coming up with a couple of big new techniques over a whole project, it's more like finding a new small technique every day. Like "how to show the exact step in a trace where a property is violated" or "how to find all atoms that satisfy P at any point in the trace" or "how to check a spec against a bunch of different classes of initial configurations."1 It's the kind of workhorse stuff that has the potential to become best practices once I get around to writing it all up.

This aspect seems really tied in with the nature of "being an expert". I couldn't tell you what a C# expert knows about C#, but I can tell you how a C# expert can think about C#.


Newsletter Scheduling Updates

First thing, there will be a regular newsletter this Thursday, and after that I'll be in Hong Kong for a couple of weeks. Between that and Thanksgiving there won't be any more newsletters until Decemberish.

Once the newsletter resumes it will be back on a weekly update schedule. I've done the 6-a-month experiment for a year now and don't think it's sustainable long term, and it's also made it hard to write anything else. There might still be a surprise newsletter occasionally but I'm not committing to the same pace.


  1. 90% of the time the answer is functions. Lots and lots and lots of functions. 

I shouldn't be so harsh on teaching TLA+ with puzzles

2 November 2023 at 20:24

I know I'm getting the newsletter out very late this week (this kind of lackadaisical nonprofessionalism is why I don't offer paid subscriptions) so I'll keep this one short.

Over the past couple weeks I've seen a few TLA+ articles where people show how to solve the wolf-cabbage-goat puzzle. You know the one where you have a farmer who has to cross a river in a boat, but he can only take one item each trip, and has to keep the wolf from eating cabbage or the goat from eating a wolf?1 It's a common example used to introduce formal methods, and it makes me worried that people'll think FM is only useful for solving toy puzzles. I prefer to teach with examples of real-world systems.

A similar example is the SEND MORE MONEY puzzle in all the constraint solving books: assign unique digits to letters such that SEND + MORE = MONEY. I've read books that had dozens of examples, all like this, and I was left wondering what I was supposed to do with constraint solving.

Anyway, I now think I'm being too hard on these kinds of examples. Two reasons for this. First, it provides more opportunity for practice. Getting good at these things takes time and work, and you need to understand the domain to model it. So practicing TLA+ on a bunch of puzzles might be easier than practicing on a bunch of real-world systems, since you don't need to spend as much time understanding the premises of a puzzle as you do a whole system. I noticed this while learning Picat: I did so in order to solve a practical problem, but I learned and practiced by writing Petri net solvers.2

Now the other reason is a little more philosophical: what makes computation interesting? "Computation" is not the same as programming. This is programming:

>>> print("hello world!")
hello world!

And this is programming that's also computation:

>>> 100**(1/17)
1.3111339374215643

Calculating the 17th root of a number isn't something I can do myself, or even with a pocket calculator. But with a computer I can do it easily, as long as I know a programming language that makes it easy to express the computation. On the other hand, puzzles like wolf-cabbage-goat and SEND MORE MONEY are hard to express. If you want to solve them in Python, you'll have to write your own solver engine, and that'll take orders of magnitude more time than originally expressing the problem does.

So it's interesting that the TLA+ solution can be under 20 lines. It shows that TLA+ can be used to compute things that ordinary programming languages cannot, at least not without a lot of extra programmer effort.

Learning TLA+ (or constraint solvers, or functional programming, or whatever) expands the range of computations the user can easily do, and showcasing this with puzzles is a good way to demonstrate that. Just make sure to teach actual use cases too!

As penance for my prior grognardery, here's Tower of Hanoi in Picat:

import planner.

final({Rung1, Rung2, Rung3}) =>
  Rung1 = [], Rung2 = [], sorted(Rung3).

action(S0, S1, Action, ActionCost) =>
  Action = S1,
  ActionCost = 1,
  member(From, 1..3),
  member(To, 1..3),
  S1 = copy_term(S0),
  S1[From] = [H|T],
  if S1[To] != [] then head(S1[To]) > H end,
  S1[From] := T,
  S1[To] := [H] ++ S1[To].

main =>
  Start = {[1,2,3], [], []},
  best_plan(Start, 20, Plan),
  writeln(Start),
  foreach(Action in Plan)
    writeln(Action)
  end,
  nl.

Next week will have two newsletters as normal, and then I'll be on vacation until after Thanksgiving. Have a great weekend everyone!


  1. Goats are terrifying. 

  2. There need to be more Petri net puzzles. 

Raku is surprisingly good for CLIs

26 October 2023 at 22:13

A while back I wrote Raku: a Language for Gremlins about my first experiences with the language. After three more months of using it I've found that it's quite nice for writing CLIs! This is because of a couple features:

MAIN params become options

Like many other programming languages, Raku has a special-cased MAIN function that acts as the entry point into your script. But unlike f.ex C, MAIN can have an arbitrary signature:

# strcmp.raku

sub MAIN($str1, $str2) {
  ... # blah blah blah
}

Then when you call the file:

> raku .\strcmp.raku
Usage:
  .\strcmp.raku <str1> <str2>

Raku automatically converts MAIN's parameters into script arguments! That's already pretty useful, and it gets better:

sub MAIN(
  $str1,         #= first string
  $str2 = "foo", #= second string
) {
  ... 
}
> raku .\strcmp.raku
Usage:
  .\strcmp.raku <str1> [<str2>]

    <str1>      first string
    [<str2>]    second string [default: 'foo']

So now we also have help messages and default values. To get flags and options, just define keyword parameters:

sub MAIN(
  $str1, #= first string
  $str2, #= second string
  Bool :$i, #= case insensitive?
  Int :$option #= some integer option
) {
  ...
}
Usage:
  .\strcmp.raku [-i] [--option[=Int]] <str1> <str2>

    <str1>            first string
    <str2>            second string
    -i                case insensitive?
    --option[=Int]    some integer option

Here's what the same CLI would look like in Python:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("str1", help="first string", type=str)
    parser.add_argument("str2", help="second string", type=str)
    parser.add_argument("-i", help="case insensitive?", action="store_true")
    parser.add_argument("--option", help="some integer option", type=int)

    args = parser.parse_args()

That's over twice as long to do the same thing— three times as long without the help strings. That said, there's a few special things I do with argparse that don't seem possible in Raku. But I still prefer the latter approach, especially when we throw in the other big feature:

Multimethods make subcommands easier

Raku has multiple dispatch: you can define multiple functions with the same name but different parameters, and then when you call the function it will pick the "most appropriate" method based on the supplied arguments.

# Instead of

sub myroot($val, $root = 2) { ... }

# You can write

multi myroot($val, $root) { ... }
multi myroot($val) { myroot($val, 2) }

This is already useful for defining parameter groups, like saying MAIN can accept either the --foo option or the --bar option but not both. But multimethods can also dispatch on values, like exact strings. That gives you subcommands!

multi MAIN( 
  "profile",
  $file,
  Bool :$verbose,
  :$output = $*OUT
) {
  ... 
}

multi MAIN( 
  "reformat",
  $file,
  Bool :$dry-run = False
) {
  ...
}
Usage:
  .\strcmp.raku [--verbose] [--output[=Any]] profile <file>
  .\strcmp.raku [--dry-run] reformat <file>

I use this to group a bunch of thematically-similar tools in the same file and distinguish calling them with the subcommands.

Miscellaneous useful things

There's a few small things that aren't as important as those two features, but still help with QoL.

  1. You can give flags "aliases" by writing Bool :v($verbose). Then you can call it on the CLI as either -v or --verbose.
  2. You can define the special hash %*SUB-MAIN-OPTS to configure how the arguments are parsed. F.ex if you add the :bundling key, then it will accept -abc in addition to -a -b -c.
  3. Raku provides some hooks to override CLI behavior, like if you define the ARGS-TO-CAPTURE function you can preprocess arguments before dispatching to MAIN.

One thing I've not been able to figure out is how to make variadic options, like having --src and --dest both take multiple paths. AFAICT the best you can do is make them both take flat strings and then postprocess them into paths. Maybe that'll get proper support in the future!

I wish people would do historical arguments better

24 October 2023 at 17:02

There's a common argument I hear that goes like this:

  • Pull requests were originally intended for low-trust environments, so we shouldn't use them on tight-knit teams.
  • Open source was originally coined to be palatable to corporations, so we should write Free software instead.
  • The web was originally for sharing text documents, so we shouldn't use it to run apps.
  • The wheel was originally developed for shaping clay, so we shouldn't use it to move things.

Okay I haven't seen the last one yet but it's the same kind of argument. All of these look at the historical context where a tool or practice arose, note that it's different from the current context, and conclude that it's unfit for its current purpose. I'm a firm believer in understanding the historical context of our practices, and I think that it does have weight in evaluating our current practices. But arguments based on history can be a bit misleading if you don't look deeper into the actual history. There's more work you have to do.

First of all, you have to establish what changed between then and now. Take the argument that "the web was originally for sharing documents". The World Wide Web entered the public domain in 1993. Macromedia Shockwave came out just two years later. For 93% of the web's total life, at least, the public used it as a universal platform. So saying "the historical context is different" just tells us that the "original intention" badly mispredicted how people wanted to use the internet.

You also need to understand how the current context and how people use it differ. The common historical argument against pull requests and code review:

I see pull requests as being a way of gate-keeping in low trust environments like an open source project. Why would we (developers) adopt pull requests on our teams yet we don't consider our team/workplace being a low trust environment?

I could totally believe that pull requests were originally a means of (mission-critical) "gatekeeping in low-trust environments", which are different from how companies develop software. But then I'd ask are companies using pull requests as a form of gatekeeping? Or are they adapting the form to their own needs? Like at all the places I worked, PRs were used to bundle fixes for an issue-tracker bug into one changeset to get a trail of what was fixed, when, by who. That's a really useful thing to have! Also, I can trust my team but still use PRs as a way of getting a fresh pair of eyes on my changes, to make sure I didn't do something boneheaded or introduce a subtle problem only the deployment/security/performance expert would notice. My current context is different from the historical one, but I've also adapted the tool to fit the current context.

(You could respond with "you can do that with PRs but pairing/mobbing is better for those things." And that's fine, but we should discuss that with respect to how PRs are used, not how they were originally used. Plus, if things have been adapted before, perhaps there's further adaptations we should be looking for, too?)

Reading this, I feel like there's a third thing you-gotta-do that I'm missing, but I'm not seeing it.

How to do it better

I think I did a better job with my post Why Do Interviewers Ask Linked List Questions, which argues that companies originally asked linked list questions to gauge C experience, and then we sorta just kept asking them. I never actually say "so we should stop asking them" but I really strongly implied it so it's the same kind of argument. Here's how I addressed my two objections:

  1. How did the context change? No hard evidence (history is cruel like that), but I wrote six paragraphs arguing that it was due to Joel Spolsky + Cracking the Coding Interview.
  2. Do people use it differently now? I claim the only difference between then and now is that people say it's about "CS fundamentals"/"reasoning through problems", and that these questions are bad tests of reasoning.

(Still not seeing that third thing.)

My point is that you can do a historical argument well, and you can do it poorly. So you might as well do it well.


Misc Project Updates

  • The Graph Project is going well! I've interviewed three people so far and am starting to build some theories on why we don't have graph libraries. It's also got me deeply curious as to why all modern languages do have mapping types. Like from a historical perspective, not a use-case perspective.

  • Logic for Programmers is far behind where I want it to be, and I don't think it'll actually be worth money anytime soon. My current plan is to release a free online website end of this year / early next year and then offer a PDF/print version to anyone who wants to pay.

  • I got like eight blog posts that are near- or at- first draft level, and I just need to stop being a perfectionist and finish them.

  • I've started posting "sketchpad" ideas over on the Patreon. Some of them might become blogposts or newsletters later.

Some thoughts on good spec properties

10 October 2023 at 16:27

Lots of updates this week! First of all, my talk Is Software Engineering Real Engineering is now available online! It's the video version of the crossover project. Link is to the video, slides, references, and FAQ. I know a lot of people have been waiting for this; I hope it lives up to expectations :)

Second, the TLA+ workshop is on Monday! There's still a couple slots if you want to sign up; use the code C0MPUT3RTHINGS for 15% off.

Like with any discipline, using TLA+ (and formal methods more broadly) takes a lot of different skills. One of them is choosing good properties to verify. I want to explore what we mean by "good properties". Let's take a real simple model of wire transfers:

EXTENDS TLC, Integers
CONSTANTS People, NumTransfers

Money == 1..10

(* --algorithm wire
variables
  acct \in [People -> Money];

define
  NoOverdrafts == "???"
end define;

process wire \in 1..NumTransfers
variable
  amnt \in 1..5;
  from \in People;
  to \in People;
  ready = FALSE;
begin
  Check:
    if acct[from] >= amnt then
      ready := TRUE;
      Deposit:
        acct[to] := acct[to] + amnt;
      Withdraw:
        acct[from] := acct[from] - amnt;
        ready := FALSE;
    end if;
end process;
end algorithm; *)

The bug here is that if we do two wires, they can race each other past the Check clause and leave someone with a negative balance. We want to write NoOverdrafts to catch this error.

There's two ways to do that. We can say that ready => (acct[from] >= amnt) {1}.1 We can also say that acct[from] >= 0 {2}. Notice they'll both fail for the same behavior of two concurrent wires, but {1} will fail a couple of steps earlier and give a shorter error trace. In my experience, that makes {1} closer to how we think about good programming practices: make errors appear as early as possible.

But I'd argue that {2} is a better property for specification:

  1. {1} says the bad thing is very likely to happen, but it's not guaranteed.2

  2. {1} is an internal property while {2} is external. An outside observer can't see the value of ready but they can see if acct is negative. So {2} being violated can actually have downstream effects on the external world, which makes it more important.

  3. {2} is mathematically stronger. Right now we only have one type of action in this system, Wire, but complex systems have lots of different actions. Model checkers are looking for a worst-case scenario across all sequences of actions, and that might not include a Wire action. So {2} will automatically catch things that {1} misses.

Note that writing {1} can still make sense, since it'll fail earlier with a shorter error trace. But {1} is only useful if you also have {2}. My point is that in specifications the broader properties are "more important" than narrow ones. Beginners to FM tend to write narrow properties because they're used to writing narrow tests, and that's something they have to unlearn.

Why the difference? Most people run test suites as part of their coding process, so they want the suite to run quickly. That puts lots of constraints on what you can easily check. Model checking a specification usually happens separately, so isn't as tightly time-constrained. Broader properties make sense.

On writing that, I wonder if I should think of {1} as a "regression property", kinda similar to a regression test. IE only write {1} after I see {2} fail and see the error trace, as a way of more quickly debugging regressions in the spec.

Pulling this all together. Some heuristics for good properties:

  • Represent an actual failure in the system, not just the potential for failure
  • Are observable differences in the external state, not the internal state
  • Are broad enough to cover many different narrower properties

In conclusion, I just realized I have like a million more proto-thoughts on TLA+ and the nature of specification all bouncing around inside. Maybe for a future newsletter.


  1. "If a wire is ready, its amnt is ≤ acct[from], otherwise ignore this requirement". 

  2. Wire X won't overdraft if Wire Y deposits money in between X's deposit and withdrawal. 

Picat is my favorite new toolbox language

3 October 2023 at 14:45

I'm teaching a TLA+ workshop in two weeks! Register here, use the code C0MPUT3RTHINGS for 15% off.

I always like to find new "toolbox languages". These are languages that can be used to quickly solve a set of problems with just the base language and without a whole lot of typing. That's what drove me to learn Raku, so I could quickly do things like "generate 5 random 10-character strings":

for ^5 {say ('a'..'z').roll(10).join}

One toolbox language I've had my eye on is Picat,1 a "logic-based multi-paradigm programming language aimed for general-purpose applications". I'm not sure where I first learned about it, maybe around the same time I encountered MiniZinc? The userbase is extremely small, and as with many small communities, the documentation assumes you have the same background as the inventors. But it's also an incredibly interesting language and I'd like to try explaining why.

"Logic-Based"

(Heads-up, I've only dabbled in Prolog a little and I'm sure I've got a hundred subscribers who are much better at logic programming than I'll ever be. Nonetheless I'm going to do my best to explain it.)

Logic programming is, very broadly, where you program by writing a set of equations and ask the program to find values for each variable that make all the equations true. For example, if I want to concatenate two strings in Picat, I'd write

Picat> [1,2] ++ [3, 4] = Z.

It will then try to find a matching value for Z:

Z = [1,2,3,4]
yes

The yes means that it was able to find a satisfying value for Z. The cool part is that this is bidirectional: it can also find inputs that give the output!

Picat> [1,2] ++ Y = [1,2,3,4].  
Y = [3,4]
yes

If there are multiple satisfying solutions (unifications), Picat can list them all. For example, writing member(X, [1, Y]) gives both X=1 and Y=X as possible solutions.

"Multiparadigm"

Picat is, at its heart, a logic programming language. It borrows a lot of core ideas and most (most) of its syntax from Prolog. But it also adds two paradigms on top of that.

The first is constraint solving. Like logic programming, constraint solving is about finding values that match a set of equations. Unlike logic programming, we're exclusively talking numbers. Usually we're also trying to find the solution that mini/maximizes some other metric.

  % Maximize area for given perimeter

  Vars = {Len, Width},
  Vars :: 0..10, 
  Len*2+Width*2 #<= 22, 
  solve([$max(Len*Width)], Vars).  

Combining constraint solving and logic programming isn't that unusual; SWI-Prolog has the CLP(FD) module. Picat's far more unusual paradigm is imperative programming:

{1, 2} ++ {3, 4} = Z,
foreach (I in 1..Z.len)
    Z[I] := Z[I] * 2
end,

Z = {2,4,6,8}

I never thought I'd see the day where a new language advertised "we're more imperative" as a selling point, but it's actually really nice. I've had a few cases where I could mostly express a problem logically but needed to use a bit of imperative assignment to get it all working. Maybe I wouldn't need it if I was an experienced prologer. But for n00bs like me it's a great escape hatch.

My favorite feature

The planner module took me from "this could be useful" to "damn this is cool". You define an initial state, a final state, and a set of actions, and it tries to find a sequence of actions that get you from here to there. It's like a model checker in reverse!

Here's an example problem: we have a list of numbers that we add one at a time to reach a target number. If we have input [2,2,3,4] and had to get 9, we could do 2,3,4 to go 2->5->9. Now let's also add in "forbidden numbers" you cannot pass through. If 5 was forbidden, you'd instead have to do 4,2,3 and go 4->6->9.

In Picat:

import planner.

action(S, NextS, Action, ActionCost) =>
   S = {Target, Current, L, Forbidden},
   member(X, L),
   Next = Current + X,
   not member(Next, Forbidden),
   NextS = {Target, Next, delete(L, X), Forbidden},
   Action = {add, X, Next},
   ActionCost = 1,
   true.

final({Target, Current, _, _}) => 
  Target = Current.

Our state is a 4-tuple consisting of the target value, the current value, the numbers we can add (including duplicates), and the forbidden numbers. We only have one action, which picks a number, checks that adding it to Current doesn't give us a forbidden number, and then sets the values in the next state (NextS). The ActionCost needs to be assigned but doesn't matter here.

main =>
    S0 = {19, 0, [6, 6, 9, 5, 9, 2], []},
    S1 = {S0[1], S0[2], S0[3], [17]},
    plan(S0, Plan0),
    plan(S1, Plan1),
    writeln(Plan0),
    writeln(Plan1),
    nl.

We could also call plan(S0, Steps, Plan0, Cost) in which case it would only find plans up to Steps long and with a cost under Cost.

% output
[{add,6,6},{add,6,12},{add,5,17},{add,2,19}] % Plan0
[{add,6,6},{add,6,12},{add,2,14},{add,5,19}] % Plan1

How cool is that?!

This gets especially useful when you combine it with the other language features. One thing I've done is put a constraint problem inside of final and then add an action that loosens the constraints. So it first tries to solve the basic problem, and if it can't it gradually loosens the constraints until it's possible.

My use cases

Right now Picat is too unpolished to be a "general-purpose" language. The documentation is especially sparse and highly technical; I had to do a good amount of experimentation to figure out how everything works. I still have no idea how the index keyword works or how to use the actor features. I don't know if it's suited to become a general-purpose language, the kind of thing you'd write a webserver in.

But it's a fantastic toolbox language. I was originally motivated by a vacation scheduling problem: we had a set of activities we wanted to do, each with a set of restrictions. IE certain activities couldn't be done on the same day, all the things that required a car had to happen in a 24-hour span, stargazing needed a clear sky and being outside the city, stuff like that. Picat elegantly handled all the requirements and spit back schedules.2 Two other ways I've used it:

  1. Testing if a logical puzzle has a unique solution or not. I've done this in both MiniZinc and Alloy, but Picat handles this better than either. I want to try using the planning module to also test puzzle variations.
  2. I wrote a petri net simulator in <70 lines of code. Check it out here!

I hope this convinced you to check Picat out! Two resources to get started:

  1. A User's Guide to Picat: First chapter is a real barebones tutorial, rest of the book is reference material. Lots about implementation details, not much on good practice.
  2. Constraint Solving and Planning with Picat: A lot more conventional introduction, really heavily emphasizes constraint solving over other paradigms. Also written ~circa Picat 1.3, it's up to 3.5 now.

There's also a WebIDE, but it doesn't show errors.


  1. PIE-cat 

  2. I learned Picat ~2 weeks after the vacation, we did the original schedule by hand 

Was Javascript really made in 10 days?

28 September 2023 at 14:30

I once heard that Javascript has so many warts because the first version was made in just ten days. I was curious 1) if this is true, and 2) if it explains the language's warts.

After some research, I can unconfidently say: it's complicated.

The "first version" of JavaScript did in fact take ten days. The exact dates aren't confirmed, but Brendan Eich recalls it being May 6-15, 1995. But this was only a minimal prototype ("Mocha") for internal demonstration. JavaScript 1.0 was publicly released in March 1996 (pg 10) and the first "complete" version in August 1996 (ibid). Even after that point, the Netscape team regularly tweaked the design of JS; Eich recalls that "Bill Gates was bitching about us changing JS all the time" in the Fall of 1996.

Eich also had about ten years of experience with language design and compiler development and was explicitly hired by Netscape to put a programming language in the browser (pg 7). Originally this was supposed to be Scheme, but then Netscape signed a deal with Sun and agreed to make it more "Java-like".

Does this explain the warts?

Most of JavaScript's modern flaws are arguably not due to the short development time:

  • Mocha didn't originally have implicit type conversion, but users requested Eich add that to 1.0 (video link). He deeply regrets this.
  • JS 1.0 added null to be more compatible with Java (pg 13). Java compatibility is also why typeof null = object.
  • Any JavaScript API warts must have come afterwards because all the API work happened after Mocha. Mocha's a pretty minimal language!
  • The "all numbers are floats" problem was originally in Mocha, but I think that was always the intended behavior. The JavaScript 1.0 manual cites HyperTalk as a major inspiration. I've never used HyperTalk skimming the manual suggest to me it does the same thing (pg 102, 517).

I found one way the 10-day sprint definitely did harm JavaScript: Brendan Eich didn't have time to add a garbage collector, and later attempts to add that in added a bunch of security holes (43:04).

This newsletter takes the new record for "most hours of research per word."

Formal Methods can't fix everything and that's okay

26 September 2023 at 15:00

Finally done with conferences and glad to be back in Chicago. Next big event is the TLA+ workshop on October 16, which still has slots available! The faster I can fill those up, the sooner I can stop putting advertisements in my newsletters. I don't like doing it any more than you like reading it. Use the code C0MPUT3RTHINGS for 15% off.

Keeping on the topic of TLA+, I saw this question on linkedin:

I've always wondered - is there much use of building formally verified software on top of non-formally verified systems (like the OS the software runs on)? To me, the absolute layman here, it feels like we're building proofs on top of potentially/probably flawed axioms. If we can verify Raft is a solid algorithm but it's built on non-verified networking devices do we end up putting a wrong amount of trust into the system?

This is the impossible problem of trusting trust: is formally verified code really provably-correct if it's compiled with an unverified compiler? What if the code's correct and it's running on an unverified OS? You could use a verified OS like seL4 but what about the hardware itself? What if the hardware design is proven but the manufacturing process adds a backdoor? Verifying the whole process is too hard for anybody but the richest countries. So why do formal verification at all?

Mossad gonna Mossad

In times like these, I turn to the wisdom of James Mickens.

If you haven't heard of Dr. Mickens, he's a professor of distributed systems research at Harvard. Also he writes like this:

BREAKING NEWS: I've received tenure at Harvard! I want to thank all of the enemies that I had to destroy to achieve this great honor. Roger Davis at Princeton’s department of nutrition—you questioned my research on the efficacy of an all-Pop-Tart diet, but I am living proof that the diet works. Yes, I have nose bleeds every day and my pancreas has the dysfunction of a failing Soviet client state, but I believe that having constant double vision makes me twice as optimistic about life. [3500 more words]

Basically all of his writing and talks are great but the relevant one here is This World of Ours, his rant about computer security.

If the security people are correct, then the only provably safe activity is to stare at a horseshoe whose integrity has been verified by a quorum of Rivest, Shamir, and Adleman. Somehow, I am not excited to live in the manner of a Pilgrim who magically has access to 3-choose-2 {Rivest, Shamir, Adleman}, mainly because, if I were a bored Pilgrim who possessed a kidnapping time machine, I would kidnap Samuel L. Jackson or Robocop, not mathematical wizards from the future who would taunt me with their knowledge of prime numbers and how “Breaking Bad” ends. […] My point is that security people need to get their priorities straight.

He helpfully lists the three different priorities people should worry about, which I'll painstakingly transcribe here:

| Threat | Solution |
|--------|----------|
| Ex-girlfriend/boyfriend breaking into your email account and publicly releasing your correspondence with the My Little Pony fan club | Strong passwords |
| Organized criminals breaking into your email account and sending spam using your identity | Strong passwords + common sense (don’t click on unsolicited herbal Viagra ads that result in keyloggers and sorrow) |
| The Mossad doing Mossad things with your email account | Magical amulets? Fake your own death, move into a submarine? YOU’RE STILL GONNA BE MOSSAD’ED UPON |

Yes this was all an elaborate way of saying the word "threat modeling", but if I just said "threat modeling" I wouldn't have been able to introduce you to James Mickens, now would I?

Mossad gotta budget

Threat modeling is fundamentally about cost-benefit analysis, on both sides of the threat.

Attacking a weak password is really easy, so it pays to attack even low-value targets. Adopting a strong password is also easy. If you're a low-value target, this will protect you from a lot. But if you're a high-value target, attackers will be willing to spend more resources attacking you. So you'll need to invest more in your defenses, which makes fewer people able to attack you. As this escalates, both the attacks and defenses get more and more expensive, as ever smaller groups of people can afford them.

This ends up being what protects us from the Mossad. We're ultimately beneath their notice because they want to spend their resources aiming at higher-value targets. Even so, we're vulnerable to people looking for easier meals, which means we still need strong passwords and 2FA and phish-literacy and stuff.

How this ties back to formal methods

Like with security, how much we do for correctness depends on how valuable our work is, and how easy it is to "attack" it.

Most work is "low-value", and bugs aren't going to cause much damage or cost a lot of developer time. This kind of work doesn't even need formal methods; tests, type systems, and code review in some combination are enough to deal with them.

Some work is extremely high value. Militaries want verified OSes in addition to verified code. For a long time all of the seL4 users were militaries, though I think that's changed in the past few years. I'd guess they also have really strict manufacturing standards or something to prevent supply chain attacks.

Finally, some work is high-value enough that we should verify the code/design but not valuable enough to be worth investing in a verified OS or hardware stack. There might still be bugs from the OS but {cost of OS bugs} < {cost of verifying the OS} while also {cost of verifying code} < {cost of code bugs}.1
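
To make that trade-off concrete, here's a toy version of the comparison. Every number is made up purely for illustration; the point is only that the two inequalities can point in different directions for the same project:

```
// TypeScript — hypothetical cost model, all figures invented for illustration
const costOfOsBugs        = 10_000;     // hypothetical expected damage from OS-level bugs
const costOfVerifyingOs   = 1_000_000;  // hypothetical cost of adopting a verified OS stack
const costOfCodeBugs      = 500_000;    // hypothetical expected damage from design/code bugs
const costOfVerifyingCode = 50_000;     // hypothetical cost of verifying the design/code

const verifyOs   = costOfVerifyingOs < costOfOsBugs;     // false — not worth it
const verifyCode = costOfVerifyingCode < costOfCodeBugs; // true — worth it
console.log({ verifyOs, verifyCode });
```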

In fact we almost never want to verify the whole app, either! Generally we only apply formal methods to a subset of the system, like a single module or a single part of the abstract design, that is most important to get right. The point is that we apply it in ways that still add value.

I think this is where FM's potential as "provably correct" harms its adoption. We know type systems and unit tests can't catch all possible errors, so we don't expect them to; we're happy with however many errors they do catch. But we know it's possible for FM to catch everything, so we're reluctant to go only partway with it.


  1. I'm mixing verifying code and verifying designs here. I believe that often verifying code is not worth it while verifying designs is. 
