Using typing in Python leads to different sorts of code

By: cks
28 May 2026 at 01:56

So what happened is that I converted a big pile of (highly untyped) Python 2 to Python 3 recently, and then I wanted to experiment with typing-heavy Python LSP servers in GNU Emacs, so I decided to try them out by experimentally adding some type annotations to DWiki, the aforementioned pile of untyped Python (and the code powering Wandering Thoughts). The experience was educational and taught me some new things about type annotations, but it also firmed up my view that typed Python code is different than untyped Python code (although not quite to the extent that they create a different language, as I sort of felt before). There are idioms that are perfectly natural in untyped Python that are pretty annoying to deal with in typed Python.

One of these idioms is dictionaries with multiple types of values. For instance, DWiki has a dictionary that is basically 'a collection of information about the HTTP request'. The authentic type of the values in this dictionary is "str | bool | SimpleCookie | dict[str, str]", which is to say that values can be any of a string, a boolean, a HTTP Cookie, or a dictionary of string key/value pairs. Of course, individual keys in the dictionary have a fixed type for their value; for example, the key 'request-fullpath' only ever has a string value, so in untyped Python code it's natural to write something like:

if reqdata['request-fullpath'] and \
   reqdata['request-fullpath'][-1] != '/':
    [...]

If you do this in typed Python, your type checker will almost certainly complain that this indexing isn't valid for booleans and HTTP Cookies. You need to either check or type-assert that the value is a string.
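As a minimal sketch of the "check" option (the function name is made up for illustration, and the value type is the one described above), an isinstance() test both narrows the type for the checker and guards against a wrong value at runtime:

```python
from http.cookies import SimpleCookie
from typing import Union

# The value type described above, spelled with Union.
ReqValue = Union[str, bool, SimpleCookie, dict[str, str]]


def needs_slash(reqdata: dict[str, ReqValue]) -> bool:
    """Hypothetical helper: does the request path lack a trailing '/'?"""
    path = reqdata["request-fullpath"]
    # isinstance() narrows 'path' from ReqValue to str, so the indexing
    # below is acceptable to a type checker.
    if not isinstance(path, str):
        raise TypeError("request-fullpath must be a string")
    return bool(path) and path[-1] != "/"
```

The extra check is pure ceremony when the key really is always a string, which is the annoyance being described.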

In untyped Python, this is a perfectly decent data structure (although it might not be good style). In typed Python, this is a bad data structure that will cause you pain. There are ways around the pain that preserve the underlying dictionary, but they exist almost entirely to pacify the type checker. A proper data structure in typed Python is not multi-typed like this, or at least it's not multi-typed with a lot of keys.

(One way is to use typing.TypedDict, but if you have a lot of keys it gets painful).
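For illustration, here is roughly what the TypedDict version could look like. Only 'request-fullpath' is a real key from the text; the other key names and the helper function are assumptions invented for this sketch.

```python
from http.cookies import SimpleCookie
from typing import TypedDict

# The functional syntax is needed because keys such as 'request-fullpath'
# are not valid Python identifiers. Every key except 'request-fullpath'
# is hypothetical.
ReqData = TypedDict(
    "ReqData",
    {
        "request-fullpath": str,
        "is-head-request": bool,
        "cookies": SimpleCookie,
        "query-params": dict[str, str],
    },
)


def lacks_trailing_slash(reqdata: ReqData) -> bool:
    # The checker knows this key is always a str, so no narrowing is needed.
    path = reqdata["request-fullpath"]
    return bool(path) and path[-1] != "/"


reqdata: ReqData = {
    "request-fullpath": "/blog/python",
    "is-head-request": False,
    "cookies": SimpleCookie(),
    "query-params": {},
}
```

Every key has to be listed with its type, which is where the pain with a lot of keys comes from.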

There's a good reason for this insistence in typed Python, because right now there's nothing preventing me from putting in the wrong type of value for a particular key in this dictionary. I could slip up and set some key that's supposed to have a string value to a boolean, or a key that's supposed to have a dictionary to a plain string. Typing can't detect those errors because any of those are valid for the dictionary in general, just not for that particular key. A proper data structure in typed Python is one where the type checker itself can check your invariants, so string values are separated from boolean values and so on. This would probably also be clearer code.

This is a general issue for any sort of variable-typed container object, return values, or the like. I saw a similar thing when typing my program that uses the email packages; the email packages have old-school polymorphic API return values that typing is not fond of and that required type checks or casts. This is fairly reasonable on the part of type checkers (they're unlikely to ever do full flow control analysis to determine actual types), and is clearly part of the style of typed Python.

(Another case of this in DWiki is that I have a general caching layer that uses pickle to store and retrieve arbitrary objects. The callers know what they're storing and retrieving under a particular key, but this isn't visible in any types I could assign.)
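One possible way to get some typing back in a cache like this is to have callers say what type they expect and check it at runtime. This is only a sketch under assumptions: the in-memory store and the function names are invented here and are not DWiki's actual caching layer.

```python
import pickle
from typing import Optional, TypeVar

T = TypeVar("T")

# Stand-in for a real cache backend; maps keys to pickled bytes.
_store: dict[str, bytes] = {}


def cache_put(key: str, value: object) -> None:
    _store[key] = pickle.dumps(value)


def cache_get(key: str, kind: type[T]) -> Optional[T]:
    """Return the cached object for 'key' if it exists and is a 'kind'."""
    blob = _store.get(key)
    if blob is None:
        return None
    obj = pickle.loads(blob)
    # The runtime check backs up the declared return type; an object of
    # the wrong type is treated as a cache miss.
    return obj if isinstance(obj, kind) else None
```

The cost is that every caller has to repeat the type it expects, and isinstance() can only check the outer type (a dict, say), not the types of what's inside it.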

As far as I can see, typing also changes how you want to structure multi-file code with classes and other data structures. In untyped Python such as DWiki, it's natural to have one source file declare a data structure, create an instance of it, and pass it as an argument to a function (or a class) from another file that the first file imports. In typed Python, this doesn't work so well. Because everything that either takes data structures as arguments or returns them wants to name the data structure in type hints, you need the classes for those data structures to eventually be accessible in everything that touches them, which means a tangle of circular imports.

(This is different from forward references in that the code that accepts instances of these data structures will normally never import the code that defines them, cf.)

Circular imports work, technically (as I've sort of written about before), but they make me unhappy. I lack enough experience with typed Python to know the correct approach, but it certainly feels like one should define as many data structures as possible in low level files that are relatively standalone so they can be imported into everything without circular imports. I'm not sure how this works once you want to put methods on your classes that take other classes as arguments and so on.

(Mypy has some suggestions but its answers don't make me feel happy.)
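One commonly suggested approach is to import the class only for the type checker's benefit, using typing.TYPE_CHECKING and a quoted annotation. In this sketch SimpleCookie merely stands in for a class defined in another module of your own project, and the function is hypothetical:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only the type checker follows this import. At runtime it is skipped,
    # so it can't take part in an import cycle. SimpleCookie is a stand-in
    # for a class from another file in your own code.
    from http.cookies import SimpleCookie


def cookie_names(cookies: "SimpleCookie") -> list[str]:
    # The annotation is a string, so the name never has to exist at runtime.
    # The code only uses the object through its methods.
    return sorted(cookies.keys())
```

This keeps the runtime import graph the way it was in the untyped code, at the price of the type-checking view of the code being different from the running one.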

Another practical issue I ran into was that DWiki has a stack of middleware functions to fiddle with HTTP requests. All of the middleware functions take a standard set of four arguments, each with a specific type, and I have enough of these functions that going through and adding the appropriate type annotation to each argument for each function (and the return value) was clearly a pain (in my experiment I only did this for a few). I found myself really wishing for a way to say that the function as a whole had a particular type shape, which would automatically infer the argument and return types. I think the proper way to do this is to pass each function fewer arguments (ideally one), but I'm not sure I like it (and the four arguments aren't tightly coupled to each other).
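What typing does offer today is a named Callable alias for the shape. It lets a checker verify functions against the shape at the point where they're registered, but it doesn't fill in the annotations on each 'def', so each function still has to be annotated by hand. In this sketch all of the type names, the four arguments, and the functions are hypothetical stand-ins:

```python
from functools import partial
from typing import Callable

# Hypothetical stand-ins for the real argument and result types.
Config = dict[str, str]
Environ = dict[str, str]
Request = dict[str, str]
Response = dict[str, str]
Handler = Callable[[Request], Response]

# The 'type shape' shared by every middleware function.
Middleware = Callable[[Handler, Config, Environ, Request], Response]


def add_server_header(nxt: Handler, cfg: Config, env: Environ,
                      req: Request) -> Response:
    resp = nxt(req)
    resp["Server"] = cfg.get("server-name", "unknown")
    return resp


def final_handler(req: Request) -> Response:
    return {"Body": "page for " + req["path"]}


# A checker compares each listed function against the Middleware shape.
STACK: list[Middleware] = [add_server_header]


def run(cfg: Config, env: Environ, req: Request) -> Response:
    handler: Handler = final_handler
    for mw in reversed(STACK):
        # Bind the first three arguments, leaving a Request -> Response call.
        handler = partial(mw, handler, cfg, env)
    return handler(req)
```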

(I also wound up feeling that I should create a 'types.py' file that had all of the basic type definitions that didn't depend on classes and so on. This would be things like the shape of callable functions, that 'data about the HTTP request' dictionary, and so on. Many of these are used in multiple files in DWiki and this avoids various sorts of annoyances. I don't know if such a 'types.py' file is considered a code smell.)

I don't regret my scratch experiments with adding some types to DWiki (partly because I learned more useful things about Python typing), but it's clear that doing it properly is somewhere between infeasible and impossible (and Python typing acknowledges that this can be the case). A reasonable typed version of DWiki would be structured significantly differently, and getting from the current code to any new type-friendly structure would be a significant rewrite (which would fix some old mess but likely introduce new mess).

(The semi-typed results of my experimentation are messy enough that I'm going to discard that copy of the source code.)

(I said something about type hints on the Fediverse and some interesting things came up in the replies, eg.)

My views on some Python LSP servers in GNU Emacs (as of mid 2026)

By: cks
27 May 2026 at 02:06

Some languages have to make do with one LSP server. By contrast, Python has an embarrassment of riches; I know of at least five modern LSP servers for it. I've recently been experimenting with some of them in GNU Emacs, specifically Eglot, so before I forget I want to note down my views. The five Python things with LSP servers that I believe are modern and current are python-lsp-server ('pylsp'), Facebook's pyrefly, Astral's ty, Microsoft's pyright, and technically Astral's ruff.

The easiest to talk about is ruff, because it's not intended as a full-featured LSP server that does everything; instead it only does code diagnostics and formatting, and you need another LSP server for code navigation. Currently Eglot doesn't easily support multiple LSP servers and code navigation is a lot of what I care about, so direct use of ruff is off the table for me. Also off the table is pyright, since I don't have any interest in touching a Microsoft Python project or finding out how badly it works with anything other than VS Code (although there's basedpyright as a less-Microsofted pyright option).

Python-lsp-server is my default choice and is a solid basic LSP server with the code navigation features I normally care about, along with support for code diagnostics through either or both of mypy and ruff (via python-lsp-ruff). Python-lsp-server is also what I'd call a 'quiet' LSP server by default, without a lot of stuff popping up and being filled in in Eglot. It's supported by the community and is probably going to endure, but it's written in Python (so it's not the fastest thing) and my impression is that it's more focused on code navigation than on type checking your code. My view is that it's probably your best option if you have a lot of untyped Python code, which is my normal case.

(So after playing around with both ty and pyrefly for some time, I'm probably going to stick with python-lsp-server most of the time.)

Both ty and pyrefly are strongly into type checking and type annotations, in addition to supporting code navigation. Both support 'inlay hints' in Eglot, which fill in known or deduced types for you (and can also attach names to positional arguments in function calls; ty defaults this to on, pyrefly to off). There are some differences in what types they fill in, for example ty will tell me 'Unknown' for types while pyrefly is silent about them (with no inlay hint), and I suspect that there are differences in what types they deduce for things. I don't have enough experience with Python type checking to have strong opinions on the general choice between ty and pyrefly. Both support more or less all LSP code navigation features (ty's LSP documentation, pyrefly's LSP documentation), with pyrefly currently having one more supported navigation ('go to implementations', which lets you find the reimplementation of methods in sub-classes, and now that I've tried it that's kind of handy and it's not currently supported by python-lsp-server).

(Eglot allows you to easily toggle inlay hints off and on with 'eglot-inlay-hints-mode', in case you don't like the noise of them but do want, for example, pyrefly's code navigation. I'm not sure how much unwanted type diagnostics and notes pyrefly or ty will spit out at you on untyped, anarchic Python code bases.)

As before, I think setting up Python LSP support in GNU Emacs is worth it, especially if you're working with typed Python and pick a good LSP server for this. LSP server code navigation is really quite nice and will work across files in your Python project (and pyrefly's support for 'find everything that overrides this method' is handy if you have that kind of code base).

(GNU Emacs can do some amount of code navigation in Python code without a LSP, but you want to create and maintain a tags table and in brief experimentation the experience is not as smooth and more annoying.)

If you want the most deluxe Eglot based Python LSP experience, I think you want to set up pyrefly with however many inlay hints you want. Since I slogged through the effort to determine what special Eglot configuration you need for this, I will save people the effort:

(setq-default eglot-workspace-configuration
   '([...]
     :python (:analysis (:inlayHints (:callArgumentNames "partial")))
    )
 )

As (sort of) covered in pyrefly's LSP documentation, pyrefly doesn't use its own name for these settings, it uses names that pyright apparently originated. Fortunately Eglot will send (all of) your settings to whatever LSP you're currently running, regardless of their names. I believe you can also configure this in per-project configuration files, which would also let you entirely disable pyrefly type checking in places where you don't want it (per the configuration documentation).

(Some bits of the pyrefly experience in GNU Emacs will get more deluxe in GNU Emacs 31, when Eglot will acquire support for reporting things like call and type hierarchies.)

Sidebar: A brief experience with basedpyright

I ran a little poll on the Fediverse and a surprising number of people (to me) turned out to use pyright or basedpyright, so I gave it a try. The result is, effectively, a failure for my code. Even code that I thought was well typed and free of problems came out full of diagnostics in basedpyright's default configuration. It does have more or less the same code navigation features as pyrefly, but for me the cost of getting them is too high.

But if you want to write extremely strictly typed and careful Python code, basedpyright will make you do it (assuming you make it have no errors and keep its strict default settings).

(The poll also suggested that very few people use pyrefly, which surprised me a bit.)

Notes about reading messages with the Python email packages

By: cks
20 May 2026 at 03:01

I have a long standing personal program to display MIME formatted email messages in the terminal in a sensible way (it was mentioned in this old entry on my email tools and its comments). For a long time this was a Python 2 program, using the Python 2 version of the email package. Recently, I moved this program to Python 3 as part of my sudden enthusiasm for Python 3 conversions, using the Python 3 version of email and its sub-packages. In the process I have wound up with some notes and opinions on practical use of the Python 3 email packages.

(The Python 2 version of email had its own quirks and oddities, but I worked all of those out the hard way years ago, have mostly forgotten them since, and they're not interesting any more now that the era of Python 2 is over.)

The Python 3 email documentation will tell you that the modern interface for email messages is email.message.EmailMessage. The older email.message.Message is (theoretically) only there for Python 3.2 compatibility and you should ignore its methods and use only the EmailMessage methods. This is not entirely the case. If you look behind the curtain, you'll discover that many of the EmailMessage APIs for reading message contents are in fact Message APIs with masks on, and especially they're various masks for Message.get_payload(). That get_payload() isn't obsolete in practice matters, because it turns out that get_payload() is the only way to do certain things you (I) need.

As with decoding email headers, my strong impression is that the entire set of email parsing and message reading APIs are only really designed to deal with well formed email messages with fully correct MIME. This isn't what you find out in the real world, both due to programs being imperfect and also due to things like other mail systems sending you a bounce message that includes a message/rfc822 version of the original message where the other mail system has retained all of the message headers, including the Content-Type that says the original message was a multipart/alternative, but has replaced the entire body of the message with '(Body suppressed)'. As far as I can tell, there's no EmailMessage API that will give you (just) the body text of that (malformed) message/rfc822; your only way to dig it out is to use the older Message.get_payload() API.

(That bounce example is a real case that I've seen.)

At the same time, EmailMessage.get_content() is a handy API that does a lot of the work for you for things like extracting a de-mangled, Unicode version of a text part (or anything that's sufficiently text-like, although for those you will get back a bytes object instead of a str and have to decode it yourself). So I use get_content() as much as possible but some things have to fall back to get_payload(). The one thing I'm cautious about with get_content() is that it has a cheerful trust in the asserted character set encoding of the MIME part, when I'm pretty certain that some mail creation programs blithely assume you'll typically interpret stuff as UTF-8 (especially if it has no type specified, which in theory means ASCII).
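As a rough sketch of this "get_content() first, get_payload() as a fallback" approach (the helper name and the two sample messages are invented for illustration, and decoding undeclared bytes as UTF-8 is an assumption, not something the email package does for you):

```python
import email
from email import policy
from email.message import EmailMessage


def body_text(part: EmailMessage) -> str:
    """Best-effort text for a leaf part; a sketch, not a complete solution."""
    try:
        content = part.get_content()
    except KeyError:
        # There's no content handler for the declared type, for example a
        # part that claims to be multipart/* but has no actual sub-parts.
        # Fall back to the older API for the raw payload.
        content = part.get_payload()
    if isinstance(content, bytes):
        # Assumption: treat text-like bytes as UTF-8, replacing bad bytes.
        return content.decode("utf-8", errors="replace")
    return str(content)


GOOD = (b'Content-Type: text/plain; charset="utf-8"\n'
        b"Content-Transfer-Encoding: quoted-printable\n"
        b"\n"
        b"caf=C3=A9\n")
# Headers claim multipart/alternative, but the body has been replaced.
BROKEN = (b'Content-Type: multipart/alternative; boundary="xyz"\n'
          b"\n"
          b"(Body suppressed)\n")

good = email.message_from_bytes(GOOD, policy=policy.default)
broken = email.message_from_bytes(BROKEN, policy=policy.default)
```

Here body_text(good) goes through get_content() and gets decoded text, while body_text(broken) has to take the get_payload() path to recover the '(Body suppressed)' text.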

(get_payload() will also probably give you heartburn if you're trying to use typing, but this is a general email problem with API typing.)

The email package parses your messages with stuff in email.parser, which has some additional notes on how it theoretically parses things. Some of these notes are experimentally false, especially the one for message/delivery-status. The actual story is in comments in the source code:

message/delivery-status contains blocks of headers separated by a blank line. We'll represent each header block as a separate nested message object, but the processing is a bit different than standard message/* types because there is no body for the nested messages. A blank line separates the subparts.

Although the actual text of a message/delivery-status part is plain text (admittedly in a specific format, in theory), the parsed version is a multipart EmailMessage object containing a series of text/plain EmailMessage children, where the actual contents are in the headers of those text/plain children (and the 'body' is empty). The best way to extract the actual contents as text to print or process them is to use EmailMessage.as_string() on each child. This is quite confusing if you expect a message/delivery-status to have obvious contents or to match the documentation (and EmailMessage.get_content() doesn't work right on the multipart parent object; this may be a bug that will be fixed at some point).
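A small sketch of the as_string() approach follows. The sample delivery status text and the helper name are made up, and a real message/delivery-status part would normally be found inside a multipart/report message instead of being parsed on its own like this.

```python
import email
from email import policy

# A made-up delivery status body: header blocks separated by blank lines.
DSN = (b"Content-Type: message/delivery-status\n"
       b"\n"
       b"Reporting-MTA: dns; mail.example.org\n"
       b"\n"
       b"Final-Recipient: rfc822; user@example.net\n"
       b"Action: failed\n"
       b"Status: 5.1.1\n")

part = email.message_from_bytes(DSN, policy=policy.default)


def delivery_status_text(part) -> str:
    """Rebuild readable text from the header-only children."""
    if not part.is_multipart():
        return ""
    # Each child carries its information in its headers, so as_string()
    # (headers plus an empty body) is what we want here.
    blocks = [child.as_string().strip() for child in part.get_payload()]
    return "\n\n".join(block for block in blocks if block)
```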

PS: The reason you don't want to use .as_string() on text or broken MIME parts is that MIME parts have headers, namely the various Content- ones, and .as_string() will give you those headers as well as the text you want. There's no option in the EmailMessage API to not get the headers.

Sidebar: Types for email stuff

Because sometimes I get enthusiasms, I added types to my program that's using email. It was somewhat painful and the kind of thing that you describe after the fact as "a valuable learning experience". In order for future me to not lose that learning experience, here's some notes.

My first problem was that often, mypy inferred that something was an email.message.Message instead of an email.message.EmailMessage; the latter is a subclass of the former. Much of this could be fixed with isinstance() to create type narrowing. I found the most convenient way to do this to be an assert(), for example:

prs = email.parser.BytesParser(policy=...)
m = prs.parse(fp)
assert(isinstance(m, EmailMessage))
[...]

Here I know that email.parser.BytesParser will return an EmailMessage because that's what my policy is set up to do (cf), but mypy can't see that.

A more involved situation is the return value of Message.get_payload(), which mypy typically typed as including 'list[Message]' when I know that what I have is a 'list[EmailMessage]'. Fixing this requires typing.cast():

def showalternative(p: EmailMessage) -> None:
  m = p.get_payload()
  if isinstance(m, str):
    [...]
    return

  assert(isinstance(m, list)) # for safety
  m = typing.cast(list[EmailMessage], m)
  [...]

You need to use typing.cast() to correct mypy's idea of the member type of a list or other container.

(Technically mypy and any other type checker that does similar inference. I don't know my way around the Python typechecker landscape, although I've wound up with a few of them installed.)

I've finally ported DWiki from Python 2 to Python 3

By: cks
15 May 2026 at 00:17

DWiki is the pile of code that underlies Wandering Thoughts. It started out many years ago as a Python 2 program (partly because there was no Python 3 at the time), and it stayed that way for a long time, making it the most significant and by far the most substantial Python 2 program I still cared deeply about. Years ago I said I'd port it to Python 3 someday and somewhat to my surprise, that day has now come (well, it came yesterday).

The direct trigger was discovering that Python 3.13 had dropped 2to3, which made me feel that I should run 2to3 over DWiki's current Python 2 code base while I still could (I had an old conversion from many years ago, but that converted code base was very out of date). One thing led to another, as it often does with me, and I wound up doing a full port and then putting it into production, which is to say serving this blog. I suspect that part of me just felt it was time.

(The 2to3 removal is in the Python 3.13 release notes, and it comes after 2to3 and its infrastructure were deprecated in 3.11 for reasonable reasons.)

As I expected years ago, the stuff that 2to3 could handle was the easy part. Much of the actual work of the port was sorting out the boundary between Unicode strings and byte strings in a Python 3 world. Some of this would have been easier if I'd found PEP 3333 earlier and followed it in my own discount WSGI implementation, but a bunch of it I had to find the hard way, by trying things and having them blow up, sometimes in production.

(I wound up in the same place as PEP 3333 just from the inherent requirements of the web. For example, the HTTP Content-Length is in octets, so if you're using it to read a POST body, the object you're reading from has to be providing bytes. And it turns out that you can't write HTTP headers to a text mode file object because that will turn \r\n sequences into \n, which will make things unhappy with you.)

Not all of the changes were at the IO boundaries of DWiki (and the IO boundaries themselves weren't always simple or obvious). Python 3's handling of cryptographic hashes requires bytes, which rippled through to several places where I use them in DWiki (and the hmac API changed a bit, which wasn't fixed up by 2to3). Python 3 also really wants your regular expressions to be in r"..." strings, because otherwise it will complain about you using regular expression backslash escapes like '\s' that aren't string backslash escapes.
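To illustrate the sort of changes involved (the function names and the choice of SHA-256 are assumptions for this sketch, not DWiki's actual code):

```python
import hashlib
import hmac
import re


def page_digest(text: str) -> str:
    # Hash functions only accept bytes, so the str is encoded explicitly.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sign(secret: str, message: str) -> str:
    # hmac.new() wants bytes for both the key and the message, plus an
    # explicit digestmod.
    mac = hmac.new(secret.encode("utf-8"), message.encode("utf-8"),
                   hashlib.sha256)
    return mac.hexdigest()


# A raw string passes '\s' and '\w' through to the regular expression
# engine instead of having Python treat them as string escapes.
WORD = re.compile(r"\s*(\w+)")
```

Picking the encoding at each of these points is exactly the kind of decision that Python 2 code never had to make explicitly.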

I don't have a DWiki test suite, but long ago I built scripts that would crawl and collect all real pages from an old and a new version of DWiki. I originally used these to check for changes in how pages got rendered when I changed the wikitext processing code (often I wanted no changes), but this time around I was able to use them to verify that the Python 3 DWiki could at least render all existing pages into essentially the same thing (there were \r\n sequences that turned into \n instead of being passed through, but that's probably a good change). But that still left things like writing comments, and also the two sets of code involved in how DWiki runs in production instead of in testing.

I probably wouldn't have tried to do this if I hadn't had a relatively substantial block of free time. It took me more or less all day yesterday to get up to the current production state, with a lot of back and forth, experimentation, and tweaking. There was a lot of code and problem context that I might not have retained if I'd had to slice my work up into half hour or hour long chunks of work, and once I started running the Python 3 version as the live server I was relatively committed to fixing any problems that came up on the spot.

(I could have rolled back to the Python 2 version but it would have been at least a bit awkward for various reasons, including a pickle format change.)

The current Python 3 DWiki code still needs additional cleanups, partly to undo unnecessary 2to3 changes like changing 'for ... in dct.keys():' to 'for ... in list(dct.keys()):'. But it's been running stably for, well, not quite 24 hours yet, but at least through a good sample of the typical traffic that Wandering Thoughts gets. Probably there aren't any remaining Unicode conversion issues, although re-reading one of my old entries makes me feel I should audit every use of EnvironmentError when dealing with files.

(2to3 appears to always put list() around things that changed to return iterators or views instead of lists in Python 3. Sometimes this is important, but it's not necessary if the result is only being used in a 'for'.)
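As a small illustration of when the list() matters (both functions are made up for this example): it's needed if the loop changes the dictionary's size, because iterating over the live keys while deleting raises a RuntimeError, and it's unnecessary for a read-only loop.

```python
def drop_empty(dct: dict[str, str]) -> None:
    # list() takes a snapshot of the keys, which makes it safe to delete
    # entries from the dictionary inside the loop.
    for key in list(dct.keys()):
        if not dct[key]:
            del dct[key]


def total_length(dct: dict[str, str]) -> int:
    # A read-only loop can iterate over the dictionary directly.
    total = 0
    for key in dct:
        total += len(dct[key])
    return total
```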

I also want to think about what Unicode error handling to use in various circumstances, although these days I'm inclined to be draconian. For example, if someone tries to write a comment with invalid UTF-8, I probably don't want to backslash escape the invalid bits, so the default 'replace' handling is fine (in my case, this comes from using urllib to decode POST bodies). And currently all of the existing content in Wandering Thoughts is UTF-8 clean, at least as far as I can tell.
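For reference, here is a short sketch of how the main error handlers differ on the same invalid input (the sample bytes and the helper are invented for illustration):

```python
from urllib.parse import parse_qs

bad = b"caf\xff"  # not valid UTF-8

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = bad.decode("utf-8", errors="replace")
# 'backslashreplace' keeps the bad byte visible as a backslash escape.
escaped = bad.decode("utf-8", errors="backslashreplace")


def is_clean_utf8(data: bytes) -> bool:
    """The draconian option: 'strict' raises on any invalid byte."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# urllib.parse.parse_qs() decodes percent-escapes as UTF-8 with
# errors='replace' unless told otherwise.
form = parse_qs("comment=caf%FF")
```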

(The whole Unicode and bytes issue is something where types would be handy (or an option to turn off all of Python 3's implicit conversions), but adding typing to DWiki's 'originated in Python 2' codebase is both a lot of work and also extremely messy, because it uses things in ways that mypy is already unhappy about.)

PS: The Github version of DWiki is now significantly out of date and I'm probably not going to update it for reasons that don't fit in the margins of this entry.

Sidebar: The Python 3 WSGI rules in a nutshell

To summarize PEP 3333 in my own way, HTTP headers are Unicode strings, ie str, but must be limited to iso-8859-1 characters (at least when you write them). The wsgi.input file object produces bytes and your HTTP response body is also bytes. In a CGI environment, you read from sys.stdin.buffer and your WSGI CGI implementation writes to sys.stdout.buffer (including the headers, after encoding to iso-8859-1).

If your WSGI implementation is talking to a network socket, you can and must leave the network socket as a binary file object. In my case, this generally means wsgi.input is created with 'os.fdopen(fd, "rb")'.
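Put together, the output and input sides of a CGI-style implementation might look something like this minimal sketch. The function names are hypothetical and this isn't DWiki's actual code; in real CGI use, 'out' would be sys.stdout.buffer and wsgi.input would wrap sys.stdin.buffer.

```python
import io
from typing import BinaryIO, Iterable


def write_cgi_response(out: BinaryIO, status: str,
                       headers: list[tuple[str, str]],
                       body: Iterable[bytes]) -> None:
    lines = ["Status: " + status]
    lines.extend(f"{name}: {value}" for name, value in headers)
    # Headers are str but must fit in iso-8859-1. Writing them as bytes
    # keeps the explicit \r\n line endings intact.
    out.write(("\r\n".join(lines) + "\r\n\r\n").encode("iso-8859-1"))
    for chunk in body:
        out.write(chunk)  # the response body is already bytes
    out.flush()


def read_post_body(environ: dict) -> bytes:
    # Content-Length counts octets, so wsgi.input has to be a binary file.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return environ["wsgi.input"].read(length)
```

Using io.BytesIO objects in place of the real file objects is enough to try both functions out.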

A code (reformatting) conundrum in Python, and heuristics

By: cks
12 May 2026 at 02:26

Suppose that you are a Python code reformatter, and someone hands you the following snippet of Python code to act on:

if something:
    blah blah blah
    [...]
    final-line
some-statement

[... more statements ...]

Here's the question: should you reindent 'some-statement' so that it's part of the 'if' block?

One answer is that you absolutely should not. The current code is valid Python code, and you are a reformatter for style, not to correct (presumed) errors. Since this is valid code, you should re-flow line wrapping and so on within blocks, but not change what block valid code is part of.

Another answer is that maybe the person writing this code made a mistake. Style wise, it's common to add a blank line between the end of an indented block and following code; the lack of a blank line suggests that a mistake was made. So maybe you should reindent 'some-statement' to where it properly should be, especially if you have a style rule that says that there should be blank lines in this sort of situation.

(Of course, you could also opt to add the blank line that your style guide says should be there and not change what block a statement goes in. But we're in heuristics territory here.)

If you're a heuristic reformatter, your opinion may change depending on what the 'final-line' is. For instance, if the final statement in the if block is 'return', it is pretty obvious that there's not supposed to be anything after it. Anything after it is dead code, which would be a different and less likely error. So you should leave 'some-statement' alone and it's valid style to not have a blank line between the last statement in the 'if' block and 'some-statement'.

Python doesn't have all that many statements that definitively end blocks, but it does have some that are extremely suggestive. Consider this pattern of code:

try:
   something
except SomeError:
   pass
some-statement

The pass statement is a no-op, not something that affects control flow, so it's perfectly valid to have statements after a 'pass'; they will be executed normally. At the same time it's commonly used this way when there's not going to be anything after it, so a heuristic Python code formatter that moved 'some-statement' up into the 'except' would make lots of people unhappy.
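One way to see the difference between the two kinds of change is to compare the parsed structure of the code before and after. This is only an illustrative sketch (the helper name is made up); it uses the ast module to show that changing the indentation width leaves the program the same, while moving the statement into the 'except' produces a different program.

```python
import ast


def same_structure(before: str, after: str) -> bool:
    # ast.dump() leaves out line and column numbers by default, so pure
    # whitespace changes compare equal.
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


ORIGINAL = """\
try:
   something()
except SomeError:
   pass
some_statement()
"""

# Only the indentation width differs from ORIGINAL.
RESPACED = """\
try:
    something()
except SomeError:
    pass
some_statement()
"""

# The trailing statement has been moved into the 'except' block.
MOVED = """\
try:
   something()
except SomeError:
   pass
   some_statement()
"""
```

Here same_structure(ORIGINAL, RESPACED) is true and same_structure(ORIGINAL, MOVED) is false, which is a check that a style-only reformatter could use to verify that it hasn't changed what block a statement is in.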

One such heuristic Python code reformatter is the one used in GNU Emacs in both its conventional python-mode (which 'parses' Python code with regular expressions) and python-ts-mode (which fully parses Python code with a tree-sitter grammar). I'm not sure if these are the same reformatters, but they have the same effects. This particular reformatter heuristic turns out to be the root cause of my Python code reformatting glitches.

(In fact the GNU Emacs Python code reformatting appears to take a 'pass' as a hard end of block and will out-dent anything after it, regardless of what this does to control flow. If you add a 'pass' in the middle of a function and reflow with M-q, GNU Emacs will happily make all statements after it module level ones.)

I experimented with some stand-alone Python code formatters I had sitting around, and none of them behaved this way, which I guess isn't surprising (I tried black, ruff, and yapf). Since the normal pylsp Python LSP server relies on one of them for code reformatting (which one depends on your configuration), this also means LSP-driven code reformatting won't do this. It's possible that only GNU Emacs has this (arguably incorrect) heuristic reformatting.

(I was led to discover all of this by a comment ae left on my earlier entry about Python 2 LSP problems.)

PS: There are other heuristic decisions you can make depending on what 'some-statement' is and where it currently is in the overall block. For example, if 'some-statement' is the last statement in a function and is a 'return', then it's almost certainly correct in its current place. But these heuristics multiply endlessly.

Using a Python 3 LSP server with Python 2 code works (more or less)

By: cks
9 May 2026 at 21:57

I still have a certain amount of Python 2 code, both for work and for personal projects (for example, DWiki, the wiki software behind this blog; it will be Python 3 someday, but not so far). For a long time, I've preferred to do any significant editing of Python code in GNU Emacs, my normal choice for a superintelligent editor, and for a while, I've used LSP based Python editing. There's a very old LSP server for Python 2, but all of the Python LSP servers you actually want to use are specifically for Python 3, and recently I hit a problem that made me turn off the Python 2 LSP server. Since then I've been editing my Python 2 code (cautiously) with pylsp (my normal Python 3 LSP server) and recently, a little bit with 'ruff'. Somewhat to my surprise, this has more or less worked.

My minimum standard for more or less working is that the LSP doesn't malfunction obviously or deluge me with errors and other diagnostics that aren't applicable because it's applying Python 3 rules to Python 2 code. It's even better if the LSP can actually identify real problems, such as misspelled variable names or function names, and recently I've had pylsp do that for some of my code (code that was never tested or used, or I'd have found the problems much earlier; possibly this is a sign that I should have deleted the code instead of fixing it).

(The LSP server does obviously complain about Python 2 code that's using 'print' as a statement, since it's invalid Python 3 syntax, but this is easily fixed even in Python 2 code, and I want to fix it in anything I intend to maintain.)

Much of my Python 2 code mixes spaces and tabs for indentation, and I expected this to upset the Python 3 LSP servers. To my surprise, it hasn't for either pylsp or ruff. Although I can't tell for sure, I think that they're even still correctly interpreting the result (in terms of indentation levels and so on), or at least they're not complaining about syntax errors or other things I'd expect them to if they had the wrong idea of the code's structure.

(Parts of GNU Emacs' python-mode do seem to get confused and (re)indent stuff incorrectly in my old school Python 2 code with 8 space indents and real tabs, which is somewhat surprising. But I guess very few people are editing Python 2 code with tabs in GNU Emacs these days.)

I've done some testing, and as far as I can tell LSP features like 'go to definition' and 'find references' more or less work as I'd expect them to in pylsp. In my (GNU Emacs) environment I think pylsp is limited to cross references within the set of Python files that the editor has loaded and told it about, but within that it's handy.

All of this makes it clearly worthwhile to me to keep LSP stuff enabled for my Python 2 code and to continue to use a superintelligent editor for editing it (although I still make quick changes to Python 2 code with vim). Which is good, because it's also easier and sometimes I'm lazy.

(Work still has Python 2 programs because those programs are load bearing and don't particularly need to change, at least most of the time. Could we port them to Python 3? Sure. Could we be sure they didn't have lurking Unicode issues or other problems? No, not necessarily. I did one Python 2 to Python 3 conversion for a load bearing set of programs, our suite of ZFS management tools (including our spares management system), and it was somewhat nerve wracking.)

PS: In my current GNU Emacs environment using Eglot, I don't think the LSP server is called when I hit TAB or M-q (based on the server events reported by eglot-events-buffer), so it's not going to be involved in any rerun of my problem with lsp-mode and the Python 2 LSP server. The LSP server will reindent and reflow the entire file (Emacs buffer), but I have to very specifically ask it to do that. If I have Eglot ask pylsp to reformat a function (selected as a region), pylsp sends back a null result, which I believe means 'no changes', so perhaps pylsp is throwing up its hands at my mixed tabs and spaces indentation.

Learning my lesson that Python virtual environments aren't always movable

By: cks
1 May 2026 at 02:53

I've said before that Python virtual environments can be moved around. Well, technically that entry said 'usually', but in practice I don't remember the limitations I mentioned in that entry. And that is how a while back I renamed the top level directory of a Django virtual environment that I'd also installed the Python LSP server into, and then yesterday I was rather puzzled when I tried some Django development and GNU Emacs gave me a weird error and didn't start my LSP environment.

(Fortunately what I was really doing was seeing how my new Corfu based lsp-mode completion would behave with some Python code.)

The issue is simple: every (Python) program installed into your venv's bin/ directory starts with '#!/path/to/venv/bin/python3', including programs like pylsp, the Python LSP server. They have to do this because they need to run the venv's Python, but that means that they're locked to the original filesystem location of the venv. If you move the venv, either there will be no 'python3' at that path for them to run or worse, you'll be pointing into and using a different venv. Programs outside the venv aren't normally affected, because they're directly using the venv's bin/python3 and the Python interpreter makes that work.

(In my case in GNU Emacs, there was no python3 at the path that pylsp was pointing to, so it failed to start with a weird system message. With no LSP server, Emacs' lsp-mode threw up its hands and gave up.)

Incidentally, this includes the venv's 'pip'. If its '#!' line points to what is now another venv's Python, I believe 'pip install <whatever>' will wind up installing <whatever> into that other venv, not the one you think you're in. This could be anywhere from confusing to somewhat disastrous, depending on what the alternate venv is. Venv name reuse may seem unlikely, but it depends on what your venv naming is like; a worst case option would be something like 'dev-venv' and 'prod-venv', where you remove the old 'prod-venv' venv and rename the 'dev-venv' top level directory to 'prod-venv' (then create a new 'dev-venv' sooner or later).

So far I haven't stubbed my toe on this in anything critical, but it's definitely something I need to remember and it may change how I set up and (don't) move venvs. If I'm going to move venvs very much, it'd be tempting to write something that fixed up all of the '#!' lines in a venv's 'bin/' directory.
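As a rough sketch of the checking half of such a tool, the following looks at each script in a venv's 'bin/' directory and reports the ones whose '#!' line names a Python that isn't in that same directory. The function name stale_shebangs() is made up for illustration, and the code assumes a Unix style venv layout with scripts in 'bin/'; it only reports problems and doesn't rewrite anything.

```python
import os

def stale_shebangs(venv_dir):
    """Return (script, interpreter) pairs for scripts in venv_dir/bin whose
    '#!' line names a Python interpreter outside of venv_dir/bin.
    (Hypothetical helper, not part of any existing tool.)"""
    bin_dir = os.path.join(os.path.abspath(venv_dir), "bin")
    stale = []
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        # Skip symlinks (such as bin/python3 itself) and non-files.
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            first = f.readline(4096)
        if not first.startswith(b"#!"):
            continue
        parts = first[2:].split()
        if not parts:
            continue
        interp = parts[0].decode("utf-8", "replace")
        # Only look at Python interpreters; '#!/bin/sh' and
        # '#!/usr/bin/env ...' scripts don't depend on the venv's path.
        if not os.path.basename(interp).startswith(("python", "pypy")):
            continue
        if os.path.dirname(interp) != bin_dir:
            stale.append((name, interp))
    return stale
```

Running this over a venv after renaming its directory should list pylsp, pip, and friends along with the old path they still point to.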

(There may already be tools out there that do this, but I'd have to find one of them and Internet search is increasingly bad.)

I should use argument groups in Python's argparse module more than I do

By: cks
2 April 2026 at 02:43

For reasons well outside the scope of this entry, the other day I looked at the --help output from one of my old Python programs. This particular program has a lot of options, but when I'd written it, I had used argparse argument groups to break up the large list of options into logical groups, starting with the most important and running down to the 'you should probably ignore these' ones. The result was far more readable than it would have been without the grouping.

(I want to call these 'option groups', because that's what I use them for.)

I've regularly used mutual exclusion groups in my recent Python programs, but for some reason I've fallen so much out of the habit of using argparse groups to break up walls of options that I'd forgotten they even existed until I was reminded by my own program's --help output. Now that I've been reminded, there are probably some programs that I should go back to and add some groups to.

(Most or all of my programs with a lot of options have a structure to them; it's not just a kitchen sink of a lot of things. Even if there is no real structure I can at least separate things into frequent, less frequent, and obscure options.)

Although you can't put either sort of group inside a mutual exclusion group, the argparse documentation is explicit that you can put a mutual exclusion group inside a regular argument group (a detail that I hadn't remembered until I reread my entry on this). Now that I look, one reason to do this is so that you can give the block of mutually exclusive options a title and description that actually tells people that they're mutually exclusive.
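A minimal sketch of that nesting, with a made-up program name and made-up options ('report', '--json', and '--csv' are all purely illustrative): the regular argument group supplies the title and description, and the mutual exclusion group inside it does the enforcement.

```python
import argparse

parser = argparse.ArgumentParser(prog="report")
parser.add_argument("-v", "--verbose", action="store_true", help="say more")

# A regular argument group provides the title and description ...
fmt = parser.add_argument_group(
    "Output format", "These options are mutually exclusive; pick at most one.")
# ... and the mutual exclusion group created inside it enforces the rule.
excl = fmt.add_mutually_exclusive_group()
excl.add_argument("--json", action="store_true", help="write JSON")
excl.add_argument("--csv", action="store_true", help="write CSV")

help_text = parser.format_help()
```

In the --help output, '--json' and '--csv' show up under an 'Output format' heading with the description, and giving both on the command line is still rejected by argparse.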

(Maybe it would be nicer if a mutual exclusion group could have an optional title and description, but that's not the API we have.)

As the argparse documentation says, anything not in an argument group is put in the usual sections in your --help. Another way to put this is that the moment you put something in an argument group, it drops down to the bottom of your remaining regular --help output (with a blank line between the regular help and the argument groups). Then each argument group is separated from the next with a blank line, whether or not you gave them a title or a description.

My view is that this can make argument groups a relatively all or nothing thing. If you just want to put a blank line and a title to group your already properly ordered options into digestible chunks, the only ones you can leave out of a group are the first options. After you add the first group, everything afterward has to also be in a group or it will get reordered on you. Fortunately this is easy to do in the sort of code I tend to write to set up argparse stuff, but I'm going to have to remember it when I start adding argument groups to my programs.
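Here is a small sketch of the reordering, using invented option names. The third option is defined after the argument group but left ungrouped, so as far as I can tell it prints in the regular options section, above the group.

```python
import argparse

parser = argparse.ArgumentParser(prog="report")
parser.add_argument("--first", help="defined before any group")

common = parser.add_argument_group("Common options")
common.add_argument("--second", help="inside the group")

# Defined last and left ungrouped: it is still printed in the regular
# options section, ahead of every argument group.
parser.add_argument("--third", help="defined after the group")

help_text = parser.format_help()
```

Printing help_text shows the '--third' help line before the 'Common options' heading, even though '--third' was defined last.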

(Argparse --help prints options in the order you defined them, so it's conventional to put the most important options first and the least important ones last.)

Going from an IPv4 address to an ASN in Python 2 with Unix brute force

By: cks
25 March 2026 at 02:45

For reasons, I've reached the point where I would like to be able to map IPv4 addresses into the organizations responsible for them, which is to say their Autonomous System Number (ASN), for use in DWiki, the blog engine of Wandering Thoughts. So today on the Fediverse I mused:

Current status: wondering if I can design an on-disk (read only) data structure of some sort that would allow a Python 2 program to efficiently map an IP address to an ASN. There are good in-memory data structures for this but you have to load the whole thing into memory and my Python 2 program runs as a CGI so no, not even with pickle.

(Since this is Python 2, about all I have access to is gdbm or rolling my own direct structure.)

Mapping IP addresses to ASNs comes up a lot in routing Internet traffic, so there are good in-memory data structures that are designed to let you efficiently answer these questions once you have everything loaded. But I don't think anyone really worries about on-disk versions of this information, and on-disk is the case that I care about, although I only care about some ASNs (a detail I forgot to put in the Fediverse post).

Then I had a realization:

If I'm willing to do this by /24 (and I am) and represent the ASNs by 16-bit ints, I guess you can do this with a 32 Mbyte sparse file of two-byte blocks. Seek to a 16-byte address determined by the first three octets of the IP, read two bytes, if they're zero there's no ASN mapping we care about, otherwise they're the ASN in some byte order I'd determine.

If I don't care about the specific ASN, just a class of ASNs of interest of which there are at most 255, it's only 16 Mbytes.

(And if all I care about is a yes or no answer, I can represent each /24 by a bit, so the storage required drops even more, to only 2 Mbytes.)

This Fediverse post has a mistake. I thought ASNs were 16-bit numbers, but we've gone well beyond that by now. So I would want to use the one-byte 'class of ASN' approach, with ASNs I don't care about mapping to a class of zero. Alternately I could expand to storing three bytes for every /24, or four bytes to stay aligned with filesystem blocks.

That storage requirement is 'at most' because this will be a Unix sparse file, where filesystem blocks that aren't written to aren't stored on disk; when read, the data in them is all zero. The lookup is efficient, at least in terms of system calls; I'd open the file, lseek() to the position, and read two bytes (causing the system to read a filesystem block, however big that is). Python 2 doesn't have access to pread() or we could do it in one system call.
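A minimal sketch of the one-byte 'class of ASN' version, to make the arithmetic concrete: there are 2^24 /24s, so one byte each is 16 Mbytes, two bytes each is 32 Mbytes, and one bit each is 2 Mbytes. The function names and the record layout here are my illustration, not the code of any actual implementation, and the sketch assumes a filesystem where truncate() produces a sparse file.

```python
import os

RECORD_SIZE = 1          # one byte of 'ASN class' per /24; 0 means 'don't care'
NUM_SLASH24 = 1 << 24    # number of /24s in the IPv4 address space

def slash24_offset(ip):
    """File offset of the record for the /24 containing dotted-quad ip."""
    a, b, c, _ = (int(x) for x in ip.split("."))
    return ((a << 16) | (b << 8) | c) * RECORD_SIZE

def build_map(path, entries):
    """Write (ip, asn_class) pairs into a file that is mostly holes."""
    with open(path, "wb") as f:
        # Extend to full size without writing data; unwritten blocks
        # read back as zero bytes.
        f.truncate(NUM_SLASH24 * RECORD_SIZE)
        for ip, asn_class in entries:
            f.seek(slash24_offset(ip))
            f.write(bytes(bytearray([asn_class])))

def lookup(path, ip):
    """Open, seek, read one record; returns 0 if nothing was recorded."""
    with open(path, "rb") as f:
        f.seek(slash24_offset(ip))
        data = f.read(RECORD_SIZE)
    return bytearray(data)[0] if data else 0
```

A lookup for any address in a /24 that was recorded returns its class, and anything else (including the holes in the file) comes back as 0.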

Within the OS this should be reasonably efficient, because if things are active much of the important bits of the mapping file will be cached into memory and won't have to be read from disk. 32 Mbytes is nothing these days, at least in terms of active file cache, and much of the file will be sparse anyway. The OS obviously has reasonably efficient random access to the filesystem blocks of the file, whether in memory or on disk.

This is a fairly brute force approach that's only viable if you're typically making a single query in your process before you finish. It also feels like something that is a good fit for Unix because of sparse files, although 16 Mbytes isn't that big these days even for a non-sparse file.

Realizing the brute force approach feels quite liberating. I've been turning this problem over in my mind for a while but each time I thought of complicated data structures and complicated approaches and it was clear to me that I'd never implement them. This way is simple enough that I could actually do it and it's not too impractical.

PS: I don't know if I'll actually build this, but every time a horde of crawlers descends on Wandering Thoughts from a cloud provider that has a cloud of separate /24s and /23s all over the place, my motivation is going to increase. If I could easily block all netblocks of certain hosting providers all at once, I definitely would.

(To get the ASN data there's pyasn (also). Conveniently it has a simple on-disk format that can be post-processed to go from a set of CIDRs that map to ASNs to a data file that maps from /24s to ASN classes for ASNs (and classes) that I care about.)

Update: After writing most of this entry I got enthused and wrote a stand-alone preliminary implementation (initially storing full ASNs in four-byte records), which can both create the data file and query it. It was surprisingly straightforward and not very much code, which is probably what I should have expected since the core approach is so simple. With four-byte records, a full data file of all recent routes from pyasn is about 53 Mbytes and the data file can be created in less than two minutes, which is pretty good given that the code writes records for about 16.5 million /24s.

(The whole thing even appears to work, although I haven't strongly tested it.)

One problem with (Python) docstrings is that they're local

By: cks
19 March 2026 at 02:40

When I wrote about documenting my Django forms, I said that I knew I didn't want to put my documentation in docstrings, because I'd written some in the past and then not read it this time around. One of the reasons for that is that Python docstrings have to be attached to functions, or more generally, Python docstrings have to be scattered through your code. The corollary to this is that to find relevant docstrings you have to read through your code and then remember which bits of it are relevant to what you're wondering about.

When your docstring is specifically about the function you already know you want to look at, this is fine. Docstrings work perfectly well for local knowledge, for 'what is this function about' summaries that you want to read before you delve into the function. I feel they work rather less well for finding what function you want to look at (ideally you want some sort of skimmable index for that); if you have to read docstrings to find a function, you're going to be paging through a lot of your code until you hit the right docstring.

This is also why I feel docstrings are a bad fit for documenting my Django forms. Even if I attach them to the Python functions that handle each particular form, the resulting documentation is going to be mingled with my code and spread all through it. Not only is there no overview, but I'd have to skip around my code as I read about how one form interacts with another; there's no single place where I can read about the flow of forms, one leading to another.

(This is the case even if all of the form handling functions are in one spot with nothing between them, because the docstrings will be split up by the code itself and the comments in the code.)

Another issue is that sensible docstrings can only be so big, because they separate the function's 'def' statement from its actual code. You don't want those two too far apart, which pushes docstrings toward being relatively concise. My feeling is that if I have a lot to say about what the function is used for or how it relates to other things, I can't really put it in a docstring. I usually put it in a comment in front of the function (which means that some of my Python code has a mixture of comments and docstrings). The less a function can be described purely by itself (and concisely), the more its docstring is going to sprawl and the more awkward that gets.

(Docstrings on functions are also generally seen as what I could call external documentation, written for people who might want to call the function and understand how it relates to other functions they might also use. Comments are the usual form of internal documentation that you want at hand while reading the function's code.)

It's conventional to say that docstrings are documentation for what they're on. I think it's better to say that docstrings are summaries. Some things can be described purely through summaries (with additional context that the programmer is assumed to have), but not everything can be.

(Comments before a function are also local to some degree, but they intrude less on the function's code since they don't put themselves between 'def' and the rest of things.)

You (I) should document the forms of your Django web application

By: cks
13 March 2026 at 03:18

We have a long-standing Django web application to handle (Unix) account requests. Since these are requests, there is some state involved, so for a long time a request could be pending, approved, or rejected, with the extra complexity that an approved request might be incomplete and waiting on the person to pick their login. Recently I added being able to put a request into a new state, 'held', in order to deal with some local complexities where we might have a request that we didn't want to delete but also didn't want to go through to create an account.

(For instance, it's sometimes not clear if new incoming graduate students who've had to defer their arrival are going to turn up later or wind up not coming at all. So now we can put their requests on hold.)

When I initially wrote the new code, I thought that this new 'held' status was relatively weak, and in particular that professors (who approve accounts) could easily take an account request out of 'held' status and approve it. At the time I decided that this was probably a feature, since a professor might know that one of their graduate students was about to turn up after all and this way they didn't have to get us to un-hold the account request. Then the other day we sort of wanted to hold an account request even against the professor involved approving it, and because I knew that the 'held' status was weak this way, I didn't bother trying.

Well, it turns out I was wrong. Because I had forgotten how our forms worked, I hadn't realized that my new 'held' status was less 'held' and more 'frozen', and I only learned better today because I took a stab at creating a real 'frozen' status. In the current state, while it's possible for professors to deliberately un-hold a request, it takes a certain amount of work to find the one obscure place it's possible and you can't do it by accident (and it would be easy to close that possibility off if we decided to). You definitely can't accidentally approve a request that's currently held without realizing it.

(So my admittedly modest amount of work to add a 'frozen' status was sort of wasted, although it did lead to greater understanding in the end.)

Past me, immersed in the application, presumably found all of the rules about who could see what form and what they showed to be obvious (at least in context). Present me is a long distance from past me and did not remember all of those things. Brief documentation on each form would have been really quite handy, and if I'm smart I'll spend some time this time around to write some.

I'm not sure where I'll put any new forms documentation. Probably not in our views.py, which is already big enough. I could put it in urls.py, or I could write a separate README.forms file that doesn't try to embed this in code. And I know that I don't want to put it in Python docstrings, because I wrote some things in Python docstrings on the existing forms functions and then didn't read them. Even if I had read them, the existing docstrings don't entirely cover the sort of information I now know I want to know.

(I think there's a good reason for my not reading my own docstrings, but that's for another entry.)

Sometimes, non-general solutions are the right answer

By: cks
5 March 2026 at 03:33

I have a Python program that calculates and prints various pieces of Linux memory information on a per-cgroup basis. In the beginning, its life was simple; cgroups had a total memory use that was split between 'user' and '(filesystem) cache', so the program only needed to display either one field or a primary field plus a secondary field. Then I discovered that there was additional important (ie, large) kernel memory use in cgroups and added the ability to report it as an additional option for the secondary field. However, this wasn't really ideal, because now I had a three-way split and I might want to see all three things at once.

A while back I wrote up my realization about flexible string formatting with named arguments. This sparked all sorts of thoughts about writing a general solution for my program that could show any number of fields. Recently I took a stab at implementing this and rapidly ran into problems figuring out how I wanted to do it. I had multiple things that could be calculated and presented, I had to print not just the values but also a header with the right field names, I'd need to think about how I structured argparse argument groups in light of argparse not supporting nested groups, and so on. At a minimum this wasn't going to be a quick change; I was looking at significantly rewriting how the program printed its output.

The other day, I had an obvious realization: while it would be nice to have a fully general solution that could print any number of additional fields, which would meet my needs now and in the future, all that I needed right now was an additional three-field version with the extra fields hard-coded and the whole thing selected through a new command line argument. And this command line argument could drop right into the existing argparse exclusive group for choosing the second field, even though this feels inelegant.

(The fields I want to show are added with '-c' and '-k' respectively in the two field display, so the morally correct way to select both at once would be '-ck', but currently they're exclusive options, which is enforced by argparse. So I added a third option, literally '-b' for 'both'.)

Actually implementing this hard-coded version was a bit annoying for structural reasons, but I put the whole thing together in not very long; certainly it was much faster than a careful redesign and rewrite (in an output pattern I haven't used before, no less). It's not necessarily the right answer for the long term, but it's definitely the right answer for now (and I'm glad I talked myself into doing it).

(I'm definitely tempted to go back and restructure the whole output reporting to be general. But now there's no rush to it; it's not blocking a feature I want, it's a cleanup.)

Parsing hours and minutes into a useful time in basic Python

By: cks
20 February 2026 at 03:48

Suppose, not hypothetically, that you have a program that optionally takes a time in the past to, for example, report on things as of that time instead of as of right now. You would like to allow people to specify this time as just 'HH:MM', with the meaning being that time today (letting people do 'program --at 08:30'). This is convenient for people using your program but irritatingly hard today with the Python standard library.

(In the following code examples, I need a Unix timestamp and we're working in local time, so I wind up calling time.mktime(). We're working in local time because that's what is useful for us.)

As I discovered or noticed a long time ago, the time module is a thin shim over the C library time functions and inherits their behavior. One of these behaviors is that if you ask time.strptime() to parse a time format of '%H:%M', you get back a struct_time object that is in 1900:

>>> import time
>>> time.strptime("08:10", "%H:%M")
time.struct_time(tm_year=1900, tm_mon=1, tm_mday=1, tm_hour=8, tm_min=10, tm_sec=0, tm_wday=0, tm_yday=1, tm_isdst=-1)

There are two solutions I can think of, the straightforward brute force approach that uses only the time module and a more theoretically correct version using datetime, which comes in two variations depending on whether you have Python 3.14 or not.

The brute force solution is to re-parse a version of the time string with the date added. Suppose that you have a series of time formats that people can give you, including '%H:%M', and you try them all until one works, with code like this:

 for fmt in tfmts:
     try:
         r = time.strptime(tstr, fmt)
         # Fix up %H:%M and %H%M
         if r.tm_year == 1900:
             dt = time.strftime("%Y-%m-%d ", time.localtime(time.time()))
             # replace original r with the revised one.
             r = time.strptime(dt + tstr, "%Y-%m-%d "+fmt)
         return time.mktime(r)
     except ValueError:
         continue

I think the correct, elegant way using only the standard library is to use datetime to combine today's date and the parsed time into a correct datetime object, which can then be turned into a struct_time and passed to time.mktime. Before Python 3.14, I believe this is:

         r = time.strptime(tstr, fmt)
         if r.tm_year == 1900:
             tm = datetime.time(hour=r.tm_hour, minute=r.tm_min)
             today = datetime.date.today()
             dt = datetime.datetime.combine(today, tm)
             r = dt.timetuple()
         return time.mktime(r)

There are variant approaches to the basic transformation I'm doing here but I think this is the most correct one.

If you have Python 3.14 or later, you have datetime.time.strptime() and I think you can do the slightly clearer:

[...]
             tm = datetime.time.strptime(tstr, fmt)
             today = datetime.date.today()
             dt = datetime.datetime.combine(today, tm)
             r = dt.timetuple()
[...]

If you can work with datetime.datetime objects, you can skip converting back to a time.struct_time object. In my case, the eventual result I need is a Unix timestamp so I have no choice.
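As a sketch of staying in datetime the whole way (Python 3 only): a naive datetime object has a timestamp() method that treats the object as local time, so it can produce a Unix timestamp without going through time.mktime(). The function name hhmm_to_timestamp() is made up for this example, and it handles only a single format, not the try-each-format loop from earlier.

```python
import datetime
import time

def hhmm_to_timestamp(tstr, fmt="%H:%M"):
    """Unix timestamp for the local time tstr today (illustrative helper)."""
    # datetime.datetime.strptime() also defaults the date to 1900-01-01,
    # so keep only the time-of-day part of what it parsed.
    tm = datetime.datetime.strptime(tstr, fmt).time()
    dt = datetime.datetime.combine(datetime.date.today(), tm)
    # For a naive datetime, timestamp() assumes local time.
    return dt.timestamp()

ts = hhmm_to_timestamp("08:30")
```

Feeding ts back through time.localtime() should give 08:30 on today's date.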

You can wrap this up into a general function:

import datetime
import time

def strptime_today(tstr, fmt):
   r = time.strptime(tstr, fmt)
   # A year of 1900 means the format had no year in it.
   if r.tm_year != 1900:
      return r
   tm = datetime.time(hour=r.tm_hour, minute=r.tm_min, second=r.tm_sec)
   today = datetime.date.today()
   dt = datetime.datetime.combine(today, tm)
   return dt.timetuple()

This version of time.strptime() will return the time today if given a time format with only hours, minutes, and possibly seconds. Well, technically it will do this if given any format without the year, but dealing with all of the possible missing fields is left as an exercise for the energetic, partly because there's no (relatively) reliable signal for missing months and days the way there is for years. For many programs, a year of 1900 is not even close to being valid and is some sort of mistake at best, but January 1st is a perfectly ordinary day of the year to care about.

(Now that I've written this function I may update my code to use it, instead of the brute force time package only version.)

A fun Python puzzle with circular imports

By: cks
10 February 2026 at 04:12

Baptiste Mispelon asked an interesting Python quiz (via, via @glyph):

Can someone explain this #Python import behavior?
I'm in a directory with 3 files:

a.py contains `A = 1; from b import *`
b.py contains `from a import *; A += 1`
c.py contains `from a import A; print(A)`

Can you guess and explain what happens when you run `python c.py`?

I encourage you to guess which of the options in the original post is the actual behavior before you read the rest of this entry.

There are two things going on here. The first thing is what actually happens when you do 'from module import ...'. The short version is that this copies the current bindings of names from one module to another. So when module b does 'from a import *', it copies the binding of a.A to b.A and then the += changes that binding. The behavior would be the same if we used 'from a import A' and 'from b import A' in the code, and if we did we could describe what each did in isolation as starting with 'A = 1' (in a), then 'A = a.A; A += 1' (in b), and then 'A = b.A' (back in a) successively (and then in c, 'A = a.A').

The second thing going on is that you can import incomplete modules (this is true in both Python 2 and Python 3, which return the same results here). To see how this works we need to combine the description of 'import' and 'from' and the approximation of what happens during loading a module, although neither is completely precise. To summarize, when a module is being loaded, the first thing that happens is that a module namespace is created and is added to sys.modules; then the code of the module is executed in that namespace. When Python encounters a 'from', if there is an entry for the module in sys.modules, Python immediately imports things from it; it implicitly assumes that the module is already fully loaded.

At first I was surprised by this behavior, but the more I think about it the more it seems a reasonable choice. It avoids having to explicitly detect circular imports and it makes circular imports work in the simple case (where you do 'import b' and then don't use anything from b until all imports are finished and the program is running). It has the cost that if you have circular name uses you get an unhelpful error message about 'cannot import name' (or 'NameError: name ... is not defined' if you use 'from module import *'):

$ cat a.py
from b import B; A = 10 + B
$ cat b.py
from a import A; B = 20 + A
$ cat c.py
from a import A; print(A)
$ python c.py
[...]
ImportError: cannot import name 'A' from 'a' [...]

(Python 3.13 does print a nice stack trace that points to the whole set of 'from ...' statements.)

Given all of this, here is what I believe is the sequence of execution in Baptiste Mispelon's example:

  1. c.py does 'from a import A', which initiates a load of the 'a' module.
  2. an 'a' module is created and added to sys.modules
  3. that module begins executing the code from a.py, which creates an 'a.A' name (bound to 1) and then does 'from b import *'.
  4. a 'b' module is created and added to sys.modules.
  5. that module begins executing the code from b.py. This code starts by doing 'from a import *', which finds that 'sys.modules["a"]' exists and copies the a.A name binding, creating b.A (bound to 1).
  6. b.py does 'A += 1', which mutates the b.A binding (but not the separate a.A binding) to be '2'.
  7. b.py finishes its code, returning control to the code from a.py, which is still part way through 'from b import *'. This import copies all names (and their bindings) from sys.modules["b"] into the 'a' module, which means the b.A binding (to 2) overwrites the old a.A binding (to 1).
  8. a.py finishes and returns control to c.py, where 'from a import A' can now complete by copying the a.A name and its binding into 'c', making it the equivalent of 'import a; A = a.A; del a'.
  9. c.py prints the value of this, which is 2.

At the end of things, there is all of c.A, a.A, and b.A, and they are bindings to the same object. The order of binding was 'b.A = 2; a.A = b.A; c.A = a.A'.
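If you want to check the sequence above for yourself, here is a small sketch that recreates the three files from the quiz in a temporary directory and runs c.py with the current Python interpreter. The run_quiz() wrapper is just scaffolding I've invented; only the file contents come from the quiz.

```python
import os
import subprocess
import sys
import tempfile

# The three files exactly as given in the quiz.
FILES = {
    "a.py": "A = 1; from b import *\n",
    "b.py": "from a import *; A += 1\n",
    "c.py": "from a import A; print(A)\n",
}

def run_quiz():
    """Write the files to a scratch directory and run 'python c.py' there."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, src in FILES.items():
            with open(os.path.join(tmp, name), "w") as f:
                f.write(src)
        proc = subprocess.run([sys.executable, "c.py"], cwd=tmp,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              universal_newlines=True)
        return proc.stdout.strip()

result = run_quiz()
```

Here result should be the string '2', matching step 9.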

(There's also a bonus question, where I have untested answers.)

Sidebar: A related circular import puzzle and the answer

Let's take a slightly different version of my error message example above, that simplifies things by leaving out c.py:

$ cat a.py
from b import B; A = 10 + B
$ cat b.py
from a import A; B = 20 + A
$ python a.py
[...]
ImportError: cannot import name 'B' from 'b' [...]

When I first did this I was quite puzzled until the penny dropped. What's happening is that running 'python a.py' isn't creating an 'a' module but instead a __main__ module, so b.py doesn't find a sys.modules["a"] when it starts and instead creates one and starts loading it. That second version of a.py, now in an "a" module, is what tries to refer to b.B and finds it not there (yet).
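The key fact here can be checked directly. This sketch writes a one-line a.py (not the a.py from the puzzle; its contents are my own) that reports its module name and whether there is an "a" entry in sys.modules, then runs it as a script.

```python
import os
import subprocess
import sys
import tempfile

# A stand-in a.py that only reports how it is being run.
SRC = 'import sys; print(__name__, "a" in sys.modules)\n'

def run_as_script():
    """Run the stand-in a.py directly, the way 'python a.py' would."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.py"), "w") as f:
            f.write(SRC)
        proc = subprocess.run([sys.executable, "a.py"], cwd=tmp,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              universal_newlines=True)
        return proc.stdout.strip()

result = run_as_script()
```

The script reports that it is running as __main__ with no "a" module loaded, which is why a later 'from a import A' in b.py starts a second, fresh load of a.py.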

Why I'm ignoring pretty much all new Python packaging tools

By: cks
30 January 2026 at 03:59

One of the things going on right now is that Python is doing a Python developer survey. On the Fediverse, I follow a number of people who do Python stuff, and they've been posting about various aspects of the survey, including a section on what tools people use for what. This gave me an interesting although very brief look into a world that I'm deliberately ignoring, and I'm doing that because I feel my needs are very simple and are well met by basic, essentially universal tools that I already know and have.

Although I do some small amount of Python programming, I'm not a Python developer; you could call me a consumer of Python things, both programs and packages. The thing I do most is use programs written in Python that aren't single-file, dependency free things, almost always for my own personal use (for example, asncounter and the Python language server). The tool I use for almost all of these is pipx, which I feel handles pretty much everything I could ask for and comes pre-packaged in most Linuxes. Admittedly I've written some tools to make my life nicer.

(One important thing pipx does is install each program separately. This allows me to remove one cleanly and also to use PyPy or CPython as I prefer on a program by program basis.)

For programs that we want to use as part of our operations (for example), the modern, convenient approach is to make a venv and then install the program into it with pip. Pip is functionally universal and the resulting venvs effectively function as self contained artifacts that can be moved or put anywhere (provided that we stick to the same Ubuntu LTS version). So far we haven't tried to upgrade these in place; if a new version of the program comes out, we build a new venv and swap which one is used.

(It's possible that package dependencies of the program could be updated even if it hasn't released a new version, but we treat these built venvs as if they were compiled binaries; once produced, they're not modified.)

Finally, our Django based web application now uses a Django setup where Django is installed into a venv and then the production tree of our application lives outside that venv (previously we didn't use venvs at all but that stopped working). Our application isn't versioned or built into a Python artifact; it's a VCS tree and is managed through VCS operations. The Django venv is created separately, and I use pip for that because again pip is universal and familiar. This is a crude and brute force approach but it's also ensured that I haven't had to care about the Python packaging ecosystem (and how to make Python packages) for the past fifteen years. At the moment we use only standard Django without any third party packages that we'd also have to add to the venv and manage, and I expect that we're going to stay that way. A third party package would have to be very attractive (or become extremely necessary) in order for us to take it on and complicate life.

I'm broadly aware that there are a bunch of new Python package management and handling tools that go well beyond pip and pipx in both performance and features. My feeling so far is that I don't need anything more than I have and I don't do the sort of regular Python development where the extra features the newer tools have would make a meaningful difference. And to be honest, I'm wary of some or all of these turning out to be a flavour of the month. My mostly outside impression is that Python packaging and package management has had a great deal of churn over the years, and from seeing the Go ecosystem go through similar things from closer up I know that being stuck with a now abandoned tool is not particularly fun. Pip and pipx aren't the modern hot thing but they're also very unlikely to go away.

Python 2, GNU Emacs, and my LSP environment combine to shoot me in the foot

By: cks
26 December 2025 at 21:50

So I had a thing happen:

This is my angry face that GNU Emacs appears to have re-indented my entire Python file to a different standard without me noticing and I didn't catch it in time. And also it appears impossible in GNU Emacs to FIX this. I do not want four space no tabs, this is historical code that all files should be eight spaces with tabs (yes, Python 2).

That 'Python 2' bit turns out to be load-bearing. The specific problem turned out to be that if I hit TAB with a region selected or M-q when GNU Emacs point was outside a comment, the entire file was reformatted to modern 4-space indents (and long expressions got linewrapped, and some other formatting changes). I'm not sure which happened to trigger the initial reformatting that I didn't notice in time, but I suspect I was trying to use M-q to reflow a file level comment block and had my cursor (point) in the wrong spot. My TAB and M-q bindings are standard, and when I investigated deeply enough I discovered that this was LSP related.

The first thing I learned is that just 'turning off' LSP mode with 'lsp-mode' (or 'M-: (lsp-mode -1)') isn't enough to actually turn off LSP based indentation handling. This is discussed in lsp-mode issue #824, and apparently the solution is some combination of deactivating an additional minor mode, invoking lsp-disconnect through M-x (or using the 's-l w D' key binding if you have Super available), or setting lsp-enable-indentation to 'nil' (probably as a buffer-local variable, although tastes may differ).

The second thing I discovered is that in my environment this doesn't happen for Python 3 code. With my normal Python 3 GNU Emacs LSP environment, using python-lsp-server (pylsp), the LSP environment will make no changes and report 'No formatting changes provided'. My problem only happens in Python 2 buffers, and that's because in Python 2 buffers I wasn't using pylsp (which only officially supports Python 3 code) but instead the older and now unsupported pyls. Either pyls has always behaved differently than pylsp when the LSP client asks it to do formatting stuff, or at some point the LSP protocol and expectations around formatting actions changed and pyls (which has been unmaintained since 2020) didn't change to keep up.

My immediate fix was to set lsp-enable-indentation to nil in my GNU Emacs lsp-mode hook for python-mode. As a longer term thing I'm going to experiment with using pylsp even for Python 2 code, to see how it goes. Otherwise I may wind up disabling LSP for Python 2 code and buffers, although that's somewhat tricky since there are no explicit separate settings for Python 2 versus Python 3. Another immediate fix is that in the future I may be editing this particular code base more in vi(m) or perhaps sam than GNU Emacs.

(My Python 2 code is mostly or entirely written using tabs for indentation, so the presence of leading tabs is a reliable way of detecting 'Python 2' code.)

PS: This particular Python 2 program is DWiki, the wiki engine underlying Wandering Thoughts, so while it will move to Python 3 someday and I once got a hacked version vaguely running that way, it's not going to happen any time soon for multiple reasons.

String formatting with named format arguments and format flexibility

By: cks
14 December 2025 at 03:43

Suppose, not entirely hypothetically, that you have a tool that prints out records (one per line) and each record has a bunch of information associated with it, which you print out in columns. You'd like to provide a way for people to control which columns of information are printed for the records. If there's only a few options, maybe you can do this with a few different format strings using the traditional "%s %s %s" approach of positional formatting (because you're old fashioned and haven't really updated to the modern world of string formatting), but this doesn't really scale up very well; you rapidly get into a massive explosion of options and formatting.

As I was contemplating exactly this issue for a tool of mine, it belatedly occurred to me that the solution I wanted was named format arguments, instead of positional ones. Named format arguments have two great advantages here. First, you can shuffle the order that they occur in within the format string without having to change the arguments. Second, you don't have to use all of them; Python is perfectly happy if you supply extra named arguments to your string formatting that aren't used.

This means that you can simply build up a big dictionary of all of your available information for a given record (perhaps even in multiple formats, for example if you have an option to print numbers precisely or abbreviate them to K, M, G, and so on), and then either pick a formatting string or assemble it from pieces based on what columns you want to print (and how). Then you can just do the actual formatting with:

outstr = fmtstr.format_map(datadict)

It doesn't matter that you supplied (way) more information in your datadict than your assembled or chosen format string uses, or what order your format string puts things. Everything just works.
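
As a minimal sketch of this approach (the record fields, column names, and helper functions here are all made up for illustration, not taken from any real tool of mine), you can keep one format fragment per column and join the ones that were asked for:

```python
# Hypothetical data for one record, including the same number in two formats.
record = {
    "name": "eth0",
    "bytes": 1536000,
    "bytes_abbrev": "1.5M",
    "packets": 2048,
}

# One format fragment per printable column (the names are illustrative only).
COLUMNS = {
    "name": "{name:<8}",
    "bytes": "{bytes:>10}",
    "bytes_abbrev": "{bytes_abbrev:>6}",
    "packets": "{packets:>8}",
}

def build_format(columns):
    """Assemble a format string from the requested columns, in order."""
    return " ".join(COLUMNS[c] for c in columns)

def format_record(columns, datadict):
    """Format one record; unused keys in datadict are simply ignored."""
    return build_format(columns).format_map(datadict)

print(format_record(["name", "bytes_abbrev"], record))
print(format_record(["packets", "name"], record))
```

Both calls use the same dictionary; only the assembled format string changes.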

(You can use 'fmtstr % datadict' instead if you want to. I'm not sure which I'll use, but a bit of me feels that I should switch to modern Python string formatting instead of sticking with the old printf style of '%', even if it allows named arguments too.)

This feels like something that I should have realized long ago, back when named ('keyword') format arguments were added to Python, but for some reason it never clicked until now. Several of my programs are probably going to start providing a lot more options for formatting their output.

Noticing a shift in Python idioms, or my use of them

By: cks
11 December 2025 at 03:15

For reasons outside the scope of this entry, I was recently reminded of some very old entries here where I compared some Python code with some Perl code to do the same thing. One of the things that stood out to me is that way back then I said:

For example, I could have written 'print "\n".join(rr.strings)' in Python, but it doesn't feel right; I would rather write the for loop explicitly instead.

At some point between back then and now, my views on this changed without me noticing. Today I would unhesitatingly print a multi-line list of text (ie a list of lines) using the .join() version, and in fact I have; I can easily find little utility programs of mine that use this idiom (some of them a significant number of years old by now, so I don't think this is a recent shift).

What I don't know is if this was a shift in my personal views or if Python in general shifted its view of this idiom. At least some Python code seems to have been using this a long time ago, so it's entirely possible that I'm what changed and this was always considered idiomatic Python.

(My suspicion today is that '"\n".join()' probably always was idiomatic Python, at least in Python 2 and later. It's not quite as clear as a for loop but it's much more compact.)

There are probably lots of other Python idioms where either I or Python as a whole have shifted views over time. But for various reasons I rarely get my attention shoved into them the way I did this time. We do have a certain amount of old Python code that we're still using, but because it's old and reliable, I generally don't have any reason to look at it and think about the idioms it uses.

All of this makes me wonder what Python idioms I'm currently not using and thinking about that I'll consider perfectly natural and automatic in five or ten years. I should probably be using dataclasses, and then there's copious use of typing annotations (which would probably feel more natural to me if I used them frequently).

(I have a very old and now abandoned Python program, but I'm not energetic enough to pick through its code. Also, it would probably be slightly depressing.)

You can't (easily) ignore errors in Python

By: cks
22 November 2025 at 04:18

Yesterday I wrote about how there's always going to be a way to not write code for error handling. When I wrote that entry I deliberately didn't phrase it as 'ignoring errors', because in some languages it's either not possible to do that or at least very difficult, and one of them is Python.

As every Python programmer knows, errors raise exceptions in Python and you can catch those exceptions, either narrowly or (very) broadly (possibly by accident). If you don't handle an exception, it bubbles up and terminates your program (which is nice if that's what you want and does mean that errors can't be casually ignored). On the surface it seems like you can ignore errors by simply surrounding all of your code with a try:/except: block that catches everything. But if you do this, you're not ignoring errors in the same way as you do in a language where errors are return values. In a language where you can genuinely ignore errors, all of your code keeps on running when errors happen. But in Python, if you put a broad try block around your code, your code stops executing at the first exception that gets raised, rather than continuing on to the other code within the try block.

(If there's further code outside the try block, it will run but probably not work very well because there will likely be a lot that simply didn't happen inside the try block. Your code skipped right from the statement that raised the exception to the first statement outside the try block.)

To get the C or Go like experience that your program keeps running its code even after an exception, you need to effectively catch and ignore exceptions separately for each statement. You can write this out by hand, putting each statement in its own try: block, but you'll probably get tired of this very fast, the result will be hard to read, and it's extremely obviously not like regular Python. This is the sign that Python doesn't really let you ignore errors in any easy way. All Python lets you do easily is suppress messages about errors and potentially make them not terminate your program. The closer you want to get to actually ignoring all errors, the more work you'll have to do.
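
Here is a small sketch of the difference, using contextlib.suppress() as a compact stand-in for wrapping each statement in its own try: block. The function names and the deliberately failing int() call are invented for the illustration.

```python
from contextlib import suppress

def broad_try():
    """One broad try block: execution stops at the first exception."""
    log = []
    try:
        log.append("one")
        int("not a number")   # raises ValueError
        log.append("three")   # never reached
    except Exception:
        pass
    return log

def per_statement():
    """Each statement suppressed separately: later statements still run."""
    log = []
    with suppress(Exception):
        log.append("one")
    with suppress(Exception):
        int("not a number")   # raises ValueError, which is discarded
    with suppress(Exception):
        log.append("three")
    return log

print(broad_try())
print(per_statement())
```

The first function only records "one"; the second records both "one" and "three", at the cost of code that no longer looks much like normal Python.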

(There are probably clever things you can do with Python debugging hooks since I believe that Python debuggers can intercept exceptions, although I'm not sure if they can resume execution after unhandled ones. But this is not going to really be easy.)

My script to 'activate' Python virtual environments

By: cks
14 November 2025 at 03:27

After I wrote about Python virtual environments and source code trees, I impulsively decided to set up the development tree of our Django application to use a Django venv instead of a 'pip install --user' version of Django. Once I started doing this, I quickly decided that I wanted a general script that would switch me into a venv. This sounds a little bit peculiar if you know Python virtual environments so let me explain.

Activating a Python virtual environment mostly means making sure that its 'bin' directory is first on your $PATH, so that 'python3' and 'pip' and so on come from it. Venvs come with files that can be sourced into common shells in order to do this (with the one for Bourne shells called 'activate'), but for me this has three limits. You have to use the full path to the script, they change your current shell environment instead of giving you a new one that you can just exit to discard this 'activation', and I use a non-standard shell that they don't work in. My 'venv' script is designed to work around all three of those limitations. As a script, it starts a new shell (or runs a command) instead of changing my current shell environment, and I set it up so that it knows my standard place to keep virtual environments (and then I made it so that I can use symbolic links to create 'django' as the name of 'whatever my current Django venv is').

(One of the reasons I want my 'venv' command to default to running a shell for me is that I'm putting the Python LSP server into my Django venvs, so I want to start GNU Emacs from an environment with $PATH set properly to get the right LSP server.)

My initial version only looked for venvs in my standard location for development related venvs. But almost immediately after starting to use it, I found that I wanted to be able to activate pipx venvs too, so I added ~/.local/pipx/venvs to what I really should consider to be a 'venv search path' and formalize into an environment variable with a default value.
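
To make the idea concrete, here is a rough sketch in Python of the 'venv search path' lookup and the environment changes involved. It is not my actual script; the VENV_PATH variable name, the ~/venvs default directory, and all of the function names are assumptions made up for this example.

```python
import os
from pathlib import Path

# Hypothetical defaults; only the pipx location comes from the text above.
DEFAULT_SEARCH = [Path.home() / "venvs", Path.home() / ".local/pipx/venvs"]

def search_path(environ=os.environ):
    """Return the directories to search, from a (hypothetical) $VENV_PATH."""
    raw = environ.get("VENV_PATH")
    if raw:
        return [Path(p) for p in raw.split(os.pathsep) if p]
    return DEFAULT_SEARCH

def find_venv(name, dirs):
    """Find the first directory along dirs that holds a venv called name."""
    for d in dirs:
        candidate = d / name
        # A pyvenv.cfg file marks the root of a virtual environment.
        if (candidate / "pyvenv.cfg").is_file():
            return candidate.resolve()  # also follows a 'django' style symlink
    return None

def activated_env(venv, environ=os.environ):
    """Build a new environment with the venv's bin directory first on $PATH."""
    env = dict(environ)
    env["VIRTUAL_ENV"] = str(venv)
    env["PATH"] = str(venv / "bin") + os.pathsep + env.get("PATH", "")
    return env

def run_in_venv(name, argv):
    """Replace this process with argv (or a shell) running inside the venv."""
    venv = find_venv(name, search_path())
    if venv is None:
        raise SystemExit(f"no venv called {name!r} found")
    argv = argv or [os.environ.get("SHELL", "/bin/sh")]
    os.execvpe(argv[0], argv, activated_env(venv))
```

Because run_in_venv() starts a new process rather than modifying the current shell, exiting that shell discards the 'activation'.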

I've stuffed a few other features into the venv script. It will print out the full path to the venv if I ask it to (in addition to running a command, which can be just 'true'), or something to set $PATH. I also found I sometimes wanted it to change directory to the root of the venv. Right now I'm still experimenting with how I want to build other scripts on top of this one, so some of this will probably change in time.

One of my surprises about writing the script is how much nicer it's made working with venvs (or working with things in venvs). There's nothing it does that wasn't possible before, but the script has removed friction (more friction than I realized was there, which is traditional for me).

PS: This feels like a sufficiently obvious idea that I suspect that a lot of people have written 'activate a venv somewhere along a venv search path' scripts. There's unlikely to be anything special about mine, but it works with my specific shell.

Python virtual environments and source code trees

By: cks
10 November 2025 at 04:22

Python virtual environments are mostly great for actually deploying software. Provided that you're using the same version of Python (3) everywhere (including CPU architecture), you can make a single directory tree (a venv) and then copy and move it around freely as a self-contained artifact. It's also relatively easy to use venvs to switch the version of packages or programs you're using, for example Django. However, venvs have their frictions, at least for me, and often I prefer to do Python development outside of them, especially for our Django web application.

(This means using 'pip install --user' to install things like Django, to the extent that it's still possible.)

One point of friction is in their interaction with working on the source code of our Django web application. As is probably common, this source code lives in its own version control system controlled directory tree (we use Mercurial for this for reasons). If Django is installed as a user package, the native 'python3' will properly see it and be able to import Django modules, so I can directly or indirectly run Django commands with the standard Python and my standard $PATH.

If Django is installed in a venv, I have two options. The manual way is to always make sure that this Django venv is first on my $PATH before the system Python, so that 'python3' is always from the venv and not from the system. This has a little bit of a challenge with Python scripts, and is one of the few places where '#!/usr/bin/env python3' makes sense. In my particular environment it requires extra work because I don't use a standard Unix shell and so I can't use any of the venv bin/activate things to do all the work for me.

The automatic way is to make all of the convenience scripts that I use to interact with Django explicitly specify the venv python3 (including for things like running a test HTTP server and invoking local management commands), which works fine since a program can be outside the venv it uses. This leaves me with the question of where the Django venv should be, and especially if it should be outside the source tree or in a non-VCS-controlled path inside the tree. Outside the source tree is the pure option but leaves me with a naming problem that has various solutions. Inside the source tree (but not VCS controlled) is appealingly simple but puts a big blob of otherwise unrelated data into the source tree.

(Of course I could do both at once by having a 'venv' symlink in the source tree, ignored by Mercurial, that points to wherever the Django venv is today.)

Since 'pip install --user' seems more and more deprecated as time goes by, I should probably move to developing with a Django venv sooner or later. I will probably use a venv outside the source tree, and I haven't decided about an in-tree symlink.

(I'll still have the LSP server problem but I have that today. Probably I'll install the LSP server into the Django venv.)

PS: Since this isn't a new problem, the Python community has probably come up with some best practices for dealing with it. But in today's Internet search environment I have no idea how to find reliable sources.

My mistake with swallowing EnvironmentError errors in our Django application

By: cks
1 November 2025 at 02:50

We have a little Django application to handle requests for Unix accounts. Once upon a time it was genuinely little, but it's slowly accreted features over the years. One of the features it grew over the years was a command line program (a Django management command) to bulk-load account request information from files. We use this to handle things like each year's new group of incoming graduate students; rather than force the new graduate students to find the web form on their own, we get information on all of them from the graduate program people and load them into the system in bulk.

One of the things that regularly happens with new graduate students is that they were already involved on the research side of the department. For example, as an undergraduate you might work on a research project with a professor, and then you get admitted as a graduate student (maybe with that professor, or maybe with someone else). When this happens, the new graduate student already has an account and we don't want to give them another one (for various reasons). To detect situations where someone already has an existing account, the bulk loader reads some historical data out of a couple of files and looks through it to match any existing accounts to the new graduate students.

When I originally wrote the code to load data from files, for some reason I decided that it wasn't particularly bad if the files didn't exist or couldn't be read, so I wrote code that looked more or less like this:

try:
  fp = open(fname, "r")
  [process file]
  fp.close()
except EnvironmentError:
  pass

Of course, for testing purposes (and other reasons, for example to suppress this check) we should be able to change where the data files were read from, so I made the file names of the data files be argparse options, set the default values to the standard locations that the production application recorded things, and called it all good.

Except that for the past two years, one of the default file names was wrong; when I added this specific file, I made a typo in the file name. Using the command line option to change the file name worked so this passed my initial testing when I added the specific type of historical data, but in production, using my typo'd default file name, we silently never detected existing Unix logins for new graduate students (and others) through this particular type of historical data.

All of this happened because I made a deliberate design decision to silently swallow all EnvironmentError exceptions when trying to open and read these files, instead of either failing or at least reporting a warning. When I made the decision (back in 2013, it turns out), I was probably thinking that the only source of errors was if you ran it as the wrong user or deliberately supplied nonexistent files; I doubt it ever occurred to me that I could make an embarrassing typo in the name of any of the production files. One of the lessons I draw from this is that I don't always even understand the possible sources of errors, which makes it all the more dangerous to casually ignore them.
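
For contrast, here is a sketch of what 'at least reporting a warning' could look like. It isn't the code from our application; the function name, its parameters, and the warning text are invented for illustration. (In Python 3, EnvironmentError is simply another name for OSError.)

```python
import sys

def load_history(fname, process_line, required=False):
    """Feed each line of fname to process_line().

    Returns True if the file was read. If it can't be opened or read,
    either re-raise (required=True) or print a warning and return False,
    instead of silently doing nothing.
    """
    try:
        with open(fname, "r") as fp:
            for line in fp:
                process_line(line)
    except OSError as e:
        if required:
            raise
        print(f"warning: cannot read {fname}: {e}", file=sys.stderr)
        return False
    return True
```

With something like this, a typo'd default file name would have produced a visible warning on every production run.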

(Even silently ignoring nonexistent files is rather questionable in retrospect. I don't really know what I was thinking in 2013.)

Our Django model class fields should include private, internal names

By: cks
21 September 2025 at 01:30

Let me tell you about a database design mistake I made in our Django web application for handling requests for Unix accounts. Our current account request app evolved from a series of earlier systems, and one of the things that these earlier systems asked people for was their 'status' with the university; were they visitors, graduate students, undergraduate students, (new) staff, or so on. When I created the current system I copied this and so the database schema includes a 'Status' model class. The only thing I put in this model class was a text field that people picked from in our account request form, and I didn't really think of the text there as what you could call load bearing. It was just a piece of information we asked people for because we'd always asked people for it, and faithfully duplicating the old CGI was the easy way to implement the web app.

Before too long, it turned out that we wanted to do some special things if people were graduate students (for example, notifying the department's administrative people so they could update their records to include the graduate student's Unix login and email address here). The obvious simple way to implement this was to do a text match on the value of the 'status' field for a particular person; if their 'status' was "Graduate Student", we knew they were a graduate student and we could do various special things. Over time, this knowledge of what the people-visible "Graduate Student" status text was wormed its way into a whole collection of places around our account systems.

For reasons beyond the scope of this entry, we now (recently) want to change the people-visible text to be not exactly "Graduate Student" any more. Now we have a problem, because a bunch of places know that exact text (in fact I'm not sure I remember where all of those places are).

The mistake I made, way back when we first wanted things to know that an account or account request was a 'graduate student', was in not giving our 'Status' model an internal 'label' field that wasn't shown to people in addition to the text shown to people. You can practically guarantee that anything you show to people will want to change sooner or later, so just like you shouldn't make actual people-exposed fields into primary or foreign keys, none of your code should care about their value. The correct solution is an additional field that acts as the internal label of a Status (with values that make sense to us), and then using this internal label any time the code wants to match on or find the 'Graduate Student' status.
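
The shape of the fix can be sketched without Django at all. The following uses a plain dataclass as a stand-in for the real 'Status' model, and the label values and helper functions are hypothetical:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Status:
    label: str  # internal name, never shown to people, used by code
    text: str   # people-visible text, free to change at any time

GRAD_LABEL = "gradstudent"  # hypothetical internal label

STATUSES = [
    Status(label="gradstudent", text="Graduate Student"),
    Status(label="visitor", text="Visitor"),
]

def is_grad_student(status):
    """Match on the internal label, never on the visible text."""
    return status.label == GRAD_LABEL

def retitle(statuses, label, new_text):
    """Change the visible text for one status without touching its label."""
    return [replace(s, text=new_text) if s.label == label else s
            for s in statuses]

renamed = retitle(STATUSES, GRAD_LABEL, "Graduate Student (Research)")
print([s.text for s in renamed if is_grad_student(s)])
```

Changing the visible text leaves every label-based check working, which is exactly what matching on "Graduate Student" can't do.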

(In theory I could use Django's magic 'id' field for this, since we're having Django create automatic primary keys for everything, including the Status model. In practice, the database IDs are completely opaque and I'd rather have something less opaque in code instead of everything knowing that ID '14' is the Graduate Student status ID.)

Fortunately, I've had a good experience with my one Django database migration so far, so this is a fixable problem. Threading the updates through all of the code (and finding all of the places that need updates, including in outside programs) will be a bit of work, but that's what I get for taking the quick hack approach when this first came up.

(I'm sure I'm not the only person to stub my toe this way, and there's probably a well known database design principle involved that would have told me better if I'd known about it and paid attention at the time.)

Argparse will let you have multiple long (and short) options for one thing

By: cks
30 August 2025 at 03:19

Argparse is the standard Python module for handling (Unix style) command line options, in the expected way (which not all languages follow). Or at least more or less the expected way; people are periodically surprised that by default argparse allows you to abbreviate long options (although you can safely turn that off if you assume Python 3.8 or later and you remember this corner case).

What I think of as the typical language API for specifying short and long options allows you to specify (at most) one of each; this is the API of, for example, the Go package I use for option handling. When I've written Python programs using argparse, I've followed this usage without thinking very much about it. However, argparse doesn't actually require you to restrict yourself this way. The add_argument() method accepts a list of option strings, and although the documentation's example shows a single short option and a single long option, you can give it more than one of each and it will work.

So yes, you can perfectly reasonably create an argparse option that can be invoked as either '--ns' or '--no-something', so that on the one hand you have a clear canonical version and on the other hand you have something short for convenience. If I'm going to do this (and sometimes I am), the thing I want to remember is that argparse's help output will report these options in the order I gave them to add_argument(), so I probably want to list the long one first, as the canonical and clearest form. In other words:

parser.add_argument("--no-something", "--ns", ....)

so that the -h output I get says:

--no-something, --ns     Don't do something

(If you have multiple '--no-...' options, abbreviated options aren't as compact as this '--ns' style. Of course it's a little bit unusual to have several long options that mean the same thing, but my view is that long options are sort of a zoo anyway and you might as well be convenient.)
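
A small self-contained version of this, as a sketch (the 'demo' program name and the store_true action are assumptions filled in for the example), shows that both spellings set the same attribute, which argparse names after the first long option:

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "--no-something", "--ns",
    action="store_true",
    help="Don't do something",
)

# Both spellings set the same destination, derived from the first long option.
print(parser.parse_args(["--ns"]).no_something)
print(parser.parse_args(["--no-something"]).no_something)
print(parser.parse_args([]).no_something)

# The help text lists the option strings in the order they were given.
print(parser.format_help())
```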

Having multiple short (single letter) options for the same thing is also possible but much less in the Unix style, so I'm not sure I'd ever use it. One plausible use is mapping old short options to your real ones for compatibility (or just options that people are accustomed to using for some particular purpose from other programs, and keep using with yours).

(This is probably not news to anyone who's really used argparse. I'm partly writing this down so that I'll remember it in the future.)

The problem of Python's version dependent paths for packages

By: cks
9 August 2025 at 03:00

A somewhat famous thing about Python is that more or less all of the official ways to install packages put them into somewhere on the filesystem that contains the Python series version (which is things like '3.13' but not '3.13.5'). This is true for site packages, for 'pip install --user' (to the extent that it still works), and for virtual environments, however you manage them. And this is a problem because it means that any time you change to a new release, such as going from 3.12 to 3.13, all of your installed packages disappear (unless you keep around the old Python version and keep your virtual environments and so on using it).

In general, a lot of people would like to update to new Python releases. Linux distributions want to ship the latest Python (and usually do), various direct users of Python would like the new features, and so on. But these version dependent paths and their consequences make version upgrades more painful and so to some extent cause them to be done less often.

In the beginning, Python had at least two reasons to use these version dependent paths. Python doesn't promise that either its bytecode (and thus the .pyc files it generates from .py files) or its C ABI (which is depended on by any compiled packages, in .so form on Linux) are stable from version to version. Python's standard installation and bytecode processing used to put both bytecode files and compiled files alongside the .py files rather than separating them out. Since pure Python packages can depend on compiled packages, putting the two together has a certain sort of logic; if a compiled package no longer loads because it's for a different Python release, your pure Python packages may no longer work.

(Python bytecode files aren't so tightly connected so some time ago Python moved them into a '__pycache__' subdirectory and gave them a Python version suffix, eg '<whatever>.cpython-312.pyc'. Since they're in a subdirectory, they'll get automatically removed if you remove the package itself.)
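
You can ask a running Python about both of these. This is just a quick sketch using standard library calls; 'whatever.py' is a placeholder file name that doesn't need to exist, and the exact site packages path you get depends on your platform and on how your Python was packaged.

```python
import importlib.util
import sys
import sysconfig

# The Python series version, e.g. '3.13' rather than '3.13.5'.
series = f"{sys.version_info.major}.{sys.version_info.minor}"

# Where pure Python packages get installed for this interpreter.
purelib = sysconfig.get_paths()["purelib"]

# Where the bytecode for a given source file would be cached.
cached = importlib.util.cache_from_source("whatever.py")

print("series:       ", series)
print("site packages:", purelib)
print("bytecode file:", cached)
```

On a typical Linux system the site packages path includes a 'python' plus series version component, and the bytecode file name includes the interpreter's cache tag.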

An additional issue is that even pure Python packages may not be completely compatible with a new version of Python (and often definitely not with a sufficiently old version). So updating to a new Python version may call for a package update as well, not just using the same version you currently have.

Although I don't like the current situation, I don't know what Python could do to make it significantly better. Putting .py files (ie, pure Python packages) into a version independent directory structure would work some of the time (perhaps a lot of the time if you only went forward in Python versions, never backward) but blow up at other times, sometimes in obvious ways (when a compiled package couldn't be imported) and sometimes in subtle ones (if a package wasn't compatible with the new version of Python).

(It would probably also not be backward compatible to existing tools.)

Python argparse and the minor problem of a variable valid argument count

By: cks
28 July 2025 at 03:30

Argparse is the standard Python module for handling arguments to command line programs, and because for small programs, Python makes using things outside the standard library quite annoying, it's the one I use in my Python based utility programs. Recently I found myself dealing with a little problem where argparse doesn't have a good answer, partly because you can't nest argument groups.

Suppose, not hypothetically, that you have a program that can properly take zero, two, or three command line arguments (which are separate from options), and the command line arguments are of different types (the first is a string and the second two are numbers). Argparse makes it easy to handle having either two or three arguments, no more and no less; the first two arguments have no nargs set, and then the third sets 'nargs="?"'. However, as far as I can see argparse has no direct support for handling the zero-argument case, or rather for forbidding the one-argument one.

(If the first two arguments were of the same type we could easily gather them together into a two-element list with 'nargs=2', but they aren't, so we'd have to tell argparse that both are strings and then try the 'string to int' conversion of the second argument ourselves, losing argparse's handling of it.)

If you set all three arguments to 'nargs="?"' and give them usable default values, you can accept zero, two, or three arguments, and things will work if you supply only one argument (because the second argument will have a usable default). This is the solution I've adopted for my particular program because I'm not stubborn enough to try to roll my own validation on top of argparse, not for a little personal tool.
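
As a rough sketch of this approach (the argument names and default values here are invented for illustration, they aren't the ones from my program):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    # All three positional arguments are optional, so zero arguments
    # is accepted. The names and defaults are hypothetical.
    parser.add_argument("name", nargs="?", default="default-name")
    parser.add_argument("count", nargs="?", type=int, default=10)
    parser.add_argument("limit", nargs="?", type=int, default=100)
    return parser


def parse(argv):
    # Takes an explicit argument list so it can be tried out without
    # touching sys.argv.
    return build_parser().parse_args(argv)
```

With this, 'parse(["somename"])' is accepted and quietly uses the default count and limit, which is the one-argument case that argparse can't be told to reject.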

If argparse supported nested groups for arguments, you could potentially make a mutually exclusive argument group that contained two sub-groups, one with nothing in it and one that handled the two and three argument case. This would require argparse not only to support nested groups but to support empty nested groups (and not ignore them), which is at least a little bit tricky.

Alternately, argparse could support a global specification of what numbers of arguments are valid. Or it could support a 'validation' callback that is called with information about what argparse detected and which could signal errors to argparse that argparse handled in its standard way, giving you uniform argument validation and error text and so on.

Python argparse has a limitation on argument groups that makes me sad

By: cks
11 June 2025 at 02:28

Argparse is the straightforward standard library module for handling command line arguments, with a number of nice features. One of those nice features is groups of mutually exclusive arguments. If people can only give one of '--quiet' and '--verbose' and both together make no sense, you can put them in a mutually exclusive group and argparse will check for you and generate an appropriate error. However, mutually exclusive groups have a little limitation that makes me sad.

Suppose, not hypothetically, that you have a Python program that has some timeouts. You'd like people using the program to be able to adjust the various sorts of timeouts away from their default values and also to be able to switch it to a mode where it never times out at all. Generally it makes no sense to adjust the timeouts and also to say not to have any timeouts, so you'd like to put these in a mutually exclusive group. If you have only a single timeout, this works fine; you can have a group with '--no-timeout' and '--timeout <TIME>' and it works. However, if you have multiple sorts of timeouts that people may want to adjust all of, this doesn't work. If you put all of the options in a single mutually exclusive group, people can only adjust one timeout, not several of them. What you want is for the '--no-timeouts' switch to be mutually exclusive with a group of all of the timeout switches.
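
The single timeout case that does work looks something like this sketch (the default value of 30 seconds is an assumption for the example):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    # argparse rejects command lines that use more than one
    # option from a mutually exclusive group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--no-timeout", action="store_true")
    # The 30.0 default is hypothetical.
    group.add_argument("--timeout", type=float, default=30.0, metavar="TIME")
    return parser
```

Giving both '--no-timeout' and '--timeout 5' makes argparse print its usual usage and error message and exit.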

Unfortunately, if you read the current argparse documentation, you will find this note:

Changed in version 3.11: Calling add_argument_group() or add_mutually_exclusive_group() on a mutually exclusive group is deprecated. These features were never supported and do not always work correctly. The functions exist on the API by accident through inheritance and will be removed in the future.

You can nest a mutually exclusive group inside a regular group, and there are some uses for this. But you can't nest any sort of group inside a mutually exclusive group (or a regular group inside of a regular group). At least not officially, and there are apparently known issues with doing so that won't ever be fixed, so you probably shouldn't do it at all.

Oh well, it would have been nice.

(I suspect one reason that this isn't officially supported is that working out just what was conflicting with what in a pile of nested groups (and what error message to emit) might be a bit complex and require explicit code to handle this case.)

As an extended side note, checking this by hand isn't necessarily all that easy. If you have something, such as timeouts, that have a default value but can be changed by the user, the natural way to set them up in argparse is to make the argparse default value your real default value and then use the value argparse sets in your program. If the person running the program used the switch, you'll get their value, and if not you'll get your default value, and everything works out. Unfortunately this usage makes it difficult or impossible to see if the person running your program explicitly gave a particular switch. As far as I know, argparse doesn't expose this information, so at a minimum you have to know what your default value is and then check to see if the current value is different (and this doesn't catch the admittedly unlikely case of the person using the switch with the default value).
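
One possible workaround, sketched here under my own assumptions rather than anything argparse provides, is to give argparse a default of None, check for conflicts by hand, and only then fill in the real defaults. The option names and default values are made up.

```python
import argparse

# Hypothetical real default values for two sorts of timeouts.
DEFAULTS = {"connect_timeout": 10.0, "read_timeout": 30.0}


def parse(argv):
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--no-timeouts", action="store_true")
    # default=None lets us tell 'not given' apart from 'given',
    # even if the person gave the real default value.
    parser.add_argument("--connect-timeout", type=float, default=None)
    parser.add_argument("--read-timeout", type=float, default=None)
    args = parser.parse_args(argv)

    given = [name for name in DEFAULTS if getattr(args, name) is not None]
    if args.no_timeouts and given:
        # parser.error() prints the usage and this message, then exits.
        parser.error("--no-timeouts can't be combined with timeout settings")

    # Fill in the real defaults for anything not explicitly set.
    for name, value in DEFAULTS.items():
        if getattr(args, name) is None:
            setattr(args, name, value)
    return args
```

The cost is that argparse no longer knows your real defaults, so they have to be written into the help text by hand if you want them shown there.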

Adding your own attributes to Python functions and Python typing

By: cks
7 June 2025 at 02:05

Every so often I have some Python code where I have a collection of functions and along with the functions, some additional information about them. For example, the functions might implement subcommands and there might be information about help text, the number of command line arguments, and so on. There are a variety of approaches for this, but a very simple one I've tended to use is to put one or more additional attributes on the functions. This looks like:

def dosomething(....):
  [....]
dosomething._cmdhelp_ = "..."

(These days you might use a decorator on the function instead of explicitly attaching attributes, but I started doing this before decorators were as much of a thing in Python.)

Unfortunately, as I've recently discovered, this pattern is one that (current) Python type checkers don't really like. For perfectly rational reasons, Python type checkers like to verify that every attribute you're setting on something actually exists and isn't, say, a typo. Functions don't normally have random user-defined attributes and so type checkers will typically complain about this (as I found out in the course of recent experiments with new type checkers).

For new code, there are probably better patterns these days, such as writing a decorator that auto-registers the subcommand's function along with its help text, argument information, and so on. For existing code, this is a bit annoying, although I can probably suppress the warnings. It would be nice if type checkers understood this idiom but adding your own attributes to individual functions (or other standard Python types) is probably so rare there's no real point.
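
A minimal sketch of such an auto-registering decorator might look like the following. The 'subcommand' decorator and the 'COMMANDS' registry are names I've invented for the example, and it only records help text.

```python
from typing import Callable

# Hypothetical registry: subcommand name -> (function, help text).
COMMANDS: dict[str, tuple[Callable[..., object], str]] = {}


def subcommand(helptext: str):
    """Return a decorator that registers a function as a subcommand."""
    def register(func):
        # Store the information in the registry instead of as an
        # attribute on the function itself.
        COMMANDS[func.__name__] = (func, helptext)
        return func
    return register


@subcommand("do something useful")
def dosomething(args: list[str]) -> int:
    # Stand-in body for the example.
    return len(args)
```

Since nothing is attached to the function object, there are no unknown attributes for a type checker to complain about.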

(And it's not as if I'm adding this attribute to all functions in my program, only to the ones that implement subcommands.)

The moral that I draw from this is that old code that I may want to use type-inferring type checkers on (cf) may have problems beyond missing type hints and needing some simple changes to pacify the type checkers. It's probably not worth doing a big overhaul of such code to modernize it. Alternately, perhaps I want to make the code simpler and less tricky, even though it's more verbose to write (for example, explicitly listing all of the subcommand functions along with their help text and other things in a dict). The more verbose version will be easier for me in the future (or my co-workers) to follow, even if it's less clever and more typing up front.

Python type checkers work in different ways and can check different things

By: cks
5 June 2025 at 03:18

For all of the time so far that I've been poking at Python's type checking, I've known that there was more than one program for type checking but I've basically ignored that and used mypy. My understanding was that mypy was the first Python type checker and the only fully community-based one, with the other type checkers the product of corporations and sometimes at least partially tied to things like Microsoft's efforts to get everyone hooked on VSCode, and I assumed that the type checkers mostly differed in things like speed and what they integrated with. Recently, I read Pyrefly vs. ty: Comparing Python’s Two New Rust-Based Type Checkers (via) and discovered that I was wrong about this, and at least some Python type checkers work quite differently from mypy in ways that matter to me.

Famously (for those who've used it), mypy really wants you to add explicit typing information to your code. I believe it has some ability to deduce types for you, but at least for me it doesn't do very much to our programs without types (although part of this is that I need to turn on '--check-untyped-defs'). Other type checkers are more willing to be aggressive about deducing types from your code without explicit typing information. This is potentially interesting to me because we have a lot of code without types at work and we'll probably never add explicit type hints to it. Being able to use type checking to spot potential errors in this un-hinted code would be useful, if the various type checkers can understand the code well enough.

(In quick experiments, some of the type checkers need some additional hints, like explicitly initializing objects with the right types instead of 'None' or adding asserts to tell them that values are set. In theory they could deduce this stuff from code flow analysis, although in some cases it might need relatively sophisticated value propagation.)

Discovering this means that I'm at least going to keep my eye on the alternate type checkers, and maybe add some little bits to our programs to make the type checkers happier with things. These are early days for both of the new ones from the article and my experiments suggest that some of their deduced typing is off some of the time, but I can hope that will improve with more development.

(There are also some idioms that bits of our code use that probably will never be fully accepted by type checkers, but that's another entry.)

PS: My early experiments didn't turn up anything in the code that I tried it on, but then this code is already running stably in production. It would be a bit weird to discover a significant type confusion bug in any of it at this point. Still, checking is reassuring, especially about sections of the code that aren't exercised very often, such as error handling.

Python, type hints, and feeling like they create a different language

By: cks
20 May 2025 at 02:31

At this point I've only written a few, relatively small programs with type hints. At times when doing this, I've wound up feeling that I was writing programs in a language that wasn't quite exactly Python (but obviously was closely related to it). What was idiomatic in one language was non-idiomatic in the other, and I wanted to write code differently. This feeling of difference is one reason I've kept going back and forth over whether I should use type hints (well, in personal programs).

Looking back, I suspect that this is partly a product of a style where I tried to use typing.NewType a lot. As I found out, this may not really be what I want to do. Using type aliases (or just structural descriptions of the types) seems like it's going to be easier, since it's mostly just a matter of marking up things. I also suspect that this feeling that typed Python is a somewhat different language from plain Python is a product of my lack of experience with typed Python (which I can fix by doing more with types in my own code, perhaps revising existing programs to add type annotations).

However, I suspect some of this feeling of difference is that you (I) want to structure 'typed' Python code differently than untyped code. In untyped Python, duck typing is fine, including things like returning None or some meaningful type, and you can to a certain extent pass things around without caring what type they are. In this sort of situation, typed Python has pushed me toward narrowing the types involved in my code (although typing.Optional can help here). Sometimes this is a good thing; at other times, I wind up using '0.0' to mean 'this float value is not set' when in untyped Python I would use 'None' (because propagating the type difference of the second way through the code is too annoying). Or to put it another way, typed Python feels less casual, and there are good and bad aspects to this.
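
To illustrate what propagating the 'None or a float' version looks like, here is a small made-up example; the function names are hypothetical.

```python
from typing import Optional


def average(values: list[float]) -> Optional[float]:
    # None means 'there is no meaningful value', as in untyped Python.
    if not values:
        return None
    return sum(values) / len(values)


def describe(values: list[float]) -> str:
    avg = average(values)
    # Every caller has to narrow the type like this before using the
    # result as a float, or the type checker complains.
    if avg is None:
        return "no data"
    return f"average {avg:.1f}"
```

It's this check in every caller that makes it tempting to just return '0.0' and declare the return type as a plain float.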

Unfortunately, one significant source of Python code that I work on is effectively off limits for type hints, and that's the Python code I write for work. For that code, I need to stick to the subset of Python that my co-workers know and can readily understand, and that subset doesn't include Python's type hints. I could try to teach my co-workers about type hints, but my view is that if I'm wrestling with whether it's worth it, my co-workers will be even less receptive to the idea of trying to learn and remember them (especially when they look at my Python code only infrequently). If we were constantly working with medium to large Python programs where type hints were valuable for documenting things and avoiding irritating errors it would be one thing, but as it is our programs are small and we can go months between touching any Python code. I care about Python type hints and have active exposure to them, and even I have to refresh my memory on them from time to time.

(Perhaps some day type hints will be pervasive enough in third party Python code and code examples that my co-workers will absorb and remember them through osmosis, but that day isn't today.)

Updating venv-based things by replacing the venv not updating it

By: cks
29 April 2025 at 03:01

These days, we have mostly switched over to installing third-party Python programs (and sometimes things like Django) in virtual environments instead of various past practices. This is clearly the way Python expects you to do things and increasingly problems emerge if you don't. One of the issues I've been thinking about is how we want to handle updating these programs when they release new versions, because there are two approaches.

One option would be to update the existing venv in place, through various 'pip' commands. However, pip-based upgrades have some long standing issues, and also they give you no straightforward way to revert an upgrade if something goes wrong. The other option is to build a separate venv with the new version of the program (and all of its current dependency versions) and then swap the whole new venv into place, which works because venvs can generally be moved around. You can even work with symbolic links, creating a situation where you refer to 'dir/program', which is a symlink to 'dir/venvs/program-1.2.0' or 'dir/venvs/program-1.3.0' or whatever you want today.

In practice we're more likely to have 'dir/program' be a real venv and just create 'dir/program-new', rename directories, and so on. The full scale version with always versioned directories is likely to only be used for things, like Django, where we want to be able to easily see what version we're running and switch back very simply.
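
The symlink version of the switch could be sketched like this, assuming a Unix system. The 'switch_venv' function and the '.tmp' suffix are my inventions for the example; this isn't a script we actually use.

```python
import os


def switch_venv(link_path: str, new_target: str) -> None:
    """Repoint the symlink at link_path to new_target."""
    tmp_link = link_path + ".tmp"
    # Clean up any leftover temporary link from an earlier failed run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # On Unix, renaming over the old symlink replaces the link itself
    # in one step, so 'dir/program' always points at some venv.
    os.replace(tmp_link, link_path)
```

Switching back is the same call with the old venv directory as the target.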

Our Django versions were always going to be handled by building entirely new venvs and switching to them (it's the venv version of what we did before). We haven't had upgrades of other venv based programs until recently, and when I started thinking about it, I reached the obvious conclusion: we'll update everything by building a new venv and replacing the old one, because this deals with pretty much all of the issues at the small cost of yet more disk space for yet more venvs.

(This feels quite obvious once I'd made the decision, but I want to write it down anyway. And who knows, maybe there are reasons to update venvs in place. The one that I can think of is to only change the main program version but not any of the dependencies, if they're still compatible.)

Using PyPy (or thinking about it) exposed a bug in closing files

By: cks
2 March 2025 at 03:20

Over on the Fediverse, I said:

A fun Python error some code can make and not notice until you run it under PyPy is a function that has 'f.close' at the end instead of 'f.close()' where f is an open()'d file.

(Normal CPython will immediately close the file when the function returns due to refcounted GC. PyPy uses non-refcounted GC so the file remains open until GC happens, and so you can get too many files open at once. Not explicitly closing files is a classic PyPy-only Python bug.)

When a Python file object is garbage collected, Python arranges to close the underlying C level file descriptor if you didn't already call .close(). In CPython, garbage collection is deterministic and generally prompt; for example, when a function returns, all of its otherwise unreferenced local variables will be garbage collected as their reference counts drop to zero. However, PyPy doesn't use reference counting for its garbage collection; instead, like Go, it only collects garbage periodically, and so will only close files as a side effect some time later. This can make it easy to build up a lot of open files that aren't doing anything, and possibly run your program out of available file descriptors, something I've run into in the past.

I recently wanted to run a hacked up version of a NFS monitoring program written in Python under PyPy instead of CPython, so it would run faster and use less CPU on the systems I was interested in. Since I remembered this PyPy issue, I found myself wondering if it properly handled closing the file(s) it had to open, or if it left it to CPython garbage collection. When I looked at the code, what I found can be summarized as 'yes and no':

def parse_stats_file(filename):
  [...]
  f = open(filename)
  [...]
  f.close

  return ms_dict

Because I was specifically looking for uses of .close(), the lack of the '()' immediately jumped out at me (and got fixed in my hacked version).

It's easy to see how this typo could linger undetected in CPython. The line 'f.close' itself does nothing but isn't an error, and then 'f' is implicitly closed in the next line, as part of the 'return', so even if you're looking at this program's file descriptor usage while it's running you won't see any leaks.
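
The usual way to avoid this class of bug is to use a 'with' block, which closes the file when the block ends on both CPython and PyPy. In this sketch the parsing of 'key value' lines is made up; the real program's parsing is elided above.

```python
def parse_stats_file(filename):
    ms_dict = {}
    # The file is closed when the 'with' block exits, even on an
    # exception, without depending on garbage collection.
    with open(filename) as f:
        for line in f:
            # Hypothetical parsing: two whitespace-separated fields.
            fields = line.split()
            if len(fields) == 2:
                ms_dict[fields[0]] = fields[1]
    return ms_dict
```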

(I'm not entirely a fan of nondeterministic garbage collection, at least in the context of Python, where deterministic GC was a long standing feature of the language in practice.)

Providing pseudo-tags in DWiki through a simple hack

By: cks
10 February 2025 at 03:56

DWiki is the general filesystem based wiki engine that underlies this blog, and for various reasons having to do with how old it is, it lacks a number of features. One of the features that I've wanted for more than a decade has been some kind of support for attaching tags to entries and then navigating around using them (although doing this well isn't entirely easy). However, it was always a big feature, both in implementing external files of tags and in tagging entries, and so I never did anything about it.

Astute observers of Wandering Thoughts may have noticed that some years ago, it acquired some topic indexes. You might wonder how this was implemented if DWiki still doesn't have tags (and the answer isn't that I manually curate the lists of entries for each topic, because I'm not that energetic). What happened is that when the issue was raised in a comment on an entry, I realized that I sort of already had tags for some topics because of how I formed the 'URL slugs' of entries (which are their file names). When I wrote about some topics, such as Prometheus, ZFS, or Go, I'd almost always put that word in the wikiword that became the entry's file name. This meant that I could implement a low rent version of tags simply by searching the (file) names of entries for words that matched certain patterns. This was made easier because I already had code to obtain the general list of file names of entries since that's used for all sorts of things in a blog (syndication feeds, the front page, and so on).

That this works as well as it does is a result of multiple quirks coming together. DWiki is a wiki so I try to make entry file names be wikiwords, and because I have an alphabetical listing of all entries that I look at regularly, I try to put relevant things in the file name of entries so I can find them again and all of the entries about a given topic sort together. Even in a file based blog engine, people don't necessarily form their file names to put a topic in them; you might make the file name be a slug-ized version of the title, for example.

(The actual implementation allows for both positive and negative exceptions. Not all of my entries about Go have 'Go' as a word, and some entries with 'Go' in their file name aren't about Go the language, eg.)
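
The general idea can be sketched in a few lines of Python. This isn't DWiki's actual code; the function names, the way wikiwords are split into words, and the entry names in the usage note are all invented for illustration.

```python
import re

# Split a wikiword into its words, keeping runs of capitals such as
# 'ZFS' together: 'ZFSPoolNotes' -> ['ZFS', 'Pool', 'Notes'].
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*")


def wikiwords(name):
    return WORD_RE.findall(name)


def entries_for_topic(names, topic, include=(), exclude=()):
    """Return the entry names that count as being about a topic.

    'include' lists entries that are about the topic without having
    it in their name; 'exclude' lists entries that have the word in
    their name but aren't about the topic.
    """
    result = []
    for name in names:
        if name in exclude:
            continue
        if name in include or topic in wikiwords(name):
            result.append(name)
    return result
```

For example, with entries named 'GoTimeoutsNotes' and 'LetItGo', asking for the 'Go' topic with 'LetItGo' in the exclusions returns only the first.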

Since the implementation is a hack that doesn't sit cleanly within DWiki's general model of the world, it has some unfortunate limitations (so far, although fixing them would require more hacks). One big one is that as far as the rest of DWiki is concerned, these 'topic' indexes are plain pages with opaque text that's materialized through internal DWikiText rendering. As such, they don't (and can't) have Atom syndication feeds, the way proper fully supported tags would (and you can't ask for 'the most recent N Go entries', and so on; basically there are no blog-like features, because they all require directories).

One of the lessons I took from the experience of hacking pseudo-tag support together was that as usual, sometimes the perfect (my image of nice, generalized tags) is the enemy of the good enough. My solution for Prometheus, ZFS, and Go as topics isn't at all general, but it works for these specific needs and it was easy to put together once I had the idea. Another lesson is that sometimes you have more data than you think, and you can do a surprising amount with it once you realize this. I could have implemented these simple tags years before I did, but until the comment gave me the necessary push I just hadn't thought about using the information that was already in entry names (and that I myself used when scanning the list).

A change in the handling of PYTHONPATH between Python 3.10 and 3.12

By: cks
22 January 2025 at 03:40

Our long time custom for installing Django for our Django based web application was to install it with 'python3 setup.py install --prefix /some/where', and then set a PYTHONPATH environment variable that pointed to /some/where/lib/python<ver>/site-packages. Up through at least Python 3.10 (in Ubuntu 22.04), you could start Python 3 and then successfully do 'import django' with this; in fact, it worked on different Python versions if you were pointing at the same directory tree (in our case, this directory tree lives on our NFS fileservers). In our Ubuntu 24.04 version of Python 3.12 (which also has the Ubuntu packaged setuptools installed), this no longer works, which is inconvenient to us.

(It also doesn't seem to work in Fedora 40's 3.12.8, so this probably isn't something that Ubuntu 24.04 broke by using an old version of Python 3.12, unlike last time.)

The installed site-packages directory contains a number of '<package>.egg' directories, a site.py file that I believe is generic, and an easy-install.pth that lists the .egg directories. In Python 3.10, strace says that Python 3 opens site.py and then easy-install.pth during startup, and then in a running interpreter, 'sys.path' contains the .egg directories. In Python 3.12, none of this happens, although CPython does appear to look at the overall 'site-packages' directory and 'sys.path' contains it, as you'd expect. Manually adding the .egg directories to a 3.12 sys.path appears to let 'import django' work, although I don't know if everything is working correctly.
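
The manual addition amounts to something like this sketch. The function names are mine, and as mentioned I don't know that everything works correctly afterward.

```python
import glob
import os
import sys


def egg_directories(site_packages):
    """Return the '<package>.egg' directories in a site-packages tree."""
    pattern = os.path.join(site_packages, "*.egg")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def add_eggs_to_path(site_packages):
    # Roughly what processing easy-install.pth did at startup in 3.10:
    # put each .egg directory on sys.path so its packages are importable.
    for egg in egg_directories(site_packages):
        if egg not in sys.path:
            sys.path.append(egg)
```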

I looked through the 3.11 and 3.12 "what's new" documentation (3.11, 3.12) but couldn't find anything obvious. I suspect that this is related to the removal of distutils in 3.12, but I don't know enough to say for sure.

(Also, if I use our usual Django install process, the Ubuntu 24.04 Python 3.12 installs Django in a completely different directory setup than in 3.10; it now winds up in <top level>/local/lib/python3.12/dist-packages. Using 'pip install --prefix ...' does create something where pointing PYTHONPATH at the 'dist-packages' subdirectory appears to work. There's also 'pip install --target', which I'd forgotten about until I stumbled over my old entry.)

All of this makes it even more obvious to me than before that the Python developers expect everyone to use venvs and anything else is probably going to be less and less well supported in the future. Installing system-wide is probably always going to work, and most likely also 'pip install --user', but I'm not going to hold my breath for anything else.

(On Ubuntu 24.04, obviously we'll have to move to a venv based Django installation. Fortunately you can use venvs with programs that are outside the venv.)

Some stuff about how Apache's mod_wsgi runs your Python apps (as of 5.0)

By: cks
17 January 2025 at 04:13

We use mod_wsgi to host our Django application, but if I understood the various mod_wsgi settings for how to run your Python WSGI application when I originally set it up, I've forgotten it all since then. Due to recent events, exactly how mod_wsgi runs our application and what we can control about that is now quite relevant, so I spent some time looking into things and trying to understand settings. Now it's time to write all of this down before I forget it (again).

Mod_wsgi can run your WSGI application in two modes, as covered in the quick configuration guide part of its documentation: embedded mode, which runs a Python interpreter inside a regular Apache process, and daemon mode, where one or more Apache processes are taken over by mod_wsgi and used exclusively to run WSGI applications. Normally you want to use daemon mode, and you have to use daemon mode if you want to do things like run your WSGI application as a Unix user other than the web server's normal user or use packages installed into a Python virtual environment.

(Running as a separate Unix user puts some barriers between your application's data and a general vulnerability that gives the attacker read and/or write access to anything the web server has access to.)

To use daemon mode, you need to configure one or more daemon processes with WSGIDaemonProcess. If you're putting packages (such as Django) into a virtual environment, you give an appropriate 'python-home=' setting here. Your application itself doesn't have to be in this venv. If your application lives outside your venv, you will probably want to set either or both of 'home=' and 'python-path=' to, for example, its root directory (especially if it's a Django application). The corollary to this is that any WSGI application that uses a different virtual environment, or 'home' (starting current directory), or Python path needs to be in a different daemon process group. Everything that uses the same process group shares all of those.

To associate a WSGI application or a group of them with a particular daemon process, you use WSGIProcessGroup. In simple configurations you'll have WSGIDaemonProcess and WSGIProcessGroup right next to each other, because you're defining a daemon process group and then immediately specifying that it's used for your application.

Within a daemon process, WSGI applications can run in either the main Python interpreter or a sub-interpreter (assuming that you don't have sub-interpreter specific problems). If you don't set any special configuration directive, each WSGI application will run in its own sub-interpreter and the main interpreter will be unused. To change this, you need to set something for WSGIApplicationGroup, for instance 'WSGIApplicationGroup %{GLOBAL}' to run your WSGI application in the main interpreter.

Some WSGI applications can cohabit with each other in the same interpreter (where they will potentially share various bits of global state). Other WSGI applications are one to an interpreter, and apparently Django is one of them. If you need your WSGI application to have its own interpreter, there are two ways to achieve this; you can either give it a sub-interpreter within a shared daemon process, or you can give it its own daemon process and have it use the main interpreter in that process. If you need different virtual environments for each of your WSGI applications (or different Unix users), then you'll have to use different daemon processes and you might as well have everything run in their respective main interpreters.

(After recent experiences, my feeling is that processes are probably cheap and sub-interpreters are a somewhat dark corner of Python that you're probably better off avoiding unless you have a strong reason to use them.)

You normally specify your WSGI application to run (and what URL it's on) with WSGIScriptAlias. WSGIScriptAlias normally infers both the daemon process group and the (sub-interpreter) 'application group' from its context, but you can explicitly set either or both. As the documentation notes (now that I'm reading it):

If both process-group and application-group options are set, the WSGI script file will be pre-loaded when the process it is to run in is started, rather than being lazily loaded on the first request.

I'm tempted to deliberately set these to their inferred values simply so that we don't get any sort of initial load delay the first time someone hits one of the exposed URLs of our little application.

For our Django application, we wind up with a collection of directives like this (in its virtual host):

WSGIDaemonProcess accounts ....
WSGIProcessGroup accounts
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias ...

(This also needs a <Directory> block to allow access to the Unix directory that the WSGIScriptAlias 'wsgi.py' file is in.)

If we added another Django application in the same virtual host, I believe that the simple update to this would be to add:

WSGIDaemonProcess secondapp ...
WSGIScriptAlias ... process-group=secondapp application-group=%{GLOBAL}

(Plus the <Directory> permissions stuff.)

Otherwise we'd have to mess around with setting the WSGIProcessGroup and WSGIApplicationGroup on a per-directory basis for at least the new application. If we specify them directly in WSGIScriptAlias we can skip that hassle.

(We didn't used to put Django in a venv, but as of Ubuntu 24.04, using a venv seems the easiest way to get a particular Django version into some spot where you can use it. Our Django application doesn't live inside the venv, but we need to point mod_wsgi at the venv so that our application can do 'import django.<...>' and have it work. Multiple Django applications could all share the venv, although they'd have to use different WSGIDaemonProcess settings, or at least different names with the same other settings.)

(Multiple) inheritance in Python and implicit APIs

By: cks
16 January 2025 at 04:16

The ultimate cause of our mystery with Django on Ubuntu 24.04 is that versions of Python 3.12 before 3.12.5 have a bug where builtin types in sub-interpreters get unexpected additional slot wrappers (also), and Ubuntu 24.04 has 3.12.3. Under normal circumstances, 'list' itself doesn't have a '__str__' method but instead inherits it from 'object', so if you have a class that inherits from '(list,YourClass)' and YourClass defines a __str__, the YourClass.__str__ is what gets used. In a sub-interpreter, there is a list.__str__ and suddenly YourClass.__str__ isn't used any more.

(mod_wsgi triggers this issue because in a straightforward configuration, it runs everything in sub-interpreters.)

This was an interesting bug, and one of the things it made me realize is that the absence of a __str__ method on 'list' itself had implicitly become part of list's API. Django had set up class definitions that were 'class Something(..., list, AMixin)', where the 'AMixin' had a direct __str__ method, and Django expected that to work. This only works as long as 'list' doesn't have its own __str__ method and instead gets it through inheritance from object.__str__. Adding such a method to 'list' would break Django and anyone else counting on this behavior, making the lack of the method an implicit API.

(You can get this behavior with more or less any method that people might want to override in such a mixin class, but Python's special methods are probably especially prone to it.)
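A minimal sketch of the pattern may make this concrete. All of the class names here ('RenderMixin', 'Errors', 'ListWithStr', 'Broken') are hypothetical stand-ins, not Django's classes, and the last two only simulate a list-like base that has gained its own __str__; they don't reproduce the sub-interpreter bug itself.

```python
class RenderMixin:
    """Hypothetical mixin that supplies __str__ for whatever it is mixed into."""
    def __str__(self):
        return ", ".join(str(item) for item in self)


class Errors(list, RenderMixin):
    """'list' comes first in the MRO, but has no __str__ of its own."""


class ListWithStr(list):
    """Simulates a 'list' that has gained a direct __str__ method."""
    __str__ = list.__repr__


class Broken(ListWithStr, RenderMixin):
    """Same shape as Errors, but now the list-like base wins the lookup."""


# With a plain list base, lookup skips past list and finds the mixin's method.
print(repr(str(Errors())))
# With a list-like base that defines __str__, the mixin is never reached.
print(repr(str(Broken())))
```

In a normal interpreter the first print shows an empty string and the second shows '[]', which is the same symptom as the empty error lists in the Django forms.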

Before I ran into this issue, I probably would have assumed that where in the class tree a special method like __str__ was implemented was simply an implementation detail, not something that was visible as part of a class's API. Obviously, I would have been wrong. In Python, you can tell the difference and quite easily write code that depends on it, code that was presumably natural to experienced Python programmers.

(Possibly the existence of this implicit API was obvious to experienced Python programmers, along with the implication that various builtin types that currently don't have their own __str__ can't be given one in the future.)

A mystery with Django under Apache's mod_wsgi on Ubuntu 24.04

By: cks
14 January 2025 at 04:10

We have a long standing Django web application that these days runs under Python 3 and a more modern version of Django. For as long as it has existed, it's had some forms that were rendered to HTML through templates, and it has rendered errors in those forms in what I think of as the standard way:

{{ form.non_field_errors }}
{% for field in form %}
  [...]
  {{ field.errors }}
  [...]
{% endfor %}

This web application runs in Apache using mod_wsgi, and I've recently been working on moving the host this web application runs on to Ubuntu 24.04 (still using mod_wsgi). When I stood up a test virtual machine and looked at some of these HTML forms, what I saw was that when there were no errors, each place that errors would be reported was '[]' instead of blank. This did not happen if I ran the web application on the same test machine in Django's 'runserver' development testing mode.

At first I thought that this was something to do with locales, but the underlying cause is much more bizarre and inexplicable to me. The template operation for form.non_field_errors results in calling Form.non_field_errors(), which returns a django.forms.utils.ErrorList object (which is also what field.errors winds up being). This class is a multiple-inheritance subclass of UserList, list, and django.forms.utils.RenderableErrorMixin. The latter is itself a subclass of django.forms.utils.RenderableMixin, which defines a __str__() special method value that is RenderableMixin.render(), which renders the error list properly, including rendering it as a blank if the error list is empty.

In every environment except under Ubuntu 24.04's mod_wsgi, ErrorList.__str__ is RenderableMixin.render and everything works right for things like 'form.non_field_errors' and 'field.errors'. When running under Ubuntu 24.04's mod_wsgi, and only then, ErrorList.__str__ is actually the standard list.__str__, so empty lists render as '[]' (and had I tried to render any forms with actual error reports, worse probably would have happened, especially since list.__str__ isn't carefully escaping special HTML characters).

I have no idea why this is happening in the 24.04 mod_wsgi. As far as I can tell, the method resolution order (MRO) for ErrorList is the same under mod_wsgi as outside it, and sys.path is the same. The RenderableErrorMixin class is getting included as a parent of ErrorList, which I can tell because RenderableMixin also provides a __html__ definition, and ErrorList.__html__ exists and is correct.
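When the MRO looks identical, one thing that can still differ is which class in that MRO has the method in its own namespace. A small diagnostic along the following lines can report that. This is only a sketch; 'defining_class', 'RenderMixin', and 'ErrorListLike' are hypothetical names used for illustration, and in the real case you would pass in django.forms.utils.ErrorList instead.

```python
def defining_class(cls, name):
    """Return the first class in cls's MRO whose own namespace defines name."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


class RenderMixin:
    """Hypothetical stand-in for the rendering mixin."""
    def __str__(self):
        return "rendered"

    def __html__(self):
        return str(self)


class ErrorListLike(list, RenderMixin):
    """Hypothetical stand-in for ErrorList."""


for method in ("__str__", "__html__", "__repr__"):
    print(method, "comes from", defining_class(ErrorListLike, method).__name__)
```

Running the same check inside and outside mod_wsgi would show whether '__str__' is being supplied by a different class in the two environments even though the MRO itself is unchanged.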

The workaround for this specific situation is to explicitly render errors to some format instead of counting on the defaults; I picked .as_ul(), because this is what we've normally gotten so far. However the whole thing makes me nervous since I don't understand what's special about the Ubuntu 24.04 mod_wsgi and who knows if other parts of Django are affected by this.

(The current Django and mod_wsgi setup is running from a venv, so it should also be fully isolated from any Ubuntu 24.04 system Python packages.)

(This elaborates on a grumpy Fediverse post of mine.)

Two views of Python type hints and catching bugs

By: cks
23 December 2024 at 04:03

I recently wrote a little Python program where I ended up adding type hints, an experience that I eventually concluded was worth it overall even if it was sometimes frustrating. I recently fixed a small bug in the program; like many of my bugs, it was a subtle logic bug that wasn't caught by typing (and I don't think it would have been caught by any reasonable typing).

One view you could take of type hints is that they often don't catch any actual bugs, and so you can question their worth (when viewed only from a bug catching perspective). Another view, one that I'm more inclined to, is that type hints sweep away the low hanging fruit of bugs. A type confusion bug is almost always found pretty fast when you try to use the code, because your code usually doesn't work at all. However, using type hints and checking them provides early and precise detection of these obvious bugs, so you get rid of them right away before they take up your time with you trying to work out why this object doesn't have the methods or fields that you expect.

("Type hints", which is to say documenting what types are used where for what, also have additional benefits, such as accurate documentation and enabling type based things in IDEs, LSP servers, and so on.)

So although my use of type hints and mypy didn't catch this particular logic oversight, my view of them remains positive. And type hints did help me make sure I wasn't adding an obvious bug when I fixed this issue (my fix required passing an extra argument to something, creating an opportunity for a bit of type confusion if I got the arguments wrong).

Sidebar: my particular non-type bug

This program reports the current, interesting alerts from our Prometheus metrics system. For various reasons, it supports getting the alerts as of some specific time, not just 'now', and it also filters out some alerts when they aren't old enough. My logic bug was with the filtering; in order to compute the age of an alert, I did:

age = time.time() - alert_started_at

The logic problem is that when I'm getting the alerts at a particular time instead of 'now', I also want to compute the age of the alert as of that time, not as of 'right now'. So I don't want 'time.time()', I want 'as of the logical time when we're obtaining this information'.

(This sort of logic oversight is typical for non-obvious bugs that linger in my programs after they're basically working. I only noticed it because I was adding a new filter, and needed to get the alerts as of a time when what I wanted to filter out was happening.)

Python type hints are probably "worth it" in the large for me

By: cks
30 November 2024 at 04:07

I recently added type hints to a little program, and that experience wasn't entirely positive, which left me feeling that maybe I shouldn't bother. Because I don't promise to be consistent, I went back and re-added type hints to the program all over again, starting from the non-hinted version. This time I did the type hints rather differently and the result came out well enough that I'm going to keep it.

Perhaps my biggest change was to entirely abandon NewType(). Instead I set up two NamedTuples and used type aliases for everything else, which amounts to three type aliases in total. Since I was using type aliases anyway, I only added them when it was annoying to enter the real type (and I was doing it often enough). I skipped doing a type alias for 'list[namedTupleType]' because I couldn't come up with a name that I liked well enough, and because the fact that it's a list is fundamental to how it's interacted with in the code involved, so I didn't feel like obscuring that.

Adding type hints 'for real' had the positive aspect of encouraging me to write a bunch of comments about what things were and how they worked, which will undoubtedly help future me when I want to change something in six months. Since I was using NamedTuples, I changed to accessing the elements of the tuples through the names instead of the indexes, which improved the code. I had to give up 'list(adict.items())' in favour of a list comprehension that explicitly created the named tuple, but this is probably a good thing for the overall code quality.
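As a rough illustration of that change, here is a sketch with a hypothetical 'HostAlerts' named tuple and a hypothetical 'to_host_list' function; the real program's types and names aren't shown in this entry.

```python
from typing import NamedTuple


class HostAlerts(NamedTuple):
    """Hypothetical named tuple for one host and the names of its alerts."""
    host: str
    alerts: list[str]


def to_host_list(adict: dict[str, list[str]]) -> list[HostAlerts]:
    # list(adict.items()) would produce plain tuples, so build the
    # named tuples explicitly in a list comprehension instead.
    return [HostAlerts(host, alerts) for host, alerts in adict.items()]


hosts = to_host_list({"web1": ["disk-full"], "db1": []})
# Elements are accessed by name rather than by index.
print(hosts[0].host, hosts[0].alerts)
```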

(I also changed the type of one thing I had as 'int' to a float, which is what it really should have been all along even if all of the normal values were integers.)

Overall, I think I've come around to the view that doing all of this is good for me in the same way that using shellcheck is good for my shell scripts, even if I sometimes roll my eyes at things it says. I also think that just making mypy silent isn't the goal I should be aiming for. Instead, I should be aiming for what I did to my program on this second pass, doing things like introducing named tuples (in some form), adding comments, and so on. Adding final type hints should be a prompt for a general cleanup.

(Perhaps I'll someday get to a point where I add basic type hints as I write the code initially, just to codify my belief about the shape of what I'm returning and passing in, and use them to find my mistakes. But that day is probably not today, and I'll probably want better LSP integration for it in my GNU Emacs environment.)

Some notes on my experiences with Python type hints and mypy

By: cks
28 November 2024 at 04:35

As I thought I might, today I spent some time adding full and relatively honest type hints to my recent Python program. The experience didn't go entirely smoothly and it left me with a number of learning experiences and things I want to note down in case I ever do this again. The starting point is that my normal style of coding small programs is to not make classes to represent different sorts of things and instead use only basic built in collection types, like lists, tuples, dictionaries, and so on. When you use basic types this way, it's very easy to pass or return the wrong 'shape' of thing (I did it once in the process of writing my program), and I'd like Python type hints to be able to tell me about this.

(The first note I want to remember is that mypy becomes very irate at you in obscure ways if you ever accidentally reuse the same (local) variable name for two different purposes with two different types. I accidentally reused the name 'data', using it first for a str and second for a dict that came from an 'Any' typed object, and the mypy complaints were hard to decode; I believe it complained that I couldn't index a str with a str on a line where I did 'data["key"]'.)
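A sketch of how to stay out of this trap is to give each purpose its own name. The function 'get_key' and the JSON input are hypothetical; the assumption is that the 'Any' typed object came from something like json.loads(), which is annotated as returning 'Any'.

```python
import json


def get_key(raw: str) -> str:
    text = raw.strip()          # a str
    parsed = json.loads(text)   # an 'Any' typed value, under its own name
    # The indexing is done on 'parsed', so a checker has no reason to
    # think that a str is being indexed with a str.
    return str(parsed["key"])


print(get_key(' {"key": "value"} '))
```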

When you work with data structures created from built in collections, you can wind up with long, tangled compound type names, like 'tuple[str, list[tuple[str, int]]]' (which is a real type in my program). These are annoying to keep typing and easy to make mistakes with, so Python type hints provide two ways of giving them short names, in type aliases and typing.NewType. These look almost the same:

# type alias:
type hostAlertsA = tuple[str, list[tuple[str, int]]]

# NewType():
hostAlertsT = NewType('hostAlertsT', tuple[str, list[tuple[str, int]]])

The problem with type aliases is that they are aliases. All aliases for a type are considered to be the same, and mypy won't warn if you call a function that expects one with a value that was declared to be another. Suppose you have two sorts of strings, ones that are a host name and ones that are an alert name, and you would like to keep them straight. Suppose that you write:

# simple type aliases
type alertName = str
type hostName = str

def manglehost(hname: hostName) -> hostName:
  [....]

Because these are only type aliases and because all type aliases are treated as the same, you have not achieved your goal of keeping yourself from confusing host and alert names when you call 'manglehost()'. In order to do this, you need to use NewType(), at which point mypy will complain (and also often force you to explicitly mark bare strings as one or the other, with 'alertName(yourstr)' or 'hostName(yourstr)').
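A sketch of the NewType() version follows. The body of 'manglehost' is a made-up placeholder (lowercasing the name), since the real function isn't shown.

```python
from typing import NewType

alertName = NewType('alertName', str)
hostName = NewType('hostName', str)


def manglehost(hname: hostName) -> hostName:
    # Placeholder body. The result of .lower() is a plain str, so it has
    # to be explicitly marked as a hostName again.
    return hostName(hname.lower())


host = hostName("Apps0.Example.Org")
alert = alertName("disk-full")

print(manglehost(host))
# A type checker should reject 'manglehost(alert)' and 'manglehost("bare")'.
# At runtime both would still run, because NewType() values are ordinary strs.
```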

If I want as much protection as possible against this sort of type confusion, I want to make as many things as possible be NewType()s instead of type aliases. Unfortunately NewType()s have some drawbacks in mypy for my sort of usage as far as I can see.

The first drawback is that you cannot create a NewType of 'Any':

error: Argument 2 to NewType(...) must be subclassable (got "Any")  [valid-newtype]

In order to use NewType, I must specify concrete details of my actual (current) implementation, rather than saying just 'this is a distinct type but anything can be done with it'.

The second drawback is that this distinct typing is actually a problem when you do certain sorts of transformations of collections. Let's say we have alerts, which have a name and a start time, and hosts, which have a hostname and a list of alerts:

alertT  = NewType('alertT',  tuple[str, int])
hostAlT = NewType('hostAlT', tuple[str, list[alertT]])

We have a function that receives a dictionary where the keys are hosts and the values are their alerts and turns it into a sorted list of hosts and their alerts, which is to say a list[hostAlT]. The following Python code looks sensible on the surface:

def toAlertList(hosts: dict[str, list[alertT]]) -> list[hostAlT]:
  linear = list(hosts.items())
  # Don't worry about the sorting for now
  return linear

If you try to check this, mypy will declare:

error: Incompatible return value type (got "list[tuple[str, list[alertT]]]", expected "list[hostAlT]")  [return-value]

Initially I thought this was mypy being limited, but in writing this entry I've realized that mypy is correct. Our .items() returns a tuple[str, list[alertT]], but while it has the same shape as our hostAlT, it is not the same thing; that's what it means for hostAlT to be a distinct type.

However, it is a problem that as far as I know, there is no type checked way to get mypy to convert the list we have into a list[hostAlT]. If you create a new NewType to be the list type, call it 'aListT', and try to convert 'linear' to it with 'l2 = aListT(linear)', you will get more or less the same complaint:

error: Argument 1 to "aListT" has incompatible type "list[tuple[str, list[alertT]]]"; expected "list[hostAlT]"  [arg-type]

This is a case where as far as I can see I must use a type alias for 'hostAlT' in order to get the structural equivalence conversion, or alternately use the wordier and as far as I know less efficient list comprehension version of list() so that I can tell mypy that I'm transforming each key/value pair into a hostAlT value:

linear = [hostAlT(x) for x in hosts.items()]
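Put together, a complete version of the function using the list comprehension might look like this sketch. The sample data is invented, and the sort relies on the assumption that hostnames (being dictionary keys) are unique, so comparing the tuples never has to look past the hostname.

```python
from typing import NewType

alertT  = NewType('alertT',  tuple[str, int])
hostAlT = NewType('hostAlT', tuple[str, list[alertT]])


def toAlertList(hosts: dict[str, list[alertT]]) -> list[hostAlT]:
    # Explicitly mark each (hostname, alerts) pair as a hostAlT.
    linear = [hostAlT(x) for x in hosts.items()]
    # Tuples sort by their first element, the hostname.
    linear.sort()
    return linear


hosts = {
    "web1": [alertT(("disk-full", 1700000000))],
    "db1": [],
}
print(toAlertList(hosts))
```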

I'd have the same problem in the actual code (instead of in the type hint checking) if I was using, for example, a namedtuple to represent a host and its alerts. Calling hosts.items() wouldn't generate objects of my named tuple type, just unnamed standard tuples.

Possibly this is a sign that I should go back through my small programs after I more or less finish them and convert this sort of casual use of tuples into namedtuple (or the type hinted version) and dataclass types. If nothing else, this would serve as more explicit documentation for future me about what those tuple fields are. I would have to give up those clever 'list(hosts.items())' conversion tricks in favour of the more explicit list comprehension version, but that's not necessarily a bad thing.

Sidebar: aNewType(...) versus typing.cast(typ, ....)

If you have a distinct NewType() and mypy is happy enough with you, both of these will cause mypy to consider your value to now be of the new type. However, they have different safety levels and restrictions. With cast(), there are no type hint checking guardrails at all; you can cast() an integer literal into an alleged string and mypy won't make a peep. With, for example, 'hostAlT(...)', mypy will apply a certain amount of compatibility checking. However, as we saw above in the 'aListT' example, mypy may still report a problem on the type change and there are certain type changes you can't get it to accept.
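The difference shows up in a sketch like this one, where 'hostName' is a hypothetical NewType over str. Neither call does any conversion or checking at runtime; the only difference is in what a type checker is willing to accept.

```python
from typing import NewType, cast

hostName = NewType('hostName', str)

# cast() has no guardrails: type checkers accept this, and the value is
# still an int afterward.
alleged = cast(str, 42)

# The NewType() call is checked: its argument must be compatible with str.
real = hostName("apps0")

print(type(alleged).__name__, type(real).__name__)
```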

As far as I know, there's no way to get mypy to temporarily switch to a structural compatibility checking here. Perhaps there are deep type safety reasons to disallow that.

Python type hints may not be for me in practice

By: cks
27 November 2024 at 03:58

Python 3 has optional type hints (and has had them for some time), and some time ago I was a bit tempted to start using some of them; more recently, I wrote a small amount of code using them. Recently I needed to write a little Python program and as I started, I was briefly tempted to try type hints. Then I decided not to, and I suspect that this is how it's going to go in the future.

The practical problem of type hints for me when writing the kind of (small) Python programs that I do today is that they necessarily force me to think about the types involved. Well, that's wrong, or at least incomplete; in practice, they force me to come up with types. When I'm putting together a small program, generally I'm not building any actual data structures, records, or the like (things that have a natural type); instead I'm passing around dictionaries and lists and sets and other basic Python types, and I'm revising how I use them as I write more of the program and evolve it. Adding type hints requires me to navigate assigning concrete types to all of those things, and then updating them if I change my mind as I come to a better understanding of the problem and how I want to approach it.

(In writing this it occurs to me that I do often know that I have distinct types (for example, for what functions return) and I shouldn't mix them, but I don't want to specify their concrete shape as dicts, tuples, or whatever. In looking through the typing documentation and trying some things, it doesn't seem like there's an obvious way to do this. Type aliases are explicitly equivalent to their underlying thing, so I can't create a bunch of different names for eg typing.Any and then expect type checkers to complain if I mix them.)

After the code has stabilized I can probably go back to write type hints (at least until I get into apparently tricky things like JSON), but I'm not sure that this would provide very much value. I may try it with my recent little Python thing just to see how much work it is. One possible source of value is if I come back to this code in six months or a year and want to make changes; typing hints could give me both documentation and guardrails given that I'll have forgotten about a lot of the code and structure by then.

(I think the usual advice is that you should write type hints as you write the program, rather than go back after the fact and try to add them, because incrementally writing them during development is easier. But my new Python programs tend to be sufficiently short that doing all of the type hints afterward isn't too much work, and if it gets me to do it at all it may be an improvement.)

PS: It might be easier to do type hints on the fly if I practiced with them, but on the other hand I write new Python programs relatively infrequently these days, making typing hints yet another Python thing I'd have to try to keep in my mind despite it being months since I used them last.

PPS: I think my ideal type hint situation would be if I could create distinct but otherwise unconstrained types for things like function arguments and function returns, have mypy or other typing tools complain when I mixed them, and then later go back to fill in the concrete implementation details of each type hint (eg, 'this is a list where each element is a ...').

What's going on with 'quit' in an interactive CPython session (as of 3.12)

By: cks
27 August 2024 at 01:33

We've probably all been there at some time or the other:

$ python
[...]
>>> quit
Use quit() or Ctrl-D (i.e. EOF) to exit

It's an infamous and frustrating 'error' message and we've probably all seen it (there's a similar one for 'exit'). Today I was reminded of this CPython behavior by a Fediverse conversation and as I was thinking about it, the penny belatedly dropped on what is going on here in CPython.

Let's start with this:

>>> type(quit)
<class '_sitebuiltins.Quitter'>

In CPython 3.12 and earlier, the CPython interactive interpreter evaluates Python statements; as far as I know, it has little to no special handling of what you type to it, it just evaluates things and then prints the result under appropriate circumstances. So 'quit' is not special syntax recognized by the interpreter, but instead a Python object. The message being printed is not special handling but instead a standard CPython interpreter feature to helpfully print the representation of objects, which the _sitebuiltins.Quitter class has customized to print this message. You can see all of this in Lib/_sitebuiltins.py, along with classes used for some other, related things.
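The mechanism can be imitated in a few lines. This is a simplified sketch, with 'MiniQuitter' as a hypothetical stand-in; the real class in Lib/_sitebuiltins.py does a bit more.

```python
class MiniQuitter:
    """A simplified imitation of the idea behind _sitebuiltins.Quitter."""

    def __init__(self, name, eof):
        self.name = name
        self.eof = eof

    def __repr__(self):
        # This is what the interactive interpreter prints when you enter
        # the bare name, because it prints the repr() of the result.
        return f"Use {self.name}() or {self.eof} to exit"

    def __call__(self, code=None):
        # Actually calling the object is what exits.
        raise SystemExit(code)


myquit = MiniQuitter("quit", "Ctrl-D (i.e. EOF)")
print(repr(myquit))
```

Entering 'myquit' alone at the interactive prompt would print the message, while 'myquit()' would exit.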

(Then the 'quit' and 'exit' instances are created and wired up in Lib/site.py, along with a number of other things.)

This is changing in Python 3.13 (via), which defaults to using a new interactive shell, which I believe is called 'pyrepl' (see Lib/_pyrepl). Pyrepl has specific support for commands like 'quit', although this support actually reuses the _sitebuiltins code (see REPL_COMMANDS in Lib/_pyrepl/simple_interact.py). Basically, pyrepl knows to call some objects instead of printing their repr() if they're entered alone on a line, so you enter 'quit' and it winds up being the same as if you'd said 'quit()'.

We may want /usr/bin/python to be Python 3 sooner than I expected

By: cks
1 August 2024 at 02:19

For historical reasons, we still have a '/usr/bin/python' that is Python 2 on our Ubuntu 22.04 machines. Yes, we know, Python 2 isn't supported any more, but our users have had more than a decade where /usr/bin/python was Python 2 and while Ubuntu continued to ship a Python 2, we didn't feel like breaking their '#!/usr/bin/python' lines in scripts by either removing /usr/bin/python or making it Python 3. That option ran out in Ubuntu 24.04, which doesn't ship any Python 2 packages and so provides no native way to have a Python 2 /usr/bin/python (you can make a symlink to your own version of Python 2, if you really insist). In my entry on the state of Python in Ubuntu 24.04, I speculated that we might wind up with /usr/bin/python existing and being Python 3 in Ubuntu 26.04. With more time and more water under the bridge, I think we're fairly likely to do that or even move faster, partly because there are forces pushing reasonably strongly in that direction.

One of the things that I've been doing is watching for things running '/usr/bin/python' on our current login servers, because those things are going to break when we start upgrading them from Ubuntu 22.04 to Ubuntu 24.04 and I'd like to warn people in advance. In doing this, I found a number of people who seemed to every now and then run '/usr/bin/python' in a VSCode environment. Now, I rather doubt that people who are using VSCode are writing Python 2 programs here in 2024. Instead, I think it's much more likely that something in their VSCode environment is invoking '/usr/bin/python' and expecting to get Python 3.

Here in 2024, I suspect that this is a perfectly reasonable expectation and almost always works out for whatever VSCode related bit is doing it. Python 2 has been unsupported for four years and probably almost all Linux systems with a /usr/bin/python have it being Python 3 (this has been the situation on Fedora Linux for some time, for example). I also suspect that most Linux systems do have a /usr/bin/python. We are the weird outliers, and as weird outliers we can expect things to not work; at first a few things, and then more things as the assumption that '/usr/bin/python' is the way you get Python 3 becomes embedded in more and more software.

(I suspect that VSCode is not the only thing doing this on our systems, merely the one that's most visible to me right now.)

Having written this entry, I'm now reconsidering our schedule. As far as I can tell, we have low usage of /usr/bin/python today, although my checks aren't necessarily comprehensive, which means that relatively few people will be affected by a change to what it is. So rather than waiting until Ubuntu 26.04 or later to make /usr/bin/python be Python 3, perhaps we should wait only six months or so after we roll out Ubuntu 24.04 before switching from having no /usr/bin/python (and any remaining people having their scripts fail to run) to having it be Python 3. The result would probably be better for both people and programs.

PS: The simple answer to why not immediately switch /usr/bin/python to Python 3 when we move to Ubuntu 24.04 is that the error messages people will get for /usr/bin/python being missing are likely to be clearer than the ones they would get from running Python 2 code under Python 3.

Understanding a Python closure oddity

By: cks
17 June 2024 at 03:31

Recently, Glyph pointed out a Python oddity on the Fediverse and I had to stare at it for a bit to understand what was going on, partly because my mind is partly thinking in Go these days, and Go has a different issue in similar code. So let's start with the code:

def loop():
    for number in range(10):
        def closure():
            return number
        yield closure

eagerly = [each() for each in loop()]
lazily = [each() for each in list(loop())]

The oddity is that 'eagerly' and 'lazily' wind up different, and why.

The first thing that is going on in this Python code is that while 'number' is only used in the for loop, it is an ordinary function local variable. We could set it before the loop and look at it after the loop if we wanted to, and if we did, it would be '9' at the end of the for loop. The consequence and the corollary is that every closure returned in the 'for' loop is using the same 'number' local variable.

(In some languages and in some circumstances, each closure would close over a different instance of 'number'; see for example this Go 1.22 change.)

Since all of the closures are using the same 'number' local variable, what matters for what value they return is when they are called. When you call any of them, it will return the value of 'number' that is in effect in the 'loop' function as of that moment. And if you call any of them after the 'loop' function has finished, 'number' has the value of '9'.

This also means that if you call a single 'each' function more than once, the value it returns can be different. For example:

>>> g = loop()
>>> each0 = g.__next__()
>>> each0()
0
>>> each1 = g.__next__()
>>> each0()
1

(What the 'loop()' call actually returns is a generator. I'm directly calling its magic method to be explicit, rather than using the more general next().)

And in a way this is the difference between 'eagerly' and 'lazily'. For 'eagerly', the list comprehension iterates through the results of 'loop()' and immediately calls each version of 'each' that it obtains, which gets the value of 'number' that is in effect right then. For 'lazily', the 'list(loop())' first collects all of the 'each' closures, which ends the 'for' loop in the 'loop' function and means 'number' is now '9', and then calls all of the 'each' closures, which all return the final value of 'number'.
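Running the code shows the two results side by side. The 'loop_bound' variant is not from Glyph's example; it's a sketch of one common idiom for giving each closure its own value, by capturing 'number' as a default argument at the time the closure is defined.

```python
def loop():
    for number in range(10):
        def closure():
            return number
        yield closure


def loop_bound():
    for number in range(10):
        # The default value is evaluated when 'closure' is defined, so each
        # closure keeps the value that 'number' had at that point.
        def closure(number=number):
            return number
        yield closure


eagerly = [each() for each in loop()]
lazily = [each() for each in list(loop())]
bound = [each() for each in list(loop_bound())]

print(eagerly)   # 0 through 9
print(lazily)    # ten 9s
print(bound)     # 0 through 9 again
```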

The 'eagerly' and 'lazily' names may be a bit confusing (they were to me). What they refer to is whether we eagerly or lazily call each closure as it is returned by 'loop()'. In 'eagerly', we call the closures immediately; in 'lazily', we call them only later, after the 'for' loop is done and 'number' has taken on its final value. As Glyph said on the Fediverse, there is another level of eagerness or laziness, which is how aggressively we iterate the generator from 'loop()', and this is actually backward from the names; in 'eagerly' we lazily iterate the generator, while in 'lazily' we eagerly iterate the generator (that's what the 'list()' does).

(I'm writing this entry partly for myself, because someday I may run into an issue like this in my own Python code. If you only use a generator with code patterns like the 'eagerly' case, an issue like this could lurk undetected for some time.)

PyPy has been quietly working for me for several years now

By: cks
30 May 2024 at 02:48

A number of years ago I switched to installing various Python programs through pipx so that each of them got their own automatically managed virtual environment, rather than me having to wrestle with various issues from alternate approaches. On our Ubuntu servers, it wound up being simpler to do this using my own version of PyPy instead of Ubuntu's CPython, for various reasons. I've been operating this way for long enough that I didn't really remember how long.

Recently we got our first cloud server, and I wound up installing our cloud provider's basic CLI tool. This CLI tool has a number of official ways of installing it, but when the dust settled I discovered it was a Python package (with a bunch of additional complicated dependencies) and this package is available on PyPI. So I decided to see if 'pipx install <x>' would work, which it did. Only much later did it occur to me that this very large Python and stuff tool was running happily under PyPy, because this is the default if I just 'pipx install' something.

As it turns out, everything I have installed through pipx on our servers is currently installed using PyPy instead of CPython, and all of it works fine. I've been running all sorts of code with PyPy for years without noticing anything different. There is definitely code that will notice (I used to have some), but either I haven't encountered any of it yet or significant packages are now routinely tested under PyPy and hardened against things like deferred garbage collection of open files.

(Some current Python idioms, such as the 'with' statement, avoid this sort of problem, because they explicitly close files and otherwise release resources as you're done with them.)
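As a small sketch of the difference, code like the following behaves the same under CPython and PyPy because it never relies on the garbage collector to close the file. The 'save_note' function and the file name are hypothetical.

```python
import os
import tempfile


def save_note(path: str, text: str) -> None:
    # The file is flushed and closed when the 'with' block exits, no matter
    # when the file object itself is eventually garbage collected.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


path = os.path.join(tempfile.mkdtemp(), "note.txt")
save_note(path, "hello\n")

# The data is reliably on disk by the time we read it back.
with open(path, encoding="utf-8") as f:
    print(repr(f.read()))
```

The version that can notice is something like 'open(path, "w").write(text)', which counts on the file object being collected (and thus closed and flushed) immediately.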

In a way there's nothing remarkable about this. PyPy's goal is to be a replacement for CPython that simply works while generally being faster. In another way, it's nice to see that PyPy has been basically completely successful in this for me, to the extent that I can forget that my pipx-installed things are all running under PyPy and that a big cloud vendor thing just worked.

The state of Python in Ubuntu 24.04 LTS

By: cks
1 May 2024 at 02:40

Ubuntu 24.04 LTS has just been released and as usual it's on our minds, although not as much so as Ubuntu 22.04 was. So once again I feel like doing a quick review of the state of Python in 24.04, as I did for 22.04. Since Fedora 40 has also just been released I'm going to throw that in too.

The big change between 22.04 and 24.04 for us is that 24.04 has entirely dropped Python 2 packages. There is no CPython 2, which has been unsupported by the main Python developers for years, but there's also no Python 2 version of PyPy, which is supported upstream and will be for a long time (cf). At the moment, the Python 2 binary .debs from Ubuntu 22.04 LTS still install and work well enough for us on Ubuntu 24.04, but the writing is on the wall there. In Ubuntu 26.04 we will likely have to compile our own Python from source (and not the .deb sources, which don't seem to readily rebuild on 24.04). It's possible that someone has a PPA with CPython 2 for 24.04; I haven't looked.

(Yes, we still care about Python 2 because we have system management scripts that have been there for fifteen years and which are written in Python 2.)

In Ubuntu 22.04, /usr/bin/python was an optional symbolic link that could point to either Python 2 or Python 3. In 24.04 it is still an optional symbolic link, but now your only option is Python 3. We've opted to have no /usr/bin/python in our 24.04 installation, so that any of our people who are still using '#!/usr/bin/python' in scripts will have them clearly break. It's possible that in a few years (for Ubuntu 26.04 LTS, if we use it) we'll start having a /usr/bin/python that points to Python 3 (or Ubuntu will make it a mandatory part of their Python 3 package). If nothing else, that would be convenient for interactive use.

Ubuntu 24.04 has Python 3.12.3, which was released this past April 9th; this is really fast work to get it into 24.04 (although since Canonical will be supporting 24.04 for up to five years, they have a bit of a motivation to start with the latest). Perhaps unsurprisingly, Fedora 40 is a bit further behind, with Python 3.12.2. Both Ubuntu 24.04 and Fedora 40 have PyPy 7.3.15. Ubuntu 24.04 only has the Python 3.9 version of PyPy 3; Fedora has both the 3.9 and 3.10 versions.

Both Ubuntu 24.04 and Fedora 40 have pipx available as a standard package. Fedora 40 has version 1.5.0; Ubuntu 24.04 is on 1.4.3. The pipx changelog suggests that this isn't a critical difference, and I'm not certain I'd notice any difference in practice.

I suspect that Fedora won't keep its minimal CPython 2 package around forever, although I don't know what their removal schedule is. Hopefully they will keep the Python 2 version of PyPy around for at least as long as the upstream PyPy supports it. Fedora has more freedom here than Ubuntu does, since a given Fedora release only has to be supported for a year or so, instead of Ubuntu 24.04 LTS's five years (or more, if you pay for extended support from Canonical).

PS: Ubuntu 24.04 has Django version 4.2.11, the latest version of the 4.2 series, which is a sensible choice since the Django 4.2 series is one of the Django project's LTS releases and so will be supported upstream until April 2026, saving Canonical some work (cf).

Please don't try to hot-reload changed Python files too often

By: cks
13 April 2024 at 02:08

There is a person running a Python program on one of our servers, which is something that people do regularly. As far as I can tell, this person's Python program is using some Python framework that supports on the fly reloading (often called hot-reloading) of changed Python code for at least some of the loaded code, and perhaps much or all of it. Naturally, in order to see if you need to hot-reload any code, you need to check whether a bunch of files have changed (at least in our environment; some environments may be able to do this slightly better). This person's Python code is otherwise almost always idle.

The particular Python code involved has decided to check for a need to hot-reload code once every second. In our NFS fileserver environment, this has caused one particular fileserver to see a constant load of about 1100 NFS RPC operations a second, purely from the Python hot-reload code rechecking what appears to be a pile of things every second. These checks are also not cheap on the machine where the code is running; this particular process routinely uses about 7% to 8% of one CPU as it's sitting there otherwise idle.

(There was a time when you didn't necessarily care about CPU usage on otherwise idle machines. In these days of containerization and packing multiple services on one machine and renting the smallest and thus cheapest VPS you can get away with, there may be no such thing as a genuinely idle machine, and all CPU usage is coming from somewhere.)

To be fair, it's possible that the program is being run in some sort of development mode, where fast hot-reload can be potentially important. But people do run 'development mode' in more or less production, and it's possible to detect that. It would be nice if hot-reload code made some efforts to detect that, and perhaps also some efforts to detect when things were completely idle and there had been no detected changes for a long time and it should dial back the frequency of hot-reload checks. But I'm probably tilting at windmills.

(I also think that you should provide some sort of option to set the hot-reload frequency, because people are going to want to do this sooner or later. You should do this even if you only try to do hot reloading in development mode, because sooner or later people are going to run your development mode in pseudo-production because that's the easiest way for them.)
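To make the idea concrete, here is a rough sketch of a change checker with a settable base interval that backs off while nothing changes. This isn't taken from any real framework; the class name, parameters, and defaults are all invented for illustration.

```python
import os


class ReloadPoller:
    """Track modification times of watched files and pick the next check delay."""

    def __init__(self, paths, base_interval=1.0, max_interval=60.0, backoff=2.0):
        self.paths = list(paths)
        self.base_interval = base_interval  # delay to use right after a change
        self.max_interval = max_interval    # never wait longer than this
        self.backoff = backoff              # growth factor while things are idle
        self.interval = base_interval
        self._mtimes = self._snapshot()

    def _snapshot(self):
        mtimes = {}
        for path in self.paths:
            try:
                mtimes[path] = os.stat(path).st_mtime_ns
            except OSError:
                mtimes[path] = None  # missing or unreadable file
        return mtimes

    def check(self):
        """Return the paths that changed and update the delay before the next check."""
        current = self._snapshot()
        changed = [p for p in self.paths if current[p] != self._mtimes[p]]
        self._mtimes = current
        if changed:
            # Someone is actively editing, so go back to fast checks.
            self.interval = self.base_interval
        else:
            # Nothing happened; wait longer next time, up to the cap.
            self.interval = min(self.interval * self.backoff, self.max_interval)
        return changed
```

A reload loop would sleep for poller.interval, call poller.check(), and reload whatever comes back. With these made-up defaults, an idle program settles down to one round of stat() calls a minute instead of one a second.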

PS: These days this also applies to true development mode usage of things. People can easily step away from their development environment for meetings or whatever, and they may well be running it on their laptop, where they would like you to not burn up their battery constantly. Just because someone has a development mode environment running doesn't mean they're actively using it right now.

Platform peculiarities and Python (with an example)

By: cks
25 March 2024 at 02:53

I have a long standing little Python tool to turn IP addresses into verified hostnames and report what's wrong if it can't do this (doing verified reverse DNS lookups is somewhat complicated). Recently I discovered that socket.gethostbyaddr() on my Linux machines was only returning a single name for an IP address that was associated with more than one. A Fediverse thread revealed that this reproduced for some people, but not for everyone, and that it also happened in other programs.

The Python socket.gethostbyaddr() documentation doesn't discuss specific limitations like this, but the overall socket documentation does say that the module is basically a layer over the platform's C library APIs. However, it doesn't document exactly what APIs are used, and in this case it matters. Glibc on Linux says that gethostbyaddr() is deprecated in favour of getnameinfo(), so a C program like CPython might reasonably use either to implement its gethostbyaddr(). The C gethostbyaddr() supports returning multiple names (at least in theory), but getnameinfo() specifically does not; it only ever returns a single name.

In practice, the current CPython on Linux will normally use gethostbyaddr_r() (see Modules/socketmodule.c's socket_gethostbyaddr()). This means that CPython isn't restricted to returning a single name and is instead inheriting whatever peculiarities glibc has (or another libc, for people on Linux distributions that use an alternative libc). On glibc, it appears that this behavior depends on what NSS modules you're using, with the default glibc 'dns' NSS module not seeming to normally return multiple names this way, even for glibc APIs where this is possible.

Given all of this, it's not surprising that the CPython documentation doesn't say anything specific. There's not very much specific it can say, since the behavior varies in so many peculiar ways (and has probably changed over time). However, this does illustrate that platform peculiarities are visible through CPython APIs, for better or worse (and, like me, you may not even be aware of those peculiarities until you encounter them). If you want something that is certain to bypass platform peculiarities, you probably need to do it yourself (in this case, probably with dnspython).

(The Go documentation for a similar function does specifically say that if it uses the C library it returns at most one result, but that's because the Go authors know their function calls getnameinfo() and as mentioned, that can only return one name (at most).)
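For illustration, here is a minimal sketch of a verified reverse lookup built on the socket module. It is not my actual tool; the function name and the injectable 'reverse' and 'forward' parameters are invented so the logic can be exercised without real DNS, and socket.gethostbyname_ex() limits it to IPv4. Whatever names the platform hands back in the alias list get checked too, which is exactly where the platform differences show up.

```python
import socket


def verified_names(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return the names reported for ip whose forward lookup includes ip again."""
    try:
        # gethostbyaddr() returns (hostname, aliaslist, ipaddrlist); how many
        # names show up in aliaslist depends on the platform's C library.
        primary, aliases, _addrs = reverse(ip)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return []
    verified = []
    for name in [primary, *aliases]:
        try:
            _name, _aliases, addrs = forward(name)
        except OSError:
            continue  # this name does not resolve at all
        if ip in addrs and name not in verified:
            verified.append(name)
    return verified
```

An empty result means either that there was no reverse DNS at all or that none of the names mapped back to the original IP address; a real tool would want to tell those cases apart and report which one it was.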

From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps

For over 15 years, the Wikimedia Foundation has provided public dumps of the content of all wikis. They are not only useful for archiving or offline reader projects, but can also power tools for semi-automated (or bot) editing such as AutoWikiBrowser. For example, these tools comb through the dumps to generate lists of potential spelling mistakes in articles for editors to fix. For researchers, the dumps have become an indispensable data resource (footnote: Google Scholar lists more than 16,000 papers mentioning the word “Wikipedia dumps”). Especially in the area of natural language processing, the use of Wikipedia dumps has become almost ubiquitous with the advancement of large language models such as GPT-3 (and thus by extension also the recently published ChatGPT) or BERT. Virtually all language models are trained on Wikipedia content, especially multilingual models which rely heavily on Wikipedia for many lower-resourced languages.

Over time, the research community has developed many tools to help folks who want to use the dumps. For instance, the mwxml Python library helps researchers work with the large XML files and iterate through the articles within them. Before analyzing the content of the individual articles, researchers must usually further preprocess them, since they come in wikitext format. Wikitext is the markup language used to format the content of a Wikipedia article in order to, for example, highlight text in bold or add links. In order to parse wikitext, the community has built libraries such as mwparserfromhell, developed over 10 years and comprising almost 10,000 lines of code. This library provides an easy interface to identify different elements of an article, such as links, templates, or just the plain text. This ecosystem of tooling lowers the technical barriers to working with the dumps because users do not need to know the details of XML or wikitext.

While convenient, there are severe drawbacks to working with the XML dumps containing articles in wikitext. In fact, MediaWiki translates wikitext into HTML which is then displayed to the readers. Thus, some elements contained in the HTML version of the article are not readily available in the wikitext version; for example, due to the use of templates. As a result, researchers who parse only wikitext might miss important content that is displayed to readers. For example, a study by Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext version.

Therefore, it is often desirable to work with HTML versions of the articles instead of using the wikitext versions. In practice, though, this has remained largely impossible for researchers. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects. Only recently, the Wikimedia Enterprise HTML dumps have been introduced and made publicly available with regular monthly updates so that researchers or anyone else may use them in their work.

However, while the data is available, working with it still requires a lot of technical expertise from researchers, such as knowing how different elements from wikitext get parsed into HTML elements. In order to lower the technical barriers and improve the accessibility of this incredible resource, we released the first version of mwparserfromhtml, a library that makes it easy to parse the HTML content of Wikipedia articles, inspired by the wikitext-oriented mwparserfromhell.

Figure 1. Examples of different types of elements that mwparserfromhtml can extract from an article

The tool is written in Python and available as a pip-installable package. It provides two main functionalities. First, it allows the user to access all articles in the dump files one by one in an iterative fashion. Second, it contains a parser for the individual HTML of the article. Using the Python library beautifulsoup, we can parse the content of the HTML and extract individual elements (see Figure 1 for examples):

  • Wikilinks (or internal links). These are annotated with additional information about the namespace of the target link or whether it is a disambiguation page, redirect, red link, or interwiki link.
  • External links. We distinguish whether it is named, numbered, or autolinked.
  • Categories
  • Templates
  • References
  • Media. We capture the type of media (image, audio, or video) as well as the caption and alt text (if applicable).
  • Plain text of the articles

We also extract some properties of the elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.
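To give a rough feel for this kind of extraction, the sketch below collects three of the link types from a small HTML fragment. It is not mwparserfromhtml's API: it uses the standard library's html.parser as a stand-in for beautifulsoup, the class and function names are hypothetical, and the rel values (mw:WikiLink, mw:ExtLink, mw:PageProp/Category) are assumptions about how the dumps mark up links rather than something stated in this post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets from article HTML, grouped by (assumed) link type."""

    def __init__(self):
        super().__init__()
        self.wikilinks = []
        self.external_links = []
        self.categories = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated values, e.g. "mw:ExtLink nofollow".
        rels = (attributes.get("rel") or "").split()
        if href is None:
            return
        if tag == "a" and "mw:WikiLink" in rels:
            self.wikilinks.append(href)
        elif tag == "a" and any(r.startswith("mw:ExtLink") for r in rels):
            self.external_links.append(href)
        elif tag == "link" and "mw:PageProp/Category" in rels:
            self.categories.append(href)


def collect_links(html):
    """Parse an HTML string and return the populated collector."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser


# Hand-written fragment for illustration; real dump HTML is far richer.
SAMPLE = (
    '<p><a rel="mw:WikiLink" href="./Python_(programming_language)">Python</a> '
    '<a rel="mw:ExtLink nofollow" href="https://example.org/">site</a></p>'
    '<link rel="mw:PageProp/Category" href="./Category:Examples"/>'
)
```

Calling collect_links(SAMPLE) fills the three lists with one target each. Details such as namespaces, redirects, or transclusion are not handled by this sketch.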

Building the tool posed several challenges. First, it remains difficult to systematically test the output of the tool. While we can verify that we are correctly extracting the total number of links in an article, there is no “right” answer for what the plain text of an article should include. For example, should image captions or lists be included? We manually annotated a handful of example articles in English to evaluate the tool’s output, but it is almost certain that we have not captured all possible edge cases. In addition, other language versions of Wikipedia might provide other elements or patterns in the HTML than the tool currently expects. Second, while much of how an article is parsed is handled by the core of MediaWiki and well documented by the Wikimedia Foundation Content Transform Team and the editor community on English Wikipedia, article content can also be altered by wiki-specific Extensions. This includes important features such as citations, and documentation about some of these aspects can be scarce or difficult to track down.

The current version of mwparserfromhtml is a starting point. There are still many functionalities that we would like to add in the future, such as extracting tables, splitting the plain text into sections and paragraphs, or handling in-line templates used for unit conversion (for example displaying lbs and kg). If you have suggestions for improvements or would like to contribute, please reach out to us on the repository, and file an issue or submit a merge request.

Finally, we want to acknowledge that the project was started as part of an Outreachy internship with the Wikimedia Foundation. We encourage folks to consider mentoring or applying to the Outreachy program as appropriate.


About this post

Featured image credit: Очистка ртути перегонкой в токе газа.png in the public domain

Figure 1 image credit: Mwparserfromhtml functionality.gif by Isaac (WMF) licensed under the Creative Commons Attribution-Share Alike 4.0 International license
