
Using PyPy (or thinking about it) exposed a bug in closing files

By: cks
2 March 2025 at 03:20

Over on the Fediverse, I said:

A fun Python error some code can make and not notice until you run it under PyPy is a function that has 'f.close' at the end instead of 'f.close()' where f is an open()'d file.

(Normal CPython will immediately close the file when the function returns due to refcounted GC. PyPy uses non-refcounted GC so the file remains open until GC happens, and so you can get too many files open at once. Not explicitly closing files is a classic PyPy-only Python bug.)

When a Python file object is garbage collected, Python arranges to close the underlying C level file descriptor if you didn't already call .close(). In CPython, garbage collection is deterministic and generally prompt; for example, when a function returns, all of its otherwise unreferenced local variables will be garbage collected as their reference counts drop to zero. However, PyPy doesn't use reference counting for its garbage collection; instead, like Go, it only collects garbage periodically, and so will only close files as a side effect some time later. This can make it easy to build up a lot of open files that aren't doing anything, and possibly run your program out of available file descriptors, something I've run into in the past.

I recently wanted to run a hacked up version of a NFS monitoring program written in Python under PyPy instead of CPython, so it would run faster and use less CPU on the systems I was interested in. Since I remembered this PyPy issue, I found myself wondering if it properly handled closing the file(s) it had to open, or if it left it to CPython garbage collection. When I looked at the code, what I found can be summarized as 'yes and no':

def parse_stats_file(filename):
  [...]
  f = open(filename)
  [...]
  f.close

  return ms_dict

Because I was specifically looking for uses of .close(), the lack of the '()' immediately jumped out at me (and got fixed in my hacked version).

It's easy to see how this typo could linger undetected in CPython. The line 'f.close' itself does nothing but isn't an error, and then 'f' is implicitly closed on the next line, as part of the 'return', so even if you're looking at this program's file descriptor usage while it's running you won't see any leaks.
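
The robust fix (and the modern idiom) is a 'with' statement, which closes the file deterministically when the block exits, no matter which garbage collector you're running under. A minimal sketch of the same function shape (the parsing in the middle is elided, as in the original):

def parse_stats_file(filename):
  ms_dict = {}
  with open(filename) as f:
    for line in f:
      # ... parse each line into ms_dict here ...
      pass
  # f is guaranteed closed at this point, even under PyPy,
  # with no explicit (and typo-prone) f.close() call at all.
  return ms_dict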

(I'm not entirely a fan of nondeterministic garbage collection, at least in the context of Python, where deterministic GC was a long standing feature of the language in practice.)

Providing pseudo-tags in DWiki through a simple hack

By: cks
10 February 2025 at 03:56

DWiki is the general filesystem based wiki engine that underlies this blog, and for various reasons having to do with how old it is, it lacks a number of features. One of the features that I've wanted for more than a decade has been some kind of support for attaching tags to entries and then navigating around using them (although doing this well isn't entirely easy). However, it was always a big feature, both in implementing external files of tags and in tagging entries, and so I never did anything about it.

Astute observers of Wandering Thoughts may have noticed that some years ago, it acquired some topic indexes. You might wonder how this was implemented if DWiki still doesn't have tags (and the answer isn't that I manually curate the lists of entries for each topic, because I'm not that energetic). What happened is that when the issue was raised in a comment on an entry, I realized that I sort of already had tags for some topics because of how I formed the 'URL slugs' of entries (which are their file names). When I wrote about some topics, such as Prometheus, ZFS, or Go, I'd almost always put that word in the wikiword that became the entry's file name. This meant that I could implement a low rent version of tags simply by searching the (file) names of entries for words that matched certain patterns. This was made easier because I already had code to obtain the general list of file names of entries since that's used for all sorts of things in a blog (syndication feeds, the front page, and so on).

That this works as well as it does is a result of multiple quirks coming together. DWiki is a wiki so I try to make entry file names be wikiwords, and because I have an alphabetical listing of all entries that I look at regularly, I try to put relevant things in the file name of entries so I can find them again and all of the entries about a given topic sort together. Even in a file based blog engine, people don't necessarily form their file names to put a topic in them; you might make the file name be a slug-ized version of the title, for example.

(The actual implementation allows for both positive and negative exceptions. Not all of my entries about Go have 'Go' as a word, and some entries with 'Go' in their file name aren't about Go the language, for example.)
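
Mechanically, the whole thing is just a filter over the list of entry file names, with per-topic include and exclude lists for those exceptions. A minimal sketch of the idea (the names and patterns here are invented for illustration, not DWiki's actual code):

import re

# hypothetical configuration for one topic
GO_PATTERN = re.compile(r'\bGo[A-Z]')   # 'Go' as a wikiword component
GO_EXTRA = {'WhyVersionedAPIs'}         # entries to force-include anyway
GO_EXCLUDE = {'GoneFishing'}            # false positives to drop

def go_entries(entry_names):
    matched = {e for e in entry_names if GO_PATTERN.search(e)}
    return sorted((matched | GO_EXTRA) - GO_EXCLUDE)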

Since the implementation is a hack that doesn't sit cleanly within DWiki's general model of the world, it has some unfortunate limitations (so far, although fixing them would require more hacks). One big one is that as far as the rest of DWiki is concerned, these 'topic' indexes are plain pages with opaque text that's materialized through internal DWikiText rendering. As such, they don't (and can't) have Atom syndication feeds, the way proper fully supported tags would (and you can't ask for 'the most recent N Go entries', and so on; basically there are no blog-like features, because they all require directories).

One of the lessons I took from the experience of hacking pseudo-tag support together was that as usual, sometimes the perfect (my image of nice, generalized tags) is the enemy of the good enough. My solution for Prometheus, ZFS, and Go as topics isn't at all general, but it works for these specific needs and it was easy to put together once I had the idea. Another lesson is that sometimes you have more data than you think, and you can do a surprising amount with it once you realize this. I could have implemented these simple tags years before I did, but until the comment gave me the necessary push I just hadn't thought about using the information that was already in entry names (and that I myself used when scanning the list).

A change in the handling of PYTHONPATH between Python 3.10 and 3.12

By: cks
22 January 2025 at 03:40

Our long time custom for installing Django for our Django based web application was to install it with 'python3 setup.py install --prefix /some/where', and then set a PYTHONPATH environment variable that pointed to /some/where/lib/python<ver>/site-packages. Up through at least Python 3.10 (in Ubuntu 22.04), you could start Python 3 and then successfully do 'import django' with this; in fact, it worked on different Python versions if you were pointing at the same directory tree (in our case, this directory tree lives on our NFS fileservers). In our Ubuntu 24.04 version of Python 3.12 (which also has the Ubuntu packaged setuptools installed), this no longer works, which is inconvenient to us.

(It also doesn't seem to work in Fedora 40's 3.12.8, so this probably isn't something that Ubuntu 24.04 broke by using an old version of Python 3.12, unlike last time.)

The installed site-packages directory contains a number of '<package>.egg' directories, a site.py file that I believe is generic, and an easy-install.pth that lists the .egg directories. In Python 3.10, strace says that Python 3 opens site.py and then easy-install.pth during startup, and then in a running interpreter, 'sys.path' contains the .egg directories. In Python 3.12, none of this happens, although CPython does appear to look at the overall 'site-packages' directory and 'sys.path' contains it, as you'd expect. Manually adding the .egg directories to a 3.12 sys.path appears to let 'import django' work, although I don't know if everything is working correctly.
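
For the record, the manual workaround is simple; it's something like this (the prefix is a stand-in for our real one, and I don't promise this reproduces the exact sys.path ordering that easy-install.pth used to produce):

import glob
import sys

sitedir = '/some/where/lib/python3.12/site-packages'
for egg in sorted(glob.glob(sitedir + '/*.egg')):
    if egg not in sys.path:
        sys.path.append(egg)

import django   # now resolves via the .egg directories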

I looked through the 3.11 and 3.12 "what's new" documentation (3.11, 3.12) but couldn't find anything obvious. I suspect that this is related to the removal of distutils in 3.12, but I don't know enough to say for sure.

(Also, if I use our usual Django install process, the Ubuntu 24.04 Python 3.12 installs Django in a completely different directory setup than in 3.10; it now winds up in <top level>/local/lib/python3.12/dist-packages. Using 'pip install --prefix ...' does create something where pointing PYTHONPATH at the 'dist-packages' subdirectory appears to work. There's also 'pip install --target', which I'd forgotten about until I stumbled over my old entry.)

All of this makes it even more obvious to me than before that the Python developers expect everyone to use venvs and anything else is probably going to be less and less well supported in the future. Installing system-wide is probably always going to work, and most likely also 'pip install --user', but I'm not going to hold my breath for anything else.

(On Ubuntu 24.04, obviously we'll have to move to a venv based Django installation. Fortunately you can use venvs with programs that are outside the venv.)

Some stuff about how Apache's mod_wsgi runs your Python apps (as of 5.0)

By: cks
17 January 2025 at 04:13

We use mod_wsgi to host our Django application, but if I understood the various mod_wsgi settings for how to run your Python WSGI application when I originally set it up, I've forgotten it all since then. Due to recent events, exactly how mod_wsgi runs our application and what we can control about that is now quite relevant, so I spent some time looking into things and trying to understand settings. Now it's time to write all of this down before I forget it (again).

Mod_wsgi can run your WSGI application in two modes, as covered in the quick configuration guide part of its documentation: embedded mode, which runs a Python interpreter inside a regular Apache process, and daemon mode, where one or more Apache processes are taken over by mod_wsgi and used exclusively to run WSGI applications. Normally you want to use daemon mode, and you have to use daemon mode if you want to do things like run your WSGI application as a Unix user other than the web server's normal user or use packages installed into a Python virtual environment.

(Running as a separate Unix user puts some barriers between your application's data and a general vulnerability that gives the attacker read and/or write access to anything the web server has access to.)

To use daemon mode, you need to configure one or more daemon processes with WSGIDaemonProcess. If you're putting packages (such as Django) into a virtual environment, you give an appropriate 'python-home=' setting here. Your application itself doesn't have to be in this venv. If your application lives outside your venv, you will probably want to set either or both of 'home=' and 'python-path=' to, for example, its root directory (especially if it's a Django application). The corollary to this is that any WSGI application that uses a different virtual environment, or 'home' (starting current directory), or Python path needs to be in a different daemon process group. Everything that uses the same process group shares all of those.

To associate a WSGI application or a group of them with a particular daemon process, you use WSGIProcessGroup. In simple configurations you'll have WSGIDaemonProcess and WSGIProcessGroup right next to each other, because you're defining a daemon process group and then immediately specifying that it's used for your application.

Within a daemon process, WSGI applications can run in either the main Python interpreter or a sub-interpreter (assuming that you don't have sub-interpreter specific problems). If you don't set any special configuration directive, each WSGI application will run in its own sub-interpreter and the main interpreter will be unused. To change this, you need to set something for WSGIApplicationGroup, for instance 'WSGIApplicationGroup %{GLOBAL}' to run your WSGI application in the main interpreter.

Some WSGI applications can cohabit with each other in the same interpreter (where they will potentially share various bits of global state). Other WSGI applications are one to an interpreter, and apparently Django is one of them. If you need your WSGI application to have its own interpreter, there are two ways to achieve this; you can either give it a sub-interpreter within a shared daemon process, or you can give it its own daemon process and have it use the main interpreter in that process. If you need different virtual environments for each of your WSGI applications (or different Unix users), then you'll have to use different daemon processes and you might as well have everything run in their respective main interpreters.

(After recent experiences, my feeling is that processes are probably cheap and sub-interpreters are a somewhat dark corner of Python that you're probably better off avoiding unless you have a strong reason to use them.)

You normally specify your WSGI application to run (and what URL it's on) with WSGIScriptAlias. WSGIScriptAlias normally infers both the daemon process group and the (sub-interpreter) 'application group' from its context, but you can explicitly set either or both. As the documentation notes (now that I'm reading it):

If both process-group and application-group options are set, the WSGI script file will be pre-loaded when the process it is to run in is started, rather than being lazily loaded on the first request.

I'm tempted to deliberately set these to their inferred values simply so that we don't get any sort of initial load delay the first time someone hits one of the exposed URLs of our little application.

For our Django application, we wind up with a collection of directives like this (in its virtual host):

WSGIDaemonProcess accounts ....
WSGIProcessGroup accounts
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias ...

(This also needs a <Directory> block to allow access to the Unix directory that the WSGIScriptAlias 'wsgi.py' file is in.)

If we added another Django application in the same virtual host, I believe that the simple update to this would be to add:

WSGIDaemonProcess secondapp ...
WSGIScriptAlias ... process-group=secondapp application-group=%{GLOBAL}

(Plus the <Directory> permissions stuff.)

Otherwise we'd have to mess around with setting the WSGIProcessGroup and WSGIApplicationGroup on a per-directory basis for at least the new application. If we specify them directly in WSGIScriptAlias we can skip that hassle.

(We didn't used to put Django in a venv, but as of Ubuntu 24.04, using a venv seems the easiest way to get a particular Django version into some spot where you can use it. Our Django application doesn't live inside the venv, but we need to point mod_wsgi at the venv so that our application can do 'import django.<...>' and have it work. Multiple Django applications could all share the venv, although they'd have to use different WSGIDaemonProcess settings, or at least different names with the same other settings.)

(Multiple) inheritance in Python and implicit APIs

By: cks
16 January 2025 at 04:16

The ultimate cause of our mystery with Django on Ubuntu 24.04 is that versions of Python 3.12 before 3.12.5 have a bug where builtin types in sub-interpreters get unexpected additional slot wrappers (also), and Ubuntu 24.04 has 3.12.3. Under normal circumstances, 'list' itself doesn't have a '__str__' method but instead inherits it from 'object', so if you have a class that inherits from '(list,YourClass)' and YourClass defines a __str__, the YourClass.__str__ is what gets used. In a sub-interpreter, there is a list.__str__ and suddenly YourClass.__str__ isn't used any more.

(mod_wsgi triggers this issue because in a straightforward configuration, it runs everything in sub-interpreters.)

This was an interesting bug, and one of the things it made me realize is that the absence of a __str__ method on 'list' itself had implicitly become part of list's API. Django had set up class definitions that were 'class Something(..., list, AMixin)', where the 'AMixin' had a direct __str__ method, and Django expected that to work. This only works as long as 'list' doesn't have its own __str__ method and instead gets it through inheritance from object.__str__. Adding such a method to 'list' would break Django and anyone else counting on this behavior, making the lack of the method an implicit API.
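
You can see the implicit API in action with a toy version of the same class layout (this is an illustration, not Django's actual classes):

class AMixin:
    def __str__(self):
        return "rendered nicely"

class Something(list, AMixin):
    pass

# Normally list has no __str__ of its own ('__str__' not in list.__dict__),
# so the attribute lookup falls through to AMixin:
print(str(Something()))   # "rendered nicely"
# In an affected sub-interpreter, list gains its own __str__, which comes
# earlier in the MRO than AMixin, and this prints "[]" instead.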

(You can get this behavior with more or less any method that people might want to override in such a mixin class, but Python's special methods are probably especially prone to it.)

Before I ran into this issue, I probably would have assumed that where in the class tree a special method like __str__ was implemented was simply an implementation detail, not something that was visible as part of a class's API. Obviously, I would have been wrong. In Python, you can tell the difference and quite easily write code that depends on it, code that was presumably natural to experienced Python programmers.

(Possibly the existence of this implicit API was obvious to experienced Python programmers, along with the implication that various builtin types that currently don't have their own __str__ can't be given one in the future.)

A mystery with Django under Apache's mod_wsgi on Ubuntu 24.04

By: cks
14 January 2025 at 04:10

We have a long standing Django web application that these days runs under Python 3 and a more modern version of Django. For as long as it has existed, it's had some forms that were rendered to HTML through templates, and it has rendered errors in those forms in what I think of as the standard way:

{{ form.non_field_errors }}
{% for field in form %}
  [...]
  {{ field.errors }}
  [...]
{% endfor %}

This web application runs in Apache using mod_wsgi, and I've recently been working on moving the host this web application runs on to Ubuntu 24.04 (still using mod_wsgi). When I stood up a test virtual machine and looked at some of these HTML forms, what I saw was that when there were no errors, each place that errors would be reported was '[]' instead of blank. This did not happen if I ran the web application on the same test machine in Django's 'runserver' development testing mode.

At first I thought that this was something to do with locales, but the underlying cause is much more bizarre and inexplicable to me. The template operation for form.non_field_errors results in calling Form.non_field_errors(), which returns a django.forms.utils.ErrorList object (which is also what field.errors winds up being). This class is a multiple-inheritance subclass of UserList, list, and django.forms.utils.RenderableErrorMixin. The latter is itself a subclass of django.forms.utils.RenderableMixin, which defines its __str__() special method to be RenderableMixin.render(), which renders the error list properly, including rendering it as a blank if the error list is empty.

In every environment except under Ubuntu 24.04's mod_wsgi, ErrorList.__str__ is RenderableMixin.render and everything works right for things like 'form.non_field_errors' and 'field.errors'. When running under Ubuntu 24.04's mod_wsgi, and only then, ErrorList.__str__ is actually the standard list.__str__, so empty lists render as '[]' (and had I tried to render any forms with actual error reports, worse probably would have happened, especially since list.__str__ isn't carefully escaping special HTML characters).

I have no idea why this is happening in the 24.04 mod_wsgi. As far as I can tell, the method resolution order (MRO) for ErrorList is the same under mod_wsgi as outside it, and sys.path is the same. The RenderableErrorMixin class is getting included as a parent of ErrorList, which I can tell because RenderableMixin also provides a __html__ definition, and ErrorList.__html__ exists and is correct.
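
For anyone who wants to poke at this sort of thing, the relevant checks are simple enough to run both under mod_wsgi and outside it and compare the output (a sketch, assuming a Django install where you're running it):

import sys
from django.forms.utils import ErrorList

print(sys.path)
print([c.__name__ for c in ErrorList.__mro__])
print(ErrorList.__str__)    # which implementation actually won
print(ErrorList.__html__)   # here, present and correct in both environments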

The workaround for this specific situation is to explicitly render errors to some format instead of counting on the defaults; I picked .as_ul(), because this is what we've normally gotten so far. However, the whole thing makes me nervous, since I don't understand what's special about the Ubuntu 24.04 mod_wsgi and who knows if other parts of Django are affected by this.

(The current Django and mod_wsgi setup is running from a venv, so it should also be fully isolated from any Ubuntu 24.04 system Python packages.)

(This elaborates on a grumpy Fediverse post of mine.)

Two views of Python type hints and catching bugs

By: cks
23 December 2024 at 04:03

I recently wrote a little Python program where I ended up adding type hints, an experience that I eventually concluded was worth it overall even if it was sometimes frustrating. I recently fixed a small bug in the program; like many of my bugs, it was a subtle logic bug that wasn't caught by typing (and I don't think it would have been caught by any reasonable typing).

One view you could take of type hints is that they often don't catch any actual bugs, and so you can question their worth (when viewed only from a bug catching perspective). Another view, one that I'm more inclined to, is that type hints sweep away the low hanging fruit of bugs. A type confusion bug is almost always found pretty fast when you try to use the code, because your code usually doesn't work at all. However, using type hints and checking them provides early and precise detection of these obvious bugs, so you get rid of them right away, before they take up your time as you try to work out why this object doesn't have the methods or fields that you expect.

("Type hints", which is to say documenting what types are used where for what, also have additional benefits, such as accurate documentation and enabling type based things in IDEs, LSP servers, and so on.)

So although my use of type hints and mypy didn't catch this particular logic oversight, my view of them remains positive. And type hints did help me make sure I wasn't adding an obvious bug when I fixed this issue (my fix required passing an extra argument to something, creating an opportunity for a bit of type confusion if I got the arguments wrong).

Sidebar: my particular non-type bug

This program reports the current, interesting alerts from our Prometheus metrics system. For various reasons, it supports getting the alerts as of some specific time, not just 'now', and it also filters out some alerts when they aren't old enough. My logic bug was with the filtering; in order to compute the age of an alert, I did:

age = time.time() - alert_started_at

The logic problem is that when I'm getting the alerts at a particular time instead of 'now', I also want to compute the age of the alert as of that time, not as of 'right now'. So I don't want 'time.time()', I want 'as of the logical time when we're obtaining this information'.
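
In other words, the age computation needs to be relative to the logical query time. A sketch of the fix, where 'as_of' is a hypothetical parameter carrying that logical time:

import time

def alert_age(alert_started_at, as_of=None):
    # 'as_of' is the logical time we're asking about; only fall back
    # to the real 'now' when no specific time was requested.
    if as_of is None:
        as_of = time.time()
    return as_of - alert_started_at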

(This sort of logic oversight is typical for non-obvious bugs that linger in my programs after they're basically working. I only noticed it because I was adding a new filter, and needed to get the alerts as of a time when what I wanted to filter out was happening.)

Python type hints are probably "worth it" in the large for me

By: cks
30 November 2024 at 04:07

I recently added type hints to a little program, an experience that wasn't entirely positive and left me feeling that maybe I shouldn't bother. Because I don't promise to be consistent, I went back and re-added type hints to the program all over again, starting from the non-hinted version. This time I did the type hints rather differently and the result came out well enough that I'm going to keep it.

Perhaps my biggest change was to entirely abandon NewType(). Instead I set up two NamedTuples and used type aliases for everything else, which amounts to three type aliases in total. Since I was using type aliases anyway, I only added them when it was annoying to enter the real type (and I was doing it often enough). I skipped doing a type alias for 'list[namedTupleType]' because I couldn't come up with a name that I liked well enough, and because the fact that it's a list is fundamental to how the code interacts with it, so I didn't feel like obscuring that.

Adding type hints 'for real' had the positive aspect of encouraging me to write a bunch of comments about what things were and how they worked, which will undoubtedly help future me when I want to change something in six months. Since I was using NamedTuples, I changed to accessing the elements of the tuples through the names instead of the indexes, which improved the code. I had to give up 'list(adict.items())' in favour of a list comprehension that explicitly created the named tuple, but this is probably a good thing for the overall code quality.
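
Concretely, the change was along these lines (the names here are stand-ins for my actual ones, with 'adict' being the dictionary involved):

from typing import NamedTuple

class HostAlerts(NamedTuple):
    host: str
    alerts: list[tuple[str, int]]

adict: dict[str, list[tuple[str, int]]] = {'ahost': [('DiskFull', 1700000000)]}

# before: linear = list(adict.items())
# after, naming the fields and explicitly creating the named tuples:
linear = [HostAlerts(host, alerts) for host, alerts in adict.items()]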

(I also changed the type of one thing I had as 'int' to a float, which is what it really should have been all along even if all of the normal values were integers.)

Overall, I think I've come around to the view that doing all of this is good for me in the same way that using shellcheck is good for my shell scripts, even if I sometimes roll my eyes at things it says. I also think that just making mypy silent isn't the goal I should be aiming for. Instead, I should be aiming for what I did to my program on this second pass, doing things like introducing named tuples (in some form), adding comments, and so on. Adding final type hints should be a prompt for a general cleanup.

(Perhaps I'll someday get to a point where I add basic type hints as I write the code initially, just to codify my belief about the shape of what I'm returning and passing in, and use them to find my mistakes. But that day is probably not today, and I'll probably want better LSP integration for it in my GNU Emacs environment.)

Some notes on my experiences with Python type hints and mypy

By: cks
28 November 2024 at 04:35

As I thought I might, today I spent some time adding full and relatively honest type hints to my recent Python program. The experience didn't go entirely smoothly and it left me with a number of learning experiences and things I want to note down in case I ever do this again. The starting point is that my normal style of coding small programs is to not make classes to represent different sorts of things and instead use only basic built in collection types, like lists, tuples, dictionaries, and so on. When you use basic types this way, it's very easy to pass or return the wrong 'shape' of thing (I did it once in the process of writing my program), and I'd like Python type hints to be able to tell me about this.

(The first note I want to remember is that mypy becomes very irate at you in obscure ways if you ever accidentally reuse the same (local) variable name for two different purposes with two different types. I accidentally reused the name 'data', using it first for a str and second for a dict that came from an 'Any' typed object, and the mypy complaints were hard to decode; I believe it complained that I couldn't index a str with a str on a line where I did 'data["key"]'.)

When you work with data structures created from built in collections, you can wind up with long, tangled compound type names, like 'tuple[str, list[tuple[str, int]]]' (which is a real type in my program). These are annoying to keep typing and easy to make mistakes with, so Python type hints provide two ways of giving them short names, in type aliases and typing.NewType. These look almost the same:

# type alias:
type hostAlertsA = tuple[str, list[tuple[str, int]]]

# NewType() (needs 'from typing import NewType'):
hostAlertsT = NewType('hostAlertsT', tuple[str, list[tuple[str, int]]])

The problem with type aliases is that they are aliases. All aliases for a type are considered to be the same, and mypy won't warn if you call a function that expects one with a value that was declared to be another. Suppose you have two sorts of strings, ones that are a host name and ones that are an alert name, and you would like to keep them straight. Suppose that you write:

# simple type aliases
type alertName = str
type hostName = str

def manglehost(hname: hostName) -> hostName:
  [....]

Because these are only type aliases and because all type aliases are treated as the same, you have not achieved your goal of keeping yourself from confusing host and alert names when you call 'manglehost()'. In order to do this, you need to use NewType(), at which point mypy will complain (and also often force you to explicitly mark bare strings as one or the other, with 'alertName(yourstr)' or 'hostName(yourstr)').
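
To illustrate, here is the NewType version of the same thing, with a deliberately wrong call that mypy rejects as an incompatible argument type:

from typing import NewType

alertName = NewType('alertName', str)
hostName = NewType('hostName', str)

def manglehost(hname: hostName) -> hostName:
    return hostName(hname.removeprefix('www.'))

an = alertName('DiskFull')
manglehost(an)   # mypy: incompatible type "alertName"; expected "hostName"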

If I want as much protection as possible against this sort of type confusion, I want to make as many things as possible be NewType()s instead of type aliases. Unfortunately NewType()s have some drawbacks in mypy for my sort of usage, as far as I can see.

The first drawback is that you cannot create a NewType of 'Any':

error: Argument 2 to NewType(...) must be subclassable (got "Any")  [valid-newtype]

In order to use NewType, I must specify concrete details of my actual (current) implementation, rather than saying just 'this is a distinct type but anything can be done with it'.

The second drawback is that this distinct typing is actually a problem when you do certain sorts of transformations of collections. Let's say we have alerts, which have a name and a start time, and hosts, which have a hostname and a list of alerts:

alertT  = NewType('alertT',  tuple[str, int])
hostAlT = NewType('hostAlT', tuple[str, list[alertT]])

We have a function that receives a dictionary where the keys are hosts and the values are their alerts and turns it into a sorted list of hosts and their alerts, which is to say a list[hostAlT]. The following Python code looks sensible on the surface:

def toAlertList(hosts: dict[str, list[alertT]]) -> list[hostAlT]:
  linear = list(hosts.items())
  # Don't worry about the sorting for now
  return linear

If you try to check this, mypy will declare:

error: Incompatible return value type (got "list[tuple[str, list[alertT]]]", expected "list[hostAlT]")  [return-value]

Initially I thought this was mypy being limited, but in writing this entry I've realized that mypy is correct. Our .items() returns a tuple[str, list[alertT]], but while it has the same shape as our hostAlT, it is not the same thing; that's what it means for hostAlT to be a distinct type.

However, it is a problem that as far as I know, there is no type checked way to get mypy to convert the list we have into a list[hostAlT]. If you create a new NewType to be the list type, call it 'aListT', and try to convert 'linear' to it with 'l2 = aListT(linear)', you will get more or less the same complaint:

error: Argument 1 to "aListT" has incompatible type "list[tuple[str, list[alertT]]]"; expected "list[hostAlT]"  [arg-type]

This is a case where as far as I can see I must use a type alias for 'hostAlT' in order to get the structural equivalence conversion, or alternately use the wordier and as far as I know less efficient list comprehension version of list() so that I can tell mypy that I'm transforming each key/value pair into a hostAlT value:

linear = [hostAlT(x) for x in hosts.items()]

I'd have the same problem in the actual code (instead of in the type hint checking) if I was using, for example, a namedtuple to represent a host and its alerts. Calling hosts.items() wouldn't generate objects of my named tuple type, just unnamed standard tuples.

Possibly this is a sign that I should go back through my small programs after I more or less finish them and convert this sort of casual use of tuples into namedtuple (or the type hinted version) and dataclass types. If nothing else, this would serve as more explicit documentation for future me about what those tuple fields are. I would have to give up those clever 'list(hosts.items())' conversion tricks in favour of the more explicit list comprehension version, but that's not necessarily a bad thing.

Sidebar: a NewType(...) versus typing.cast(typ, ...)

If you have a distinct NewType() and mypy is happy enough with you, both of these will cause mypy to consider your value to now be of the new type. However, they have different safety levels and restrictions. With cast(), there are no type hint checking guardrails at all; you can cast() an integer literal into an alleged string and mypy won't make a peep. With, for example, 'hostAlT(...)', mypy will apply a certain amount of compatibility checking. However, as we saw above in the 'aListT' example, mypy may still report a problem on the type change and there are certain type changes you can't get it to accept.
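
A two-line demonstration of the difference; mypy is silent about the first conversion and rejects the second:

from typing import NewType, cast

hostName = NewType('hostName', str)

h1 = cast(hostName, 3)   # no guardrails at all: mypy accepts this silently
h2 = hostName(3)         # mypy: incompatible type "int"; expected "str"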

As far as I know, there's no way to get mypy to temporarily switch to a structural compatibility checking here. Perhaps there are deep type safety reasons to disallow that.

Python type hints may not be for me in practice

By: cks
27 November 2024 at 03:58

Python 3 has optional type hints (and has had them for some time), and some time ago I was a bit tempted to start using some of them; more recently, I wrote a small amount of code using them. Recently I needed to write a little Python program and as I started, I was briefly tempted to try type hints. Then I decided not to, and I suspect that this is how it's going to go in the future.

The practical problem of type hints for me when writing the kind of (small) Python programs that I do today is that they necessarily force me to think about the types involved. Well, that's wrong, or at least incomplete; in practice, they force me to come up with types. When I'm putting together a small program, generally I'm not building any actual data structures, records, or the like (things that have a natural type); instead I'm passing around dictionaries and lists and sets and other basic Python types, and I'm revising how I use them as I write more of the program and evolve it. Adding type hints requires me to navigate assigning concrete types to all of those things, and then updating them if I change my mind as I come to a better understanding of the problem and how I want to approach it.

(In writing this it occurs to me that I do often know that I have distinct types (for example, for what functions return) and I shouldn't mix them, but I don't want to specify their concrete shape as dicts, tuples, or whatever. In looking through the typing documentation and trying some things, it doesn't seem like there's an obvious way to do this. Type aliases are explicitly equivalent to their underlying thing, so I can't create a bunch of different names for eg typing.Any and then expect type checkers to complain if I mix them.)

After the code has stabilized I can probably go back to write type hints (at least until I get into apparently tricky things like JSON), but I'm not sure that this would provide very much value. I may try it with my recent little Python thing just to see how much work it is. One possible source of value is if I come back to this code in six months or a year and want to make changes; typing hints could give me both documentation and guardrails given that I'll have forgotten about a lot of the code and structure by then.

(I think the usual advice is that you should write type hints as you write the program, rather than go back after the fact and try to add them, because incrementally writing them during development is easier. But my new Python programs tend to be sufficiently short that doing all of the type hints afterward isn't too much work, and if it gets me to do it at all it may be an improvement.)

PS: It might be easier to do type hints on the fly if I practiced with them, but on the other hand I write new Python programs relatively infrequently these days, making typing hints yet another Python thing I'd have to try to keep in my mind despite it being months since I used them last.

PPS: I think my ideal type hint situation would be if I could create distinct but otherwise unconstrained types for things like function arguments and function returns, have mypy or other typing tools complain when I mixed them, and then later go back to fill in the concrete implementation details of each type hint (eg, 'this is a list where each element is a ...').

What's going on with 'quit' in an interactive CPython session (as of 3.12)

By: cks
27 August 2024 at 01:33

We've probably all been there at some time or other:

$ python
[...]
>>> quit
Use quit() or Ctrl-D (i.e. EOF) to exit

It's an infamous and frustrating 'error' message and we've probably all seen it (there's a similar one for 'exit'). Today I was reminded of this CPython behavior by a Fediverse conversation and as I was thinking about it, the penny belatedly dropped on what is going on here in CPython.

Let's start with this:

>>> type(quit)
<class '_sitebuiltins.Quitter'>

In CPython 3.12 and earlier, the CPython interactive interpreter evaluates Python statements; as far as I know, it has little to no special handling of what you type to it, it just evaluates things and then prints the result under appropriate circumstances. So 'quit' is not special syntax recognized by the interpreter, but instead a Python object. The message being printed is not special handling but instead a standard CPython interpreter feature to helpfully print the representation of objects, which the _sitebuiltins.Quitter class has customized to print this message. You can see all of this in Lib/_sitebuiltins.py, along with classes used for some other, related things.

(Then the 'quit' and 'exit' instances are created and wired up in Lib/site.py, along with a number of other things.)
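
A stripped down version of the trick is easy to sketch yourself; the real Quitter also takes the EOF key name as a parameter and closes sys.stdin before raising SystemExit:

class Quitter:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        # what gets printed when you type a bare 'quit'
        return 'Use %s() or Ctrl-D (i.e. EOF) to exit' % self.name
    def __call__(self, code=None):
        # what happens when you type 'quit()'
        raise SystemExit(code)

quit = Quitter('quit')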

This is changing in Python 3.13 (via), which defaults to using a new interactive shell, which I believe is called 'pyrepl' (see Lib/_pyrepl). Pyrepl has specific support for commands like 'quit', although this support actually reuses the _sitebuiltins code (see REPL_COMMANDS in Lib/_pyrepl/simple_interact.py). Basically, pyrepl knows to call some objects instead of printing their repr() if they're entered alone on a line, so you enter 'quit' and it winds up being the same as if you'd said 'quit()'.

We may want /usr/bin/python to be Python 3 sooner than I expected

By: cks
1 August 2024 at 02:19

For historical reasons, we still have a '/usr/bin/python' that is Python 2 on our Ubuntu 22.04 machines. Yes, we know, Python 2 isn't supported any more, but our users have had more than a decade where /usr/bin/python was Python 2 and while Ubuntu continued to ship a Python 2, we didn't feel like breaking their '#!/usr/bin/python' lines in scripts by either removing /usr/bin/python or making it Python 3. That option ran out in Ubuntu 24.04, which doesn't ship any Python 2 packages and so provides no native way to have a Python 2 /usr/bin/python (you can make a symlink to your own version of Python 2, if you really insist). In my entry on the state of Python in Ubuntu 24.04, I speculated that we might wind up with /usr/bin/python existing and being Python 3 in Ubuntu 26.04. With more time and more water under the bridge, I think we're fairly likely to do that or even move faster, partly because there are forces pushing reasonably strongly in that direction.

One of the things that I've been doing is watching for things running '/usr/bin/python' on our current login servers, because those things are going to break when we start upgrading them from Ubuntu 22.04 to Ubuntu 24.04 and I'd like to warn people in advance. In doing this, I found a number of people who seemed to every now and then run '/usr/bin/python' in a VSCode environment. Now, I rather doubt that people who are using VSCode are writing Python 2 programs here in 2024. Instead, I think it's much more likely that something in their VSCode environment is invoking '/usr/bin/python' and expecting to get Python 3.

Here in 2024, I suspect that this is a perfectly reasonable expectation and almost always works out for whatever VSCode related bit is doing it. Python 2 has been unsupported for four years and probably almost all Linux systems with a /usr/bin/python have it being Python 3 (this has been the situation on Fedora Linux for some time, for example). I also suspect that most Linux systems do have a /usr/bin/python. We are the weird outliers, and as weird outliers we can expect things to not work; at first a few things, and then more things as the assumption that '/usr/bin/python' is the way you get Python 3 becomes embedded in more and more software.

(I suspect that VSCode is not the only thing doing this on our systems, merely the one that's most visible to me right now.)

Having written this entry, I'm now reconsidering our schedule. As far as I can tell, we have low usage of /usr/bin/python today, although my checks aren't necessarily comprehensive, which means that relatively few people will be affected by a change to what it is. So rather than waiting until Ubuntu 26.04 or later to make /usr/bin/python be Python 3, perhaps we should wait only six months or so after we roll out Ubuntu 24.04 before switching from having no /usr/bin/python (and any remaining people having their scripts fail to run) to having it be Python 3. The result would probably be better for both people and programs.

PS: The simple answer to why not immediately switch /usr/bin/python to Python 3 when we move to Ubuntu 24.04 is that the error messages people will get for /usr/bin/python being missing are likely to be clearer than the ones they would get from running Python 2 code under Python 3.

Understanding a Python closure oddity

By: cks
17 June 2024 at 03:31

Recently, Glyph pointed out a Python oddity on the Fediverse and I had to stare at it for a bit to understand what was going on, partly because my mind is partly thinking in Go these days, and Go has a different issue in similar code. So let's start with the code:

def loop():
    for number in range(10):
        def closure():
            return number
        yield closure

eagerly = [each() for each in loop()]
lazily = [each() for each in list(loop())]

The oddity is that 'eagerly' and 'lazily' wind up different, and why.

The first thing that is going on in this Python code is that while 'number' is only used in the for loop, it is an ordinary function local variable. We could set it before the loop and look at it after the loop if we wanted to, and if we did, it would be '9' at the end of the for loop. The consequence and the corollary is that every closure returned in the 'for' loop is using the same 'number' local variable.

(In some languages and in some circumstances, each closure would close over a different instance of 'number'; see for example this Go 1.22 change.)

Since all of the closures are using the same 'number' local variable, what matters for what value they return is when they are called. When you call any of them, it will return the value of 'number' that is in effect in the 'loop' function as of that moment. And if you call any of them after the 'loop' function has finished, 'number' has the value of '9'.

This also means that if you call a single 'each' function more than once, the value it returns can be different. For example:

>>> g = loop()
>>> each0 = g.__next__()
>>> each0()
0
>>> each1 = g.__next__()
>>> each0()
1

(What the 'loop()' call actually returns is a generator. I'm directly calling its magic method to be explicit, rather than using the more general next().)

And in a way this is the difference between 'eagerly' and 'lazily'. For 'eagerly', the list comprehension iterates through the results of 'loop()' and immediately calls each version of 'each' that it obtains, which gets the value of 'number' that is in effect right then. For 'lazily', the 'list(loop())' first collects all of the 'each' closures, which ends the 'for' loop in the 'loop' function and means 'number' is now '9', and then calls all of the 'each' closures, which all return the final value of 'number'.

The 'eagerly' and 'lazily' names may be a bit confusing (they were to me). What they refer to is whether we eagerly or lazily call each closure as it is returned by 'loop()'. In 'eagerly', we call the closures immediately; in 'lazily', we call them only later, after the 'for' loop is done and 'number' has taken on its final value. As Glyph said on the Fediverse, there is another level of eagerness or laziness, which is how aggressively we iterate the generator from 'loop()', and this is actually backward from the names; in 'eagerly' we lazily iterate the generator, while in 'lazily' we eagerly iterate the generator (that's what the 'list()' does).
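
The standard Python idiom if you want each closure to capture the value of 'number' as of its own iteration is a defaulted argument, because default values are evaluated once, when the 'def' statement runs:

def loop():
    for number in range(10):
        def closure(number=number):
            # 'number' is now a parameter local to this closure,
            # bound when the 'def' was executed
            return number
        yield closure

With this version, 'eagerly' and 'lazily' would both come out as [0, 1, ..., 9].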

(I'm writing this entry partly for myself, because someday I may run into an issue like this in my own Python code. If you only use a generator with code patterns like the 'eagerly' case, an issue like this could lurk undetected for some time.)

PyPy has been quietly working for me for several years now

By: cks
30 May 2024 at 02:48

A number of years ago I switched to installing various Python programs through pipx so that each of them got their own automatically managed virtual environment, rather than me having to wrestle with various issues from alternate approaches. On our Ubuntu servers, it wound up being simpler to do this using my own version of PyPy instead of Ubuntu's CPython, for various reasons. I've been operating this way for long enough that I didn't really remember how long.

Recently we got our first cloud server, and I wound up installing our cloud provider's basic CLI tool. This CLI tool has a number of official ways of installing it, but when the dust settles I discovered it was a Python package (with a bunch of additional complicated dependencies) and this package is available on PyPI. So I decided to see if 'pipx install <x>' would work, which it did. Only much later did it occur to me that this very large Python-and-stuff tool was running happily under PyPy, because this is the default if I just 'pipx install' something.

As it turns out, everything I have installed through pipx on our servers is currently installed using PyPy instead of CPython, and all of it works fine. I've been running all sorts of code with PyPy for years without noticing anything different. There is definitely code that will notice (I used to have some), but either I haven't encountered any of it yet or significant packages are now routinely tested under PyPy and hardened against things like deferred garbage collection of open files.

(Some current Python idioms, such as the 'with' statement, avoid this sort of problem, because they explicitly close files and otherwise release resources as you're done with them.)

In a way there's nothing remarkable about this. PyPy's goal is to be a replacement for CPython that simply works while generally being faster. In another way, it's nice to see that PyPy has been basically completely successful in this for me, to the extent that I can forget that my pipx-installed things are all running under PyPy and that a big cloud vendor thing just worked.

The state of Python in Ubuntu 24.04 LTS

By: cks
1 May 2024 at 02:40

Ubuntu 24.04 LTS has just been released and as usual it's on our minds, although not as much so as Ubuntu 22.04 was. So once again I feel like doing a quick review of the state of Python in 24.04, as I did for 22.04. Since Fedora 40 has also just been released I'm going to throw that in too.

The big change between 22.04 and 24.04 for us is that 24.04 has entirely dropped Python 2 packages. There is no CPython 2, which has been unsupported by the main Python developers for years, but there's also no Python 2 version of PyPy, which is supported upstream and will be for a long time (cf). At the moment, the Python 2 binary .debs from Ubuntu 22.04 LTS still install and work well enough for us on Ubuntu 24.04, but the writing is on the wall there. In Ubuntu 26.04 we will likely have to compile our own Python from source (and not the .deb sources, which don't seem to readily rebuild on 24.04). It's possible that someone has a PPA with CPython 2 for 24.04; I haven't looked.

(Yes, we still care about Python 2 because we have system management scripts that have been there for fifteen years and which are written in Python 2.)

In Ubuntu 22.04, /usr/bin/python was an optional symbolic link that could point to either Python 2 or Python 3. In 24.04 it is still an optional symbolic link, but now your only option is Python 3. We've opted to have no /usr/bin/python in our 24.04 installation, so that any of our people who are still using '#!/usr/bin/python' in scripts will have them clearly break. It's possible that in a few years (for Ubuntu 26.04 LTS, if we use it) we'll start having a /usr/bin/python that points to Python 3 (or Ubuntu will make it a mandatory part of their Python 3 package). If nothing else, that would be convenient for interactive use.

Ubuntu 24.04 has Python 3.12.3, which was released this past April 9th; this is really fast work to get it into 24.04 (although since Canonical will be supporting 24.04 for up to five years, they have a bit of a motivation to start with the latest). Perhaps unsurprisingly, Fedora 40 is a bit further behind, with Python 3.12.2. Both Ubuntu 24.04 and Fedora 40 have PyPy 7.3.15. Ubuntu 24.04 only has the Python 3.9 version of PyPy 3; Fedora has both the 3.9 and 3.10 versions.

Both Ubuntu 24.04 and Fedora 40 have pipx available as a standard package. Fedora 40 has version 1.5.0; Ubuntu 24.04 is on 1.4.3. The pipx changelog suggests that this isn't a critical difference, and I'm not certain I'd notice any difference in practice.

I suspect that Fedora won't keep its minimal CPython 2 package around forever, although I don't know what their removal schedule is. Hopefully they will keep the Python 2 version of PyPy around for at least as long as the upstream PyPy supports it. Fedora has more freedom here than Ubuntu does, since a given Fedora release only has to be supported for a year or so, instead of Ubuntu 24.04 LTS's five years (or more, if you pay for extended support from Canonical).

PS: Ubuntu 24.04 has Django version 4.2.11, the latest version of the 4.2 series, which is a sensible choice since the Django 4.2 series is one of the Django project's LTS releases and so will be supported upstream until April 2026, saving Canonical some work (cf).

Please don't try to hot-reload changed Python files too often

By: cks
13 April 2024 at 02:08

There is a person running a Python program on one of our servers, which is something that people do regularly. As far as I can tell, this person's Python program is using some Python framework that supports on the fly reloading (often called hot-reloading) of changed Python code for at least some of the loaded code, and perhaps much or all of it. Naturally, in order to see if you need to hot-reload any code, you need to check whether a bunch of files have changed (at least in our environment, some environments may be able to do this slightly better). This person's Python code is otherwise almost always idle.

The particular Python code involved has decided to check for a need to hot-reload code once every second. In our NFS fileserver environment, this has caused one particular fileserver to see a constant load of about 1100 NFS RPC operations a second, purely from the Python hot-reload code rechecking what appears to be a pile of things every second. These checks are also not cheap on the machine where the code is running; this particular process routinely uses about 7% to 8% of one CPU as it's sitting there otherwise idle.

(There was a time when you didn't necessarily care about CPU usage on otherwise idle machines. In these days of containerization and packing multiple services on one machine and renting the smallest and thus cheapest VPS you can get away with, there may be no such thing as a genuinely idle machine, and all CPU usage is coming from somewhere.)

To be fair, it's possible that the program is being run in some sort of development mode, where fast hot-reload can be potentially important. But people do run 'development mode' in more or less production, and it's possible to detect that. It would be nice if hot-reload code made some efforts to detect that, and perhaps also some efforts to detect when things were completely idle and there had been no detected changes for a long time and it should dial back the frequency of hot-reload checks. But I'm probably tilting at windmills.
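
A sketch of the sort of thing I mean, invented rather than taken from any particular framework; 'reload_if_changed' here is a hypothetical hook that rechecks the files and reports whether anything had to be reloaded:

import time

def reload_loop(reload_if_changed, min_wait=1.0, max_wait=60.0):
    # Back off the checking interval while nothing is changing,
    # and snap back to fast checks as soon as something does.
    wait = min_wait
    while True:
        time.sleep(wait)
        if reload_if_changed():
            wait = min_wait
        else:
            wait = min(wait * 2, max_wait)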

(I also think that you should provide some sort of option to set the hot-reload frequency, because people are going to want to do this sooner or later. You should do this even if you only try to do hot reloading in development mode, because sooner or later people are going to run your development mode in pseudo-production because that's the easiest way for them.)

PS: These days this also applies to true development mode usage of things. People can easily step away from their development environment for meetings or whatever, and they may well be running it on their laptop, where they would like you to not burn up their battery constantly. Just because someone has a development mode environment running doesn't mean they're actively using it right now.

Platform peculiarities and Python (with an example)

By: cks
25 March 2024 at 02:53

I have a long standing little Python tool to turn IP addresses into verified hostnames and report what's wrong if it can't do this (doing verified reverse DNS lookups is somewhat complicated). Recently I discovered that socket.gethostbyaddr() on my Linux machines was only returning a single name for an IP address that was associated with more than one. A Fediverse thread revealed that this reproduced for some people, but not for everyone, and that it also happened in other programs.

The Python socket.gethostbyaddr() documentation doesn't discuss specific limitations like this, but the overall socket documentation does say that the module is basically a layer over the platform's C library APIs. However, it doesn't document exactly what APIs are used, and in this case it matters. Glibc on Linux says that gethostbyaddr() is deprecated in favour of getnameinfo(), so a C program like CPython might reasonably use either to implement its gethostbyaddr(). The C gethostbyaddr() supports returning multiple names (at least in theory), but getnameinfo() specifically does not; it only ever returns a single name.

In practice, the current CPython on Linux will normally use gethostbyaddr_r() (see Modules/socketmodule.c's socket_gethostbyaddr()). This means that CPython isn't restricted to returning a single name, and instead inherits whatever peculiarities glibc (or another libc, for people on Linux distributions that use an alternative libc) has. On glibc, it appears that this behavior depends on what NSS modules you're using, with the default glibc 'dns' NSS module not seeming to normally return multiple names this way, even for glibc APIs where this is possible.

Given all of this, it's not surprising that the CPython documentation doesn't say anything specific. There's not very much specific it can say, since the behavior varies in so many peculiar ways (and has probably changed over time). However, this does illustrate that platform peculiarities are visible through CPython APIs, for better or worse (and, like me, you may not even be aware of those peculiarities until you encounter them). If you want something that is certain to bypass platform peculiarities, you probably need to do it yourself (in this case, probably with dnspython).
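
For instance, dnspython lets you do the reverse lookup yourself and see every PTR record, no matter what the local libc and NSS configuration would have done. A sketch, leaving out the forward-lookup verification step that makes it a properly verified result:

import dns.resolver
import dns.reversename

# example address; substitute a real one
rname = dns.reversename.from_address('198.51.100.1')
for ptr in dns.resolver.resolve(rname, 'PTR'):
    print(ptr.target)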

(The Go documentation for a similar function does specifically say that if it uses the C library it returns at most one result, but that's because the Go authors know their function calls getnameinfo() and as mentioned, that can only return one name (at most).)

From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps

For over 15 years, the Wikimedia Foundation has provided public dumps of the content of all wikis. They are not only useful for archiving or offline reader projects, but can also power tools for semi-automated (or bot) editing such as AutoWikiBrowser. For example, these tools comb through the dumps to generate lists of potential spelling mistakes in articles for editors to fix. For researchers, the dumps have become an indispensable data resource (footnote: Google Scholar lists more than 16,000 papers mentioning the phrase "Wikipedia dumps"). Especially in the area of natural language processing, the use of Wikipedia dumps has become almost ubiquitous with the advancement of large language models such as GPT-3 (and thus by extension also the recently published ChatGPT) or BERT. Virtually all language models are trained on Wikipedia content, especially multilingual models which rely heavily on Wikipedia for many lower-resourced languages.

Over time, the research community has developed many tools to help folks who want to use the dumps. For instance, the mwxml Python library helps researchers work with the large XML files and iterate through the articles within them. Before analyzing the content of the individual articles, researchers must usually further preprocess them, since they come in wikitext format. Wikitext is the markup language used to format the content of a Wikipedia article in order to, for example, highlight text in bold or add links. In order to parse wikitext, the community has built libraries such as mwparserfromhell, developed over 10 years and comprising almost 10,000 lines of code. This library provides an easy interface to identify different elements of an article, such as links, templates, or just the plain text. This ecosystem of tooling lowers the technical barriers to working with the dumps because users do not need to know the details of XML or wikitext.

While convenient, there are severe drawbacks to working with the XML dumps containing articles in wikitext. In fact, MediaWiki translates wikitext into HTML, which is then displayed to the readers. Some elements contained in the HTML version of an article are thus not readily available in the wikitext version, for example due to the use of templates. As a result, researchers who parse only the wikitext may ignore important content that is displayed to readers. For example, a study by Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext version.

Therefore, it is often desirable to work with the HTML versions of articles instead of the wikitext versions. In practice, though, this has remained largely impossible for researchers. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects. Only recently have the Wikimedia Enterprise HTML dumps been introduced and made publicly available with regular monthly updates, so that researchers or anyone else may use them in their work.

However, while the data is available, using it still requires lots of technical expertise from researchers, such as knowing how different elements from wikitext get parsed into HTML elements. In order to lower the technical barriers and improve the accessibility of this incredible resource, we released the first version of mwparserfromhtml, a library that makes it easy to parse the HTML content of Wikipedia articles – inspired by the wikitext-oriented mwparserfromhell.

Figure 1. Examples of different types of elements that mwparserfromhtml can extract from an article

The tool is written in Python and available as a pip-installable package. It provides two main functionalities. First, it allows the user to access all articles in the dump files one by one in an iterative fashion. Second, it contains a parser for the individual HTML of the article. Using the Python library beautifulsoup, we can parse the content of the HTML and extract individual elements (see Figure 1 for examples):

  • Wikilinks (or internal links). These are annotated with additional information about the namespace of the target link and whether it is a disambiguation page, redirect, red link, or interwiki link.
  • External links. We distinguish whether each is named, numbered, or autolinked.
  • Categories
  • Templates
  • References
  • Media. We capture the type of media (image, audio, or video) as well as the caption and alt text (if applicable).
  • Plain text of the articles

We also extract some properties of the elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.
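
As a small illustration of what the underlying HTML looks like: in the Parsoid-generated HTML used by the Enterprise dumps, internal wikilinks carry rel="mw:WikiLink", so pulling them out with beautifulsoup looks roughly like this (the HTML snippet here is made up; mwparserfromhtml's own interface wraps this kind of logic for you):

from bs4 import BeautifulSoup

html = '<p>See <a rel="mw:WikiLink" href="./Python_(programming_language)">Python</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('a[rel~="mw:WikiLink"]'):
    print(link['href'], link.get_text())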

Building the tool posed several challenges. First, it remains difficult to systematically test the output of the tool. While we can verify that we are correctly extracting the total number of links in an article, there is no "right" answer for what the plain text of an article should include. For example, should image captions or lists be included? We manually annotated a handful of example articles in English to evaluate the tool's output, but it is almost certain that we have not captured all possible edge cases. In addition, other language versions of Wikipedia might provide other elements or patterns in the HTML than the tool currently expects. Second, while much of how an article is parsed is handled by the core of MediaWiki and well documented by the Wikimedia Foundation Content Transform Team and the editor community on English Wikipedia, article content can also be altered by wiki-specific Extensions. This includes important features such as citations, and documentation about some of these aspects can be scarce or difficult to track down.

The current version of mwparserfromhtml constitutes a first starting point. There are still many functionalities that we would like to add in the future, such as extracting tables, splitting the plain text into sections and paragraphs, or handling in-line templates used for unit conversion (for example, displaying lbs and kg). If you have suggestions for improvements or would like to contribute, please reach out to us on the repository, and file an issue or submit a merge request.

Finally, we want to acknowledge that the project was started as part of an Outreachy internship with the Wikimedia Foundation. We encourage folks to consider mentoring or applying to the Outreachy program as appropriate.


About this post

Featured image credit: Очистка ртути перегонкой в токе газа.png in the public domain

Figure 1 image credit: Mwparserfromhtml functionality.gif by Isaac (WMF), licensed under the Creative Commons Attribution-Share Alike 4.0 International license
