The length of file names in early Unix

By: cks

If you use Unix today, you can enjoy relatively long file names on more or less any filesystem that you care to name. But it wasn't always this way. Research V7 had 14-byte filenames, and the System III/System V lineage continued this restriction until it merged with BSD Unix, which had significantly increased this limit as part of moving to a new filesystem (initially called the 'Fast File System', for good reasons). You might wonder where this unusual number came from, and for that matter, what the file name limit was on very early Unixes (it was 8 bytes, which surprised me; I vaguely assumed that it had been 14 from the start).

I've mentioned before that the early versions of Unix had a quite simple format for directory entries. In V7, we can find the directory structure specified in sys/dir.h (dir(5) helpfully directs you to sys/dir.h), which is so short that I will quote it in full:

#ifndef	DIRSIZ
#define	DIRSIZ	14
#endif
struct	direct
{
    ino_t    d_ino;
    char     d_name[DIRSIZ];
};

To fill in the last blank, ino_t was a 16-bit (two byte) unsigned integer (and field alignment on PDP-11s meant that this structure required no padding), for a total of 16 bytes. This directory structure goes back to V4 Unix. In V3 Unix and before, directory entries were only ten bytes long, with 8 byte file names.

(Unix V4 (the Fourth Edition) was when the kernel was rewritten in C, so that may have been considered a good time to do this change. I do have to wonder how they handled the move from the old directory format to the new one, since Unix at this time didn't have multiple filesystem types inside the kernel; you just had the filesystem, plus all of your user tools knew the directory structure.)

One benefit of the change in filename size is that 16-byte directory entries fit evenly in 512-byte disk blocks (or other powers-of-two buffer sizes). You never have a directory entry that spans two disk blocks, so you can deal with directories a block at a time. Ten byte directory entries don't have this property; eight-byte ones would, but then that would leave space for only six character file names, and presumably that was considered too small even in Unix V1.

PS: That inode numbers in V7 (and earlier) were 16-bit unsigned integers does mean what you think it means; there could only be at most 65,536 inodes in a single classical V7 filesystem. If you needed more files, you had better make more filesystems. Early Unix had a lot of low limits like that, some of them quite hard-coded.

The lack of a good command line way to sort IPv6 addresses

By: cks

A few years ago, I wrote about how 'sort -V' can sort IPv4 addresses into their natural order for you. Even back then I was smart enough to put in that 'IPv4' qualification and note that this didn't work with IPv6 addresses, and said that I didn't know of any way to handle IPv6 addresses with existing command line tools. As far as I know, that remains the case today, although you can probably build a Perl, Python, or other language program that does such sorting for you if you need to do this regularly.

Unix tools like 'sort' are pretty flexible, so you might innocently wonder why it can't be coerced into sorting IPv6 addresses. The first problem is that IPv6 addresses are written in hex without leading 0s, not decimal. Conventional sort will correctly sort hex numbers if all of the numbers are the same length, but IPv6 addresses are written in hex groups that conventionally drop leading zeros, so you will have 'ff' instead of '00ff' in common output (or '0' instead of '0000'). The second and bigger problem is the IPv6 '::' notation, which stands for the longest run of all-zero fields, ie some number of '0000' fields.

(I'm ignoring IPv6 scopes and zones for this, let's assume we have public IPv6 addresses.)

If IPv6 addresses were written out in full, with leading 0s on fields and all their 0000 fields, you could handle them as a simple conventional sort (you wouldn't even need to tell sort that the field separator was ':'). Unfortunately they almost never are, so you need to either transform them to that form, print them out, sort the output, and perhaps transform them back, or read them into a program as 128-bit numbers, sort the numbers, and print them back out as IPv6 addresses. Ideally your language of choice for this has a way to sort a collection of IPv6 addresses.
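
If you do end up writing that program, it's only a few lines in something like Python, whose standard ipaddress module gives you an IPv6 address type that compares as the underlying 128-bit number. Here's a minimal sketch that assumes addresses arrive one per line on standard input (and, like the rest of this entry, it ignores scopes and zones):

#!/usr/bin/env python3
# Sort IPv6 addresses read one per line on standard input into their
# natural numeric order. IPv6Address objects compare as 128-bit numbers,
# and printing them gives back the usual compressed textual form.
import ipaddress
import sys

addrs = [ipaddress.IPv6Address(line.strip()) for line in sys.stdin if line.strip()]
for addr in sorted(addrs):
    print(addr)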

The very determined can probably do this with awk with enough work (people have done amazing things in awk). But my feeling is that doing this in conventional Unix command line tools is a Turing tarpit; you might as well use a language where there's a type of IPv6 addresses that exposes the functionality that you need.

(And because IPv6 addresses are so complex, I suspect that GNU Sort will never support them directly. If you need GNU Sort to deal with them, the best option is a program that turns them into their full form.)

PS: People have probably written programs to sort IPv6 addresses, but with the state of the Internet today, the challenge is finding them.

Some notes on using 'join' to supplement one file with data from another

By: cks

Recently I said something vaguely grumpy about the venerable Unix 'join' tool. As the POSIX specification page for join will unhelpfully tell you, join is a 'relational database operator', which means that it implements the rough equivalent of SQL joins. One way to use join is to add additional information for some lines in your input data.

Suppose, not entirely hypothetically, that we have an input file (or data stream) that starts with a login name and contains some additional information, and that for some logins (but not all of them) we have useful additional data about them in another file. Using join, the simple case of this is easy, if the 'master' and 'suppl' files are already sorted:

join -1 1 -2 1 -a 1 master suppl

(I'm sticking to POSIX syntax here. Some versions of join accept '-j 1' as an alternative to '-1 1 -2 1'.)

Our specific options tell join to join each line of 'master' and 'suppl' on the first field in each (the login) and print them, and also print all of the lines from 'master' that didn't have a login in 'suppl' (that's the '-a 1' argument). For lines with matching logins, we get all of the fields from 'master' and then all of the extra fields from 'suppl'; for lines from 'master' that don't match, we just get the fields from 'master'. Generally you can tell which lines got supplemented and which ones didn't by how many fields they have.
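
To make this concrete, here's what it looks like with some small invented files (the logins and fields are made up, and both files are already sorted):

; cat master
ankur staff 2021
hana admin 2019
mike staff 2023
; cat suppl
ankur building-A
mike building-C
; join -1 1 -2 1 -a 1 master suppl
ankur staff 2021 building-A
hana admin 2019
mike staff 2023 building-C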

If we want something other than all of the fields in the order they appear in the existing data sources, in theory we have the '-o <list>' option to tell join which fields from each source to output. However, this option has a little problem, which I will show you by quoting the important bit from the POSIX standard (emphasis mine):

The fields specified by list shall be written for all selected output lines. *Fields selected by list that do not appear in the input shall be treated as empty output fields.*

What that means is that if we're also printing non-joined lines from our 'master' file, our '-o' still applies and any fields we specified from 'suppl' will be blank and empty (unless you use '-e'). This can be inconvenient if you were re-ordering fields so that, for example, a field from 'suppl' was listed before some fields from 'master'. It also means that you want to use '1.1' to get the login from 'master', which is always going to be there, not '2.1', the login from 'suppl', which is only there some of the time.

(All of this assumes that your supplementary file is listed second and the master file first.)

On the other hand, using '-e' we can simplify life in some situations. Suppose that 'suppl' contains only one additional interesting piece of information, and it has a default value that you'll use if 'suppl' doesn't contain a line for the login. Then if 'master' has three fields and 'suppl' two, we can write:

join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl

Now we don't have to try to tell whether or not a line from 'master' was supplemented by counting how many fields it has; everything has the same number of fields, it's just sometimes the last (supplementary) field is the default value.
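
With the same invented files as above and a default value of, say, 'unknown', that looks like:

; join -1 1 -2 1 -a 1 -e unknown -o '1.1,1.2,1.3,2.2' master suppl
ankur staff 2021 building-A
hana admin 2019 unknown
mike staff 2023 building-C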

(This is harder to apply if you have multiple fields from the 'suppl' file, but possibly you can find a 'there is nothing here' value that works for the rest of your processing.)

The appeal of keyboard launchers for (Unix) desktops

By: cks

A keyboard launcher is a big part of my (modern) desktop, but over on the Fediverse I recently said something about them in general:

I don't necessarily suggest that people use dmenu or some equivalent. Keyboard launchers in GUI desktops are an acquired taste and you need to do a bunch of setup and infrastructure work before they really shine. But if you like driving things by the keyboard and will write scripts, dmenu or equivalents can be awesome.

The basic job of a pure keyboard launcher is to let you hit a key, start typing, and then select and do 'something'. Generally the keyboard launcher will make a window appear so that you can see what you're typing and maybe what you could complete it to or select.

The simplest and generally easiest way to use a keyboard launcher, and how many of them come configured to work, is to use it to select and run programs. You can find a version of this idea in GNOME, and even Windows has a pseudo-launcher in that you can hit a key to pop up the Start menu and the modern Start menu lets you type in stuff to search your programs (and other things). One problem with the GNOME version, and many basic versions, is that in practice you don't necessarily launch desktop programs all that often or launch very many different ones, so you can have easier ways to invoke the ones you care about. One problem with the Windows version (at least in my experience) is that it will do too much, which is to say that no matter what garbage you type into it by accident, it will do something with that garbage (such as launching a web search).

The happy spot for a keyboard launcher is somewhere in the middle, where it does a variety of things that are useful for you but not without limits. The best keyboard launcher for your desktop is one that gives you fast access to whatever things you do a lot, ideally with completion so you type as little as possible. When you have it tuned up and working smoothly the feel is magical; I tap a key, type a couple of characters and then hit tab, hit return, and the right thing happens without me thinking about it, all fast enough that I can and do type ahead blindly (which then goes wrong if the keyboard launcher doesn't start fast enough).

The problem with keyboard launchers, and why they're not for everyone, is that everyone has a different set of things that they do a lot and that are useful for them to trigger entirely through the keyboard. No keyboard launcher will come precisely set up for what you do a lot in their default installation, so at a minimum you need to spend the time and effort to curate what the launcher will do and how it does it. If you're more ambitious, you may need to build supporting scripts that give the launcher a list of things to complete and then act on them when you complete one. If you don't curate the launcher and throw in the kitchen sink, you wind up with the Windows experience where it will certainly do something when you type things but perhaps not really what you wanted.

(For example, I routinely ssh to a lot of our machines, so my particular keyboard launcher setup lets me type a machine name (with completion) to start a session to it. But I had to build all of that, including sourcing the machine names I wanted included from somewhere, and this isn't necessarily useful for people who aren't constantly ssh'ing to machines.)
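
To give a flavour of the glue involved, the core of such a setup can be as small as the following sketch; the machine list file, the dmenu prompt, and the use of xterm here are stand-ins, not my actual setup:

#!/bin/sh
# Pick a machine name with dmenu and ssh to it in a new terminal.
# In real life the list would be generated from somewhere authoritative
# rather than kept in a hand-maintained file.
host=$(sort -u "$HOME/.ssh-machines" | dmenu -p 'ssh:') || exit 0
[ -n "$host" ] && exec xterm -e ssh "$host"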

There are a variety of keyboard launchers for both X and Wayland, basically none of which I have any experience with. See the Arch Wiki section on application launchers. Someday I will have to get a Wayland equivalent to my particular modified dmenu, a thought that fills me with no more enthusiasm than any other part of replacing my whole X environment.

PS: Another issue with keyboard launchers is that sometimes you're wrong about what you want to do with them. I once built an entire keyboard launcher setup to select terminal windows and then later wound up abandoning it when I didn't use it enough.

Unix files have (at least) two sizes

By: cks

I'll start by presenting things in illustrated form:

; ls -l testfile
-rw-r--r-- 1 cks 262144 Apr 13 22:03 testfile
; ls -s testfile
1 testfile
; ls -slh testfile
512 -rw-r--r-- 1 cks 256K Apr 13 22:03 testfile

The two well known sizes that Unix files have are the logical 'size' in bytes and what stat.h describes as "the number of blocks allocated for this object", often converted to some number of bytes (as ls is doing here in the last command). A file's size in bytes is roughly speaking the last file offset that has been written to in the file, and not all of the bytes covered by it may have actually been written; when this is the case, the result is a sparse file. Sparse files are the traditional cause of a mismatch between the byte size and the number of blocks a file uses. However, that is not what is happening here.

This file is on a ZFS filesystem with ZFS's compression turned on, and it was created with 'dd if=/dev/zero of=testfile bs=1k count=256'. In ZFS, zeroes compress extremely well, and so ZFS has written basically no physical data blocks and faithfully reported that (minimal) number in the stat() st_blocks field. However, at the POSIX level we have indeed written data to all 256 KBytes of the file; it's not a sparse file. This is an extreme example of filesystem compression, and there are plenty of lesser ones.

This leaves us with a third size, which is the number of logical blocks for this file. When a filesystem is doing data compression, this number will be different from the number of physical blocks used. As far as I can tell, the POSIX stat.h description doesn't specify which one you have to report for st_blocks. As we can see, ZFS opts to report the physical block size of the file, which is probably the more useful number for the purposes of things like 'du'. However, it does leave us with no way of finding out the logical block size, which we may care about for various reasons (for example, if our backup system can skip unwritten sparse blocks but always writes out uncompressed blocks).

This also implies that a non-sparse file can change its st_blocks number if you move it from one filesystem to another. One filesystem might have compression on and the other one have it off, or they might have different compression algorithms that give different results. In some cases this will cause the file's space usage to expand so that it doesn't actually fit into the new filesystem (or for a tree of files to expand their space usage).

(I don't know if there are any Unix filesystems that report the logical block size in st_blocks and only report the physical block size through a private filesystem API, if they report it at all.)
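
If you want to look at the two sizes programmatically rather than through ls, here's a minimal sketch in Python, using the 'testfile' from the session above; note that st_blocks is traditionally counted in 512-byte units on Unix.

import os

st = os.stat("testfile")
print("byte size: ", st.st_size)
# st_blocks is traditionally in 512-byte units, whatever the filesystem's
# actual block size is
print("allocated: ", st.st_blocks * 512, "bytes in", st.st_blocks, "blocks")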

One way to set up local programs in a multi-architecture Unix environment

By: cks

Back in the old days, it used to be reasonably routine to have 'multi-architecture' Unix environments with shared files (where here architecture was a combination of the process architecture and the Unix variant). The multi-architecture days have faded out, and with them fading, so has information about how people made this work with things like local binaries.

In the modern era of large local disks and build farms, the default approach is probably to simply build complete copies of '/local' for each architecture type and then distribute the result around somehow. In the old days people were a lot more interested in reducing disk space by sharing common elements and then doing things like NFS-mounting your entire '/local', which made life more tricky. There likely were many solutions to this, but the one I learned at the university as a young sprout worked like the following.

The canonical paths everyone used and had in their $PATH were things like /local/bin, /local/lib, /local/man, and /local/share. However, you didn't (NFS) mount /local; instead, you NFS mounted /local/mnt (which was sort of an arbitrary name, as we'll see). In /local/mnt there were 'share' and 'man' directories, and also a per-architecture directory for every architecture you supported, with names like 'solaris-sparc' or 'solaris-x86'. These per-architecture directories contained 'bin', 'lib', 'sbin', and so on subdirectories.

(These directories contained all of the locally installed programs, all jumbled together, which did have certain drawbacks that became more and more apparent as you added more programs.)

Each machine had a /local directory on its root filesystem that contained /local/mnt, symlinks from /local/share and /local/man to 'mnt/share' and 'mnt/man', and then symlinks for the rest of the directories that went to 'mnt/<arch>/bin' (or sbin or lib). Then everyone mounted /local/mnt on, well, /local/mnt. Since /local and its contents were local to the machine, you could have different symlinks on each machine that used the appropriate architecture (and you could even have built them on boot if you really wanted to, although in practice they were created when the machine was installed).
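
In concrete terms, setting up a machine looked roughly like the following sketch, with 'solaris-sparc' standing in for whatever architecture the machine actually was:

# per-machine setup, done once at install time (a sketch)
mkdir -p /local/mnt
ln -s mnt/share /local/share
ln -s mnt/man /local/man
# the architecture-dependent directories point into this machine's arch tree
for d in bin sbin lib; do
    ln -s mnt/solaris-sparc/$d /local/$d
done
# and then the shared tree is NFS-mounted on top of /local/mnt
# (the exact mount command and the server name varied from Unix to Unix)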

When you built software for this environment, you told it that its prefix was /local, and let it install itself (on a suitable build server) using /local/bin, /local/lib, /local/share and so on as the canonical paths. You had to build (and install) software repeatedly, once for each architecture, and it was on the software (and you) to make sure that /local/share/<whatever> was in fact the same from architecture to architecture. System administrators used to get grumpy when people accidentally put architecture dependent things in their 'share' areas, but generally software was pretty good about this in the days when it mattered.

(In some variants of this scheme, the mount points were a bit different because the shared stuff came from one NFS server and the architecture dependent parts from another, or might even be local if your machine was the only instance of its particular architecture.)

There were much more complicated schemes that various places did (often universities), including ones that put each separate program or software system into its own directory tree and then glued things together in various ways. Interested parties can go through LISA proceedings from the 1980s and early 1990s.

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.
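
(For concreteness, the pattern being argued about is roughly this sketch, here in Python with an invented file name: re-open your journal or log, fsync() it, and only then append new data.)

import os

# Re-open an existing journal for appending and try to make sure that
# everything previously written to it is actually on disk before we
# add more records.
fd = os.open("journal.log", os.O_WRONLY | os.O_APPEND)
os.fsync(fd)          # the 'fsync() after open()' step under discussion
# ... then append new records with os.write(fd, ...) ...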

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)
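
(A toy way to picture this, which is purely an illustration and not how any real kernel is organized: if the cache is a mapping keyed by (filesystem, inode number, block number), every file descriptor and memory mapping for the same file lands on the same cache entries with no extra machinery.)

# Toy illustration only: an in-memory 'cache' keyed by inode rather than
# by file descriptor or file object.
cache = {}   # (fs_id, inode_number, block_number) -> block contents

def cache_write(fs_id, ino, blkno, data):
    cache[(fs_id, ino, blkno)] = data

def cache_read(fs_id, ino, blkno):
    # a real kernel would fall back to reading the block from disk here
    return cache.get((fs_id, ino, blkno))

# Two processes with separate file descriptors still resolve to the same
# inode, so one's write is immediately visible to the other's read.
cache_write("fs0", 1234, 0, b"new data")
print(cache_read("fs0", 1234, 0))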

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

Why I have a little C program to filter a $PATH (more or less)

By: cks

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)
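
(For illustration, the same filtering can be sketched in a few lines of Python; this isn't my actual isdirs, just the idea of keeping arguments that stat() as directories and dropping any whose (device, inode) pair has already been seen.)

#!/usr/bin/env python3
# Print only the arguments that are existing directories, dropping
# duplicates that are really the same directory (eg via symlinks).
import os
import stat
import sys

seen = set()
kept = []
for d in sys.argv[1:]:
    try:
        st = os.stat(d)
    except OSError:
        continue                      # doesn't exist or isn't reachable
    if not stat.S_ISDIR(st.st_mode):
        continue
    if (st.st_dev, st.st_ino) in seen:
        continue                      # a duplicate, eg a symlink to a directory we already have
    seen.add((st.st_dev, st.st_ino))
    kept.append(d)
print(" ".join(kept))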

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only builtin that will tell you if a directory exists is to try to cd to it. That we change directories is harmless because everything is running inside the equivalent of a Bourne shell $(...).

Keen eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.

The profusion of things that could be in your $PATH on old Unixes

By: cks

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad things for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new to you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

How to accidentally get yourself with 'find ... -name something*'

By: cks

Suppose that you're in some subdirectory /a/b/c, and you want to search all of /a for the presence of files for any version of some program:

u@h:/a/b/c$ find /a -name program* -print

This reports '/a/b/c/program-1.2.tar' and '/a/b/f/program-1.2.tar', but you happen to know that there are other versions of the program under /a. What happened to a command that normally works fine?

As you may have already spotted, what happened is the shell's wildcard expansion. Because you ran your find in a directory that contained exactly one match for 'program*', the shell expanded it before you ran find, and what you actually ran was:

find /a -name program-1.2.tar -print

This reported the two instances of program-1.2.tar in the /a tree, but not the program-1.4.1.tar that was also in the /a tree.

If you'd run your find command in a directory without a shell match for the -name wildcard, the shell would (normally) pass the unexpanded wildcard through to find, which would do what you want. And if there had been only one instance of 'program-1.2.tar' in the tree, in your current directory, it might have been more obvious what went wrong; instead, the find returning more than one result made it look like it was working normally apart from inexplicably not finding and reporting 'program-1.4.1.tar'.

(If there were multiple matches for the wildcard in the current directory, 'find' would probably have complained and you'd have realized what was going on.)
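
The fix is to quote the wildcard so that find is the one to expand it:

u@h:/a/b/c$ find /a -name 'program*' -print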

Some shells have options to cause failed wildcard expansions to be considered an error; Bash has the 'failglob' shopt, for example. People who turn these options on are probably not going to stumble into this because they've already been conditioned to quote wildcards for 'find -name' and other similar tools. Possibly this Bash option or its equivalent in other shells should be the default for new Unix accounts, just so everyone gets used to quoting wildcards that are supposed to be passed through to programs.

(Although I don't use a shell that makes failed wildcard expansions an error, I somehow long ago internalized the idea that I should quote all wildcards I want to pass to programs.)

The history and use of /etc/glob in early Unixes

By: cks

One of the innovations that the V7 Bourne shell introduced was built in shell wildcard globbing, which is to say expanding things like *, ?, and so on. Of course Unix had shell wildcards well before V7, but in V6 and earlier, the shell didn't implement globbing itself; instead this was delegated to an external program, /etc/glob (this affects things like looking into the history of Unix shell wildcards, because you have to know to look at the glob source, not the shell).

As covered in places like the V6 glob(8) manual page, the glob program was passed a command and its arguments (already split up by the shell), and went through the arguments to expand any wildcards it found, then exec()'d the command with the now expanded arguments. The shell operated by scanning all of the arguments for (unescaped) wildcard characters. If any were found, the shell exec'd /etc/glob with the whole show; otherwise, it directly exec()'d the command with its arguments. Quoting wildcards used a hack that will be discussed later.

This basic /etc/glob behavior goes all the way back to Unix V1, where we have sh.s and in it we can see that invocation of /etc/glob. In V2, glob is one of the programs that have been rewritten in C (glob.c), and in V3 we have a sh.1 that mentions /etc/glob and has an interesting BUGS note about it:

If any argument contains a quoted "*", "?", or "[", then all instances of these characters must be quoted. This is because sh calls the glob routine whenever an unquoted "*", "?", or "[" is noticed; the fact that other instances of these characters occurred quoted is not noticed by glob.

This section has disappeared in the V4 sh.1 manual page, which suggests that the V4 shell and /etc/glob had acquired the hack they use in V5 and V6 to avoid this particular problem.

How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell (when it's deciding whether or not it needs to run /etc/glob) or by /etc/glob itself (when it's deciding what to expand). However, obviously neither the shell nor /etc/glob can pass such 'marked as quoted' characters to actual commands, so each of them strips the high bit from all characters before exec()'ing actual commands.

(This is clearer in the V5 glob.c source; look for how cat() ands every character with octal 0177 (0x7f) to drop the high bit. You can also see it in the V5 sh.c source, where you want to look at trim(), and also the #define for 'quote' at the start of sh.c and how it's used later.)
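
(As a toy illustration of the trick, in Python rather than PDP-11 assembly or early C: quoting marks a character by setting its high bit, wildcard detection only looks for the unmarked characters, and the marks are stripped again just before exec().)

QUOTE = 0x80                     # the 'this character was quoted' marker

def mark_quoted(ch):
    # what the shell does to quoted characters while parsing the command line
    return chr(ord(ch) | QUOTE)

def has_wildcards(arg):
    # a marked '*' is no longer the character '*', so it isn't seen here
    return any(ch in "*?[" for ch in arg)

def strip_marks(arg):
    # what both the shell and /etc/glob do just before exec()ing the command
    return "".join(chr(ord(ch) & 0x7f) for ch in arg)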

PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

PPS: See also Stephen R. Bourne's 2015 presentation from BSDCan [PDF], which has a bunch of interesting things on the V7 shell and confirms that /etc/glob was there from V1.

What a FreeBSD kernel message about your bridge means

By: cks

Suppose, not hypothetically, that you're operating a FreeBSD based bridging firewall (or some other bridge situation) and you see something like the following kernel message:

kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix0 to ix1
kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix1 to ix0

The bad news is that this message means what you think it means. Your FreeBSD bridge between ix0 and ix1 first saw this MAC address as the source address on a packet it received on the ix0 interface of the bridge, and then it saw the same MAC address as the source address of a packet received on ix1, and then it received another packet on ix0 with that MAC address as the source address. Either you have something echoing those packets back on one side, or there is a network path between the two sides that bypasses your bridge.

(If you're lucky this happens regularly. If you're not lucky it happens only some of the time.)

This particular message comes from bridge_rtupdate() in sys/net/if_bridge.c, which is called to update the bridge's 'routing entries', which here means MAC addresses, not IP addresses. This function is called from bridge_forward(), which forwards packets, which is itself called from bridge_input(), which handles received packets. All of this only happens if the underlying interfaces are in 'learning' mode, but this is the default.

As covered in the ifconfig manual page, you can inspect what MAC addresses have been learned on which device with 'ifconfig bridge0 addr' (covered in the 'Bridge Interface Parameters' section of the manual page). This may be useful to see if your bridge normally has a certain MAC address (perhaps the one that's moving) on the interface it should be on. If you want to go further, it's possible to set a static mapping for some MAC addresses, which will make them stick to one interface even if seen on another one.
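
For reference, that looks something like the following, using the interface and MAC address from the example messages; check ifconfig(8) on your FreeBSD version before trusting the exact syntax:

# show the MAC addresses the bridge has learned, and on which member
ifconfig bridge0 addr
# pin a MAC address to the member interface it's supposed to be behind
ifconfig bridge0 static ix0 01:02:03:04:05:06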

Logging this message is controlled by the net.link.bridge.log_mac_flap sysctl, and it's rate limited to only being reported five times a second in general (using ppsratecheck()). That's five times total, even if each time is a different MAC address or even a different bridge. This 'five times a second' log count isn't controllable through a sysctl.

(I'm writing all of this down because I looked much of it up today. Sometimes I'm a system programmer who goes digging in the (FreeBSD) kernel source just to be sure.)

My unusual X desktop wasn't made 'from scratch' in a conventional sense

By: cks

There are people out there who set up unusual (Unix) environments for themselves from scratch; for example, Mike Hoye recently wrote Idiosyncra. While I have an unusual desktop, I haven't built it from scratch in quite the same way that Mike Hoye and other people have; instead I've wound up with my desktop through a rather easier process.

It would be technically accurate to say that my current desktop environment has been built up gradually over time (including over the time I've been writing Wandering Thoughts, such as my addition of dmenu). But this isn't really how it happened, in that I didn't start from a normal desktop and slowly change it into my current one. The real story is that the core of my desktop dates from the days when everyone's X desktops looked like mine does. Technically there were what we would call full desktops back in those days, if you had licensed the necessary software from your Unix vendor and chose to run it, but hardware was sufficiently slow back then that people at universities almost always chose to run more lightweight environments (especially since they were often already using the inexpensive and slow versions of workstations).

(Depending on how much work your local university system administrators had done, your new Unix account might start out with the Unix vendor's X setup, or it could start out with what X11R<whatever> defaulted to when built from source, or it might be some locally customized setup. In all cases you often were left to learn about the local tastes in X desktops and how to improve yours from people around you.)

To show how far back this goes (which is to say how little of it has been built 'from scratch' recently), my 1996 SGI Indy desktop has much of the look and the behavior of my current desktop, and its look and behavior wasn't new then; it was an evolution of my desktop from earlier Unix workstations. When I started using Linux, I migrated my Indy X environment to my new (and better) x86 hardware, and then as Linux has evolved and added more and more things you have to run to have a usable desktop with things like volume control, your SSH agent, and automatically mounted removable media, I've added them piece by piece (and sometimes updated them as how you do this keeps changing).

(At some point I moved from twm as my window manager to fvwm, but that was merely redoing my twm configuration in fvwm, not designing a new configuration from scratch.)

I wouldn't want to start from scratch today to create a new custom desktop environment; it would be a lot of work (and the one time I looked at it I wound up giving up). Someday I will have to move from X, fvwm, dmenu, and so on to some sort of Wayland based environment, but even when I do I expect to make the result as similar to my current X setup as I can, rather than starting from a clean sheet design. I know what I want because I'm very used to my current environment and I've been using variants of it for a very long time now.

(This entry was sparked by Ian Z aka nobrowser's comment on my entry from yesterday.)

PS: Part of the long lineage and longevity of my X desktop is that I've been lucky and determined enough to use Unix and X continuously at work, and for a long time at home as well. So I've never had a time when I moved away from X on my desktop(s) and then had to come back to reconstruct an environment and catch it up to date.

PPS: This is one of the roots of my xdm heresy, where my desktops boot into a text console and I log in there to manually start X with a personal script that's a derivative of the ancient startx command.

In an unconfigured Vim, I want to do ':set paste' right away

By: cks

Recently I wound up using a FreeBSD machine, where I promptly installed vim for my traditional reason. When I started modifying some files, I had contents to paste in from another xterm window, so I tapped my middle mouse button while in insert mode (ie, I did the standard xterm 'paste text' thing). You may imagine the 'this is my face' meme when what vim inserted was the last thing I'd deleted in vim on that FreeBSD machine, instead of my X text selection.

For my future use, the cure for this is ':set paste', which turns off basically all of vim's special handling of pasted text. I've traditionally used this to override things like vim auto-indenting or auto-commenting the text I'm pasting in, but it also turns off vim's special mouse handling, which is generally active in terminal windows, including over SSH.

(The defaults for ':set mouse' seem to vary from system to system and probably vim build to vim build. For whatever reason, this FreeBSD system and its vim defaulted to 'mouse=a', ie special mouse handling was active all the time. I've run into mouse handling limits in vim before, although things may have changed since then.)
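
(For my own future reference, the incantations involved are just these; ':set mouse=' is the narrower option if all you want is to get vim's mouse handling out of the way.)

:set paste      " turn off vim's special handling of pasted text
:set nopaste    " turn the special handling back on afterward
:set mouse=     " narrower alternative: disable vim's terminal mouse handling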

In theory, as covered in Vim's X11 selection mechanism, I might be able to paste from another xterm (or whatever) using "*p (to use the '"*' register, which is the primary selection or the cut buffer if there's no primary selection). In practice I think this only works under limited circumstances (although I'm not sure what they are) and the Vim manual itself tells you to get used to using Shift with your middle mouse button. I would rather set paste mode, because that gets everything; a vim that has the mouse active probably has other things I don't want turned on too.

(Some day I'll put together a complete but minimal collection of vim settings to disable everything I want disabled, but that day isn't today.)

PS: If I'm reading various things correctly, I think vim has to be built with the 'xterm_clipboard' option in order to pull out selection information from xterm. Xterm itself must have 'Window Ops' allowed, which is not a normal setting; with this turned on, vim (or any other program) can use the selection manipulation escape sequences that xterm documents in "Operating System Commands". These escape sequences don't require that vim have direct access to your X display, so they can be used over plain SSH connections. Support for these escape sequences is probably available in other terminal emulators too, and these terminal emulators may have them always enabled.

(Note that access to your selection is a potential security risk, which is probably part of why xterm doesn't allow it by default.)

WireGuard on OpenBSD just works (at least as a VPN server)

By: cks

A year or so ago I mentioned that I'd set up WireGuard on an Android and an iOS device in a straightforward VPN configuration. What I didn't mention in that entry is that the other end of the VPN was not on a Linux machine, but on one of our OpenBSD VPN servers. At the time it was running whatever was the then-current OpenBSD version, and today it's running OpenBSD 7.6, which is the current version at the moment. Over that time (and before it, since the smartphones weren't its first WireGuard clients), WireGuard on OpenBSD has been trouble free and has just worked.

In our configuration, OpenBSD WireGuard requires installing the 'wireguard-tools' package, setting up an /etc/wireguard/wg0.conf (perhaps plus additional files for generated keys), and creating an appropriate /etc/hostname.wg0. I believe that all of these are covered as part of the standard OpenBSD documentation for setting up WireGuard. For this VPN server I allocated a /24 inside the RFC 1918 range we use for VPN service to be used for WireGuard, since I don't expect too many clients on this server. The server NATs WireGuard connections just as it NATs connections from the other VPNs it supports, which requires nothing special for WireGuard in its /etc/pf.conf.

(I did have to remember to allow incoming traffic to the WireGuard UDP port. For this server, we allow WireGuard clients to send traffic to each other through the VPN server if they really want to, but in another one we might want to restrict that with additional pf rules.)
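
To give a rough idea of the shape of this, here's a heavily simplified sketch with an invented client subnet and the usual example port; our real files are longer and differ in the details:

# /etc/hostname.wg0 (sketch)
inet 10.10.10.1/24
!/usr/local/bin/wg setconf wg0 /etc/wireguard/wg0.conf
up

# /etc/wireguard/wg0.conf holds the server's private key, the listen
# port, and a [Peer] section (public key plus allowed IPs) per client.

# /etc/pf.conf needs a rule along these lines to let clients in:
pass in on egress inet proto udp to (egress) port 51820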

Everything I'd expect to work does work, both in terms of the WireGuard tools (I believe the information 'wg' prints is identical between Linux and OpenBSD, for example) and for basic system metrics (as read out by, for example, the OpenBSD version of the Prometheus host agent, which has overall metrics for the 'wg0' interface). If we wanted per-client statistics, I believe we could probably get them through this third party WireGuard Prometheus exporter, which uses an underlying package to talk to WireGuard that does apparently work on OpenBSD (although this particular exporter can potentially have label cardinality issues), or generate them ourselves by parsing 'wg' output (likely from 'wg show all dump').

This particular OpenBSD VPN server is sufficiently low usage that I haven't tried to measure either the possible bandwidth we can achieve with WireGuard or the CPU usage of WireGuard. Historically, neither are particularly critical for our VPNs in general, which have generally not been capable of particularly high bandwidth (with either OpenVPN or L2TP, our two general usage VPN types so far; our WireGuard VPN is for system staff only).

(In an ideal world, none of this should count as surprising. In this world, I like to note when things that are a bit out of the mainstream just work for me, with a straightforward setup process and trouble free operation.)

Unix's buffered IO in assembly and in C

By: cks

Recently on the Fediverse, I said something related to Unix's pre-V7 situation with buffered IO:

[...]

(I think the V1 approach is right for an assembly based minimal OS, while the stdio approach kind of wants malloc() and friends.)

The V1 approach, as documented in its putc.3 and getw.3 manual pages, is that the caller to the buffered IO routines supplies the data area used for buffering, and the library functions merely initialize it and later use it. How you get the data area is up to you and your program; you might, for example, simply have a static block of memory in your BSS segment. You can dynamically allocate this area if you want to, but you don't have to. The V2 and later putchar have a similar approach but this time they contain a static buffer area and you just have to do a bit of initialization (possibly putchar was in V1 too, I don't know for sure).

Stdio of course has a completely different API. In stdio, you don't provide the data area; instead, stdio provides you an opaque reference (a 'FILE *') to the information and buffers it maintains internally. This is an interface that definitely wants some degree of dynamic memory allocation, for example for the actual buffers themselves, and in modern usage most of the FILE objects will be dynamically allocated too.

(The V7 stdio implementation had a fixed set of FILE structs and so would error out if you used too many of them. However, it did use malloc() for the buffer associated with them, in filbuf.c and flsbuf.c.)

You can certainly do dynamic memory allocation in assembly, but I think it's much more natural in C, and certainly the C standard library is more heavyweight than the relatively small and minimal assembly language stuff early Unix programs (written in assembly) seem to have required. So I think it makes a lot of sense that Unix started with a buffering approach where the caller supplies the buffer (and probably doesn't dynamically allocate it), then moved to one where the library does at least some allocation and supplies the buffer (and other data) itself.

Buffered IO in Unix before V7 introduced stdio

By: cks

I recently read Julia Evans' Why pipes sometimes get "stuck": buffering. Part of the reason is that almost every Unix program does some amount of buffering for what it prints (or writes) to standard output and standard error. For C programs, this buffering is built into the standard library, specifically into stdio, which includes familiar functions like printf(). Stdio is one of the many things that appeared first in Research Unix V7. This might leave you wondering if this sort of IO was buffered in earlier versions of Research Unix and if it was, how it was done.

The very earliest version of Research Unix is V1, and in V1 there is putc.3 (at that point entirely about assembly, since C was yet to come). This set of routines allows you to set up and then use a 'struct' to implement IO buffering for output. There is a similar set of buffered functions for input, in getw.3, and I believe the memory blocks the two sets of functions use are compatible with each other. The V1 manual pages note it as a bug that the buffer wasn't 512 bytes, but also notes that several programs would break if the size was changed; the buffer size will be increased to 512 bytes by V3.

In V2, I believe we still have putc and getw, but we see the first appearance of another approach, in putchr.s. This implements putchar(), which is used by printf() and which (from later evidence) uses an internal buffer (under some circumstances) that has to be explicitly flush()'d by programs. In V3, there's manual pages for putc.3 and getc.3 that are very similar to the V1 versions, which is why I expect these were there in V2 as well. In V4, we have manual pages for both putc.3 (plus getc.3) and putch[a]r.3, and there is also a getch[a]r.3 that's the input version of putchar(). Since we have a V4 manual page for putchar(), we can finally see the somewhat tangled way it works, rather than having to read the PDP-11 assembly. I don't have links to V5 manuals, but the V5 library source says that we still have both approaches to buffered IO.

(If you want to see how the putchar() approach was used, you can look at, for example, the V6 grep.c, which starts out with the 'fout = dup(1);' that the manual page suggests for buffered putchar() usage, and then periodically calls flush().)
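
As a sketch of the shape of this second approach in modern C, with stand-in names like xputchar() and xflush() rather than the real V6 routines, the buffer now lives inside the 'library' and the program just sets fout and remembers to flush:

/* A sketch of library-owned buffering in the V6 putchar() style.  The
   names here are stand-ins; the real routines were in the C library and
   printf() was built on top of putchar(). */
#include <unistd.h>

int fout = 1;               /* file descriptor the routines write to */
static char obuf[512];      /* static buffer inside the "library" */
static int  onused;

void xflush(void)
{
    if (onused > 0) {
        (void)write(fout, obuf, onused);
        onused = 0;
    }
}

void xputchar(int c)
{
    obuf[onused++] = (char)c;
    if (onused == (int)sizeof(obuf))
        xflush();
}

int main(void)
{
    fout = dup(1);          /* the step the V6 manual page suggests */
    xputchar('h');
    xputchar('i');
    xputchar('\n');
    xflush();               /* still the program's job to flush */
    return 0;
}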

In V6, a third approach was added, in /usr/source/iolib, although I don't know if any programs used it. Iolib has a global array of structs that are statically associated with a limited number of low-numbered file descriptors; an iolib function such as cflush() is passed a file descriptor and uses that to look up the corresponding struct. One innovation iolib implicitly adds is that its copen() effectively 'allocates' the struct for you, in contrast to putc() and getc(), where you supply the memory area and fopen()/fcreate() merely initialize it with the correct information.

Finally V7 introduces stdio and sorts all of this out, at the cost of some code changes. There's still getc() and putc(), but now they take a FILE *, instead of their own structure, and you get the FILE * from things like fopen() instead of supplying it yourself and having a stdio function initialize it. Putchar() (and getchar()) still exist but are now redone to work with stdio buffering instead of their own buffering, and 'flush()' has become fflush(), which takes an explicit FILE * argument instead of implicitly flushing putchar()'s buffer and generally isn't necessary any more. The V7 grep.c still uses printf(), but now it doesn't explicitly flush anything by calling fflush(); it just trusts in stdio.

Complications in supporting 'append to a file' in a NFS server

By: cks

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more that I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.
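
On a local filesystem, using this is as simple as opening with O_APPEND and writing; a minimal illustration (with error handling mostly omitted):

/* A minimal local-filesystem illustration of O_APPEND: each write() lands
   atomically at the current end of file, no matter who else is appending. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *line = "one whole log line\n";
    int fd = open("shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0)
        return 1;
    /* The kernel picks the offset (the current end of file) at write()
       time; we never seek and never say where the data should go. */
    (void)write(fd, line, strlen(line));
    close(fd);
    return 0;
}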

These basic semantics are already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations and that size may be (much) smaller than the user level write(). Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a NFS write operation; either we fail those 128 Mbyte writes, or we have the possibility that data from you and me will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be; servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. Suppose that two clients do append writes to the same file, first A and then B, the server discards both writes, and client B is the first one to go through the 'COMMIT, fail, resend' process; where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file sees inconsistent results: first it sees A's data then B's, and later it sees either B's data before A's or B's data without anything from A (not even a zero-filled gap in the file, the way you'd get with ordinary writes).

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. If A and B do append writes, C does a manual append write, all writes are lost before they're committed, B redoes, C redoes, and then A redoes, a natural implementation could easily wind up with B's data, a hole the size of A's data, C's data, and then A's data appended after C's.
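
For concreteness, a 'manual append write' of the kind client C is doing is roughly the following sketch; the gap between looking up the end of file and writing there is exactly where other clients' (replayed) writes can slip in:

/* A sketch of a manual append: look up the current end of file, then
   write at that offset.  Unlike O_APPEND, this is two separate steps,
   so another writer can get in between them. */
#include <sys/stat.h>
#include <unistd.h>

int manual_append(int fd, const char *buf, size_t len)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    /* Another client can append between the fstat() and the pwrite(). */
    return (int)pwrite(fd, buf, len, st.st_size);
}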

This also creates server side ordering dependencies for potentially discarding uncommitted asynchronous write data, ones that a NFS server can normally make independently. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.

(You could require all append writes to be synchronous, but then this would do terrible things to NFS server performance for potentially common use of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)

Losing NFS locks and the SunOS SIGLOST signal

By: cks

NFS is a network filesystem that famously also has a network locking protocol associated with it (or part of it, for NFSv4). This means that NFS has to consider the issue of the NFS client losing a lock that it thinks it holds. In NFS, clients losing locks normally happens as part of NFS(v3) lock recovery, triggered when a NFS server reboots. On server reboot, clients are told to re-acquire all of their locks, and this re-acquisition can explicitly fail (as well as going wrong in various ways that are one way to get stuck NFS locks). When a NFS client's kernel attempts to reclaim a lock and this attempt fails, it has a problem. Some process on the local machine thinks that it holds a (NFS) lock, but as far as the NFS server and other NFS clients are concerned, it doesn't.

Sun's original version of NFS dealt with this problem with a special signal, SIGLOST. When the NFS client's kernel detected that a NFS lock had been lost, it sent SIGLOST to whatever process held the lock. SIGLOST was a regular signal, so by default the process would exit abruptly; a process that wanted to do something special could register a signal handler for SIGLOST and then do whatever it could. SIGLOST appeared no later than SunOS 3.4 (cf) and still lives on today in Illumos, where you can find this discussed in uts/common/klm/nlm_client.c and uts/common/fs/nfs/nfs4_recovery.c (and it's also mentioned in fcntl(2)). The popularity of actually handling SIGLOST may be indicated by the fact that no program in the Illumos source tree seems to set a signal handler for it.
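
If a program did want to handle SIGLOST, about all it could safely do is the usual trick of setting a flag in the handler; a minimal sketch, guarded with #ifdef because SIGLOST only exists on some systems:

/* A minimal sketch of handling SIGLOST where it exists: set a flag and
   let the main code notice that some (unknown) lock has been lost. */
#include <signal.h>

static volatile sig_atomic_t lost_a_lock;

static void on_siglost(int sig)
{
    (void)sig;
    lost_a_lock = 1;    /* we can't tell which lock was lost, just that one was */
}

void install_siglost_handler(void)
{
#ifdef SIGLOST
    struct sigaction sa;

    sa.sa_handler = on_siglost;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGLOST, &sa, (struct sigaction *)0);
#endif
}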

Other versions of Unix mainly ignore the situation. The Linux kernel has a specific comment about this in fs/lockd/clntproc.c, which very briefly talks about the issue and picks ignoring it (apart from logging the kernel message "lockd: failed to reclaim lock for ..."). As far as I can tell from reading FreeBSD's sys/nlm/nlm_advlock.c, FreeBSD silently ignores any problems when it goes through the NFS client process of reclaiming locks.

(As far as I can see, NetBSD and OpenBSD don't support NFS locks on clients at all, rendering the issue moot. I don't know if POSIX locks fail on NFS mounted filesystems or if they work but create purely local locks on that particular NFS client, although I think it's the latter.)

On the surface this seems rather bad, and certainly worse than the Sun approach of SIGLOST. However, I'm not sure that SIGLOST is all that great either, because it has some problems. First, what you can do in a signal handler is very constrained; basically all that a SIGLOST handler can do is set a variable and hope that the rest of the code will check it before it does anything dangerous. Second, programs may hold multiple (NFS) locks and SIGLOST doesn't tell you which lock you lost; as far as I know, there's no way of telling. If your program gets a SIGLOST, all you can do is assume that you lost all of your locks. Third, file locking may quite reasonably be used inside libraries in a way that is hidden from callers by the library's API, but signals and handling signals is global to the entire program. If taking a file lock inside a library exposes the entire program to SIGLOST, you have a collection of problems (which ones depend on whether the program has its own file locks and whether or not it has installed a SIGLOST handler).

This collection of problems may go part of the way to explain why no Illumos programs actually set a SIGLOST handler and why other Unixes simply ignore the issue. A kernel that uses SIGLOST essentially means 'your program dies if it loses a lock', and it's not clear that this is better than 'your program optimistically continues', especially in an environment where a NFS client losing a NFS lock is rare (and letting the program continue is certainly simpler for the kernel).

The history of Unix's ioctl and signal about window sizes

By: cks

One of the somewhat obscure features of Unix is that the kernel has a specific interface to get (and set) the 'window size' of your terminal, and can also send a Unix signal to your process when that size changes. The official POSIX interface for the former is tcgetwinsize(), but in practice actual Unixes have a standard tty ioctl for this, TIOCGWINSZ (see eg Linux ioctl_tty(2) (also) or FreeBSD tty(4)). The signal is officially standardized by POSIX as SIGWINCH, which is the name it always has had. Due to a Fediverse conversation, I looked into the history of this today, and it turns out to be more interesting than I expected.

(The inclusion of these interfaces in POSIX turns out to be fairly recent.)
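
Before getting into the history, here's a minimal sketch of what using these interfaces looks like in C: read the size with the TIOCGWINSZ ioctl and note SIGWINCH arrivals (assuming the widely supported ioctl rather than the newer tcgetwinsize()):

/* Get the terminal window size and notice when SIGWINCH arrives. */
#include <sys/ioctl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t winch_seen;

static void on_winch(int sig)
{
    (void)sig;
    winch_seen = 1;     /* do the real redrawing work outside the handler */
}

int main(void)
{
    struct winsize ws;

    signal(SIGWINCH, on_winch);
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        printf("%d rows, %d columns\n", ws.ws_row, ws.ws_col);
    /* A full-screen program would loop, and on seeing winch_seen set it
       would redo the ioctl() to pick up the new size and repaint. */
    return 0;
}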

As far as I can tell, 4.2 BSD did not have either TIOCGWINSZ or SIGWINCH (based on its sigvec(2) and tty(4) manual pages). Both of these appear in the main BSD line in 4.3 BSD, where sigvec(2) has added SIGWINCH (as the first new signal along with some others) and tty(4) has TIOCGWINSZ. This timing makes a certain amount of sense in Unix history. At the time of 4.2 BSD's development and release, people were connecting to Unix systems using serial terminals, which had more or less fixed sizes that were covered by termcap's basic size information. By the time of 4.3 BSD in 1986, Unix workstations existed and with them, terminal windows that could have their size changed on the fly; a way of finding out (and changing) this size was an obvious need, along with a way for full-screen programs like vi to get notified if their terminal window was resized on the fly.

However, as far as I can tell 4.3 BSD itself did not originate SIGWINCH, although it may be the source of TIOCGWINSZ. The FreeBSD project has manual pages for a variety of Unixes, including 'Sun OS 0.4', which seems to be an extremely early release from early 1983. This release has a signal(2) with a SIGWINCH signal (using signal number 28, which is what 4.3 BSD will use for it), but no (documented) TIOCGWINSZ. However, it does have some programs that generate custom $TERMCAP values with the right current window sizes.

The Internet Archive has a variety of historical material from Sun Microsystems, including (some) documentation for both SunOS 2.0 and SunOS 3.0. This documentation makes it clear that the primary purpose of SIGWINCH was to tell graphical programs that their window (or one of them) had been changed, and they should repaint the window or otherwise refresh the contents (a program with multiple windows didn't get any indication of which window was damaged; the programming advice is to repaint them all). The SunOS 2.0 tgetent() termcap function will specifically update what it gives you with the current size of your window, but as far as I can tell there's no other documented support for getting window sizes; it's not mentioned in tty(4) or pty(4). Similar wording appears in the SunOS 3.0 Unix Interface Reference Manual.

(There are PDFs of some SunOS documentation online (eg), and up through SunOS 3.5 I can't find any mention of directly getting the 'window size'. In SunOS 4.0, we finally get a TIOCGWINSZ, documented in termio(4). However, I have access to SunOS 3.5 source, and it does have a TIOCGWINSZ ioctl, although that ioctl isn't documented. It's entirely likely that TIOCGWINSZ was added (well) before SunOS 3.5.)

According to this Git version of the original BSD development history, BSD itself added both SIGWINCH and TIOCGWINSZ at the end of 1984. The early SunOS had SIGWINCH and it may well have had TIOCGWINSZ as well, so it's possible that BSD got both from SunOS. It's also possible that early SunOS had a different (terminal) window size mechanism than TIOCGWINSZ, one more specific to their window system, and the UCB CSRG decided to create a more general mechanism that Sun then copied back by the time of SunOS 3.5 (possibly before the official release of 4.3 BSD, since I suspect everyone in the BSD world was talking to each other at that time).

PS: SunOS also appears to be the source of the mysteriously missing signal 29 in 4.3 BSD (mentioned in my entry on how old various Unix signals are). As described in the SunOS 3.4 sigvec() manual page, signal 29 is 'SIGLOST', "resource lost (see lockd(8C))". This appears to have been added at some point between the initial SunOS 3.0 release and SunOS 3.4, but I don't know exactly when.

Notes on the compatibility of crypted passwords across Unixes in late 2024

By: cks

For years now, all sorts of Unixes have been able to support better password 'encryption' schemes than the basic old crypt(3) salted-mutant-DES approach that Unix started with (these days it's usually called 'password hashing'). However, the support for specific alternate schemes varies from Unix to Unix, and has for many years. Back in 2010 I wrote some notes on the situation at the time; today I want to look at the situation again, since password hashing is on my mind right now.

The most useful resource for cross-Unix password hash compatibility is Wikipedia's comparison table. For Linux, support varies by distribution based on their choice of C library and what version of libxcrypt they use, and you can usually see a list in crypt(5), and pam_unix may not support using all of them for new passwords. For FreeBSD, their support is documented in crypt(3). In OpenBSD, this is documented in crypt(3) and crypt_newhash(3), although there isn't much to read since current OpenBSD only lists support for 'Blowfish', which for password hashing is also known as bcrypt. On Illumos, things are more or less documented in crypt(3), crypt.conf(5), and crypt_unix(7) and associated manual pages; the Illumos section 7 index provides one way to see what seems to be supported.

System administrators not infrequently wind up wanting cross-Unix compatibility of their local encrypted passwords. If you don't care about your shared passwords working on OpenBSD (or NetBSD), then the 'sha512' scheme is your best bet; it basically works everywhere these days. If you do need to include OpenBSD or NetBSD, you're stuck with bcrypt, but even then there may be problems because bcrypt is actually several schemes, as Wikipedia covers.
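
If you want to see what a given scheme looks like in practice, remember that with crypt(3) the scheme is selected by the salt's prefix; here's a sketch (where the header you need and whether you must link with -lcrypt varies from Unix to Unix, and the fixed salt is only for illustration rather than being properly random):

/* A sketch of generating a sha512 ('$6$') crypted password with crypt(3). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* A '$6$' salt prefix asks for the sha512 scheme on systems that
       support it. */
    char *hash = crypt("example password", "$6$9WkFZrYZ");

    if (hash != NULL)
        printf("%s\n", hash);
    return 0;
}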

Some recent Linux distributions seem to be switching to 'yescrypt' by default (including Debian, which means downstream distributions like Ubuntu have also switched). Yescrypt in Ubuntu is now old enough that it's probably safe to use in an all-Ubuntu environment, although your distance may vary if you have 18.04 or earlier systems. Yescrypt is not yet available in FreeBSD and may never be added to OpenBSD or NetBSD (my impression is that OpenBSD is not a fan of having lots of different password hashing algorithms and prefers to focus on one that they consider secure).

(Compared to my old entry, I no longer particularly care about the non-free Unixes, including macOS. Even Wikipedia doesn't bother trying to cover AIX. For our local situation, we may someday want to share passwords to FreeBSD machines, but we're very unlikely to care about sharing passwords to OpenBSD machines since we currently only use them in situations where having their own stand-alone passwords is a feature, not a bug.)

Doing basic policy based routing on FreeBSD with PF rules

By: cks

Suppose, not hypothetically, that you have a FreeBSD machine that has two interfaces and these two interfaces are reached through different firewalls. You would like to ping both of the interfaces from your monitoring server because both of them matter for the machine's proper operation, but to make this work you need replies to your pings to be routed out the right interface on the FreeBSD machine. This is broadly known as policy based routing and is often complicated to set up. Fortunately FreeBSD's version of PF supports a basic version of this, although it's not well explained in the FreeBSD pf.conf manual page.

To make our FreeBSD machine reply properly to our monitoring machine's ICMP pings, or in general to its traffic, we need a stateful 'pass' rule with a 'reply-to':

B_IF="emX"
B_IP="10.x.x.x"
B_GW="10.x.x.254"
B_SUBNET="10.x.x.0/24"

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet from ! $B_SUBNET to $B_IP \
  keep state

(Here $B_IP is the machine's IP on this second interface, and we also need the second interface, the gateway for the second interface's subnet, and the subnet itself.)

As I discovered, you must put the 'reply-to' where it is here, although as far as I can tell the FreeBSD pf.conf manual page will only tell you that if you read the full BNF. If you put it at the end the way you might read the text description, you will get only opaque syntax errors.

We must specifically exclude traffic from the subnet itself to us, because otherwise this rule will faithfully send replies to other machines on the same subnet off to the gateway, which either won't work well or won't work at all. You can restrict the PF rule more narrowly, for example 'from { IP1 IP2 IP3 }' if those are the only off-subnet IPs that are supposed to be talking to your secondary interface.

(You may also want to match only some ports here, unless you want to give all incoming traffic on that interface the ability to talk to everything on the machine. This may require several versions of this rule, basically sticking the 'reply-to ...' bit into every 'pass in quick on ...' rule you have for that interface.)

This PF rule only handles incoming connections (including implicit ones from ICMP and UDP traffic). If we want to be able to route our outgoing traffic over our secondary interface by selecting a source address when we do things, we need a second PF rule:

pass out quick \
  route-to ($B_IF $B_GW) \
  inet from $B_IP to ! $B_SUBNET \
  keep state

Again we must specifically exclude traffic to our local network, because otherwise it will go flying off to our gateway. As before, you can be more specific if you only want this machine to be able to connect to certain things using this gateway and firewall (eg 'to { IP1 IP2 SUBNET3/24 }', or you could use a port-based restriction).

(The PF rule can't be qualified with 'on $B_IF', because the situation where you need this rule is where the packet would not normally be going out that interface. Using 'on <the interface with your default route's gateway>' has some subtle differences in the semantics if you have more than two interfaces.)

Although you might innocently think otherwise, the second rule by itself isn't sufficient to make incoming connections to the second interface work correctly. If you want both incoming and outgoing connections to work, you need both rules. Possibly it would work if you matched incoming traffic on $B_IF without keeping state.

The history of inetd is more interesting than I expected

By: cks

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them. When inetd listens on a port, it can act in two different modes. In the simplest mode, it starts a separate copy of the configured program for every connection (much like the traditional HTTP CGI model), which is an easy way to implement small, low volume services but usually not good for bigger, higher volume ones. The second mode is more like modern 'socket activation'; when a connection comes in, inetd starts your program and passes it the master socket, leaving it to you to keep accepting and processing connections until you exit.

(In inetd terminology, the first mode is 'nowait' and the second is 'wait'; this describes whether inetd immediately resumes listening on the socket for connections or waits until the program exits.)
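
The two modes show up directly in how the service program itself is written. Here's a sketch of both shapes, assuming the usual inetd convention that the socket is handed to the program as file descriptor 0:

/* Sketches of the two inetd service shapes.  In 'nowait' mode, fd 0 is the
   already-connected socket and we just serve this one client.  In 'wait'
   mode, fd 0 is the listening socket and we accept() connections ourselves. */
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* 'nowait': inetd starts one copy of us per connection. */
void nowait_service(void)
{
    const char *msg = "hello from a nowait service\n";
    (void)write(0, msg, strlen(msg));
}

/* 'wait': inetd hands us the listening socket and stops watching it. */
void wait_service(void)
{
    for (;;) {
        int conn = accept(0, (struct sockaddr *)0, (socklen_t *)0);
        if (conn < 0)
            break;
        const char *msg = "hello from a wait service\n";
        (void)write(conn, msg, strlen(msg));
        close(conn);
        /* A real service would exit after being idle for a while,
           letting inetd resume listening on the socket. */
    }
}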

Inetd turns out to have a more interesting history than I expected, and it's a history that's entwined with daemonization, especially with how the BSD r* commands daemonize themselves in 4.2 BSD. If you'd asked me before I started writing this entry, I'd have said that inetd was present in 4.2 BSD and was being used for various low-importance services. This turns out to be false in both respects. As far as I can tell, inetd was introduced in 4.3 BSD, and when it was introduced it was immediately put to use for important system daemons like rlogind, telnetd, ftpd, and so on, which were surprisingly run in the first style (with a copy of the relevant program started for each connection). You can see this in the 4.3 BSD /etc/inetd.conf, which has the various TCP daemons and lists them as 'nowait'.

(There are still network programs that are run as stand-alone daemons, per the 4.3 BSD /etc/rc and the 4.3 BSD /etc/rc.local. If we don't count syslogd, the standard 4.3 BSD tally seems to be rwhod, lpd, named, and sendmail.)

While I described inetd as having two modes and this is the modern state, the 4.3 BSD inetd(8) manual page says that only the 'start a copy of the program every time' mode ('nowait') is to be used for TCP programs like rlogind. I took a quick read over the 4.3 BSD inetd.c and it doesn't seem to outright reject a TCP service set up with 'wait', and the code looks like it might actually work with that. However, there's the warning in the manual page and there's no inetd.conf entry for a TCP service that is 'wait', so you'd be on your own.

The corollary of this is that in 4.3 BSD, programs like rlogind don't have the daemonization code that they did in 4.2 BSD. Instead, the 4.3 BSD rlogind.c shows that it can only be run under inetd or some equivalent, as rlogind immediately aborts if its standard input isn't a socket (and it expects the socket to be connected to some other end, which is true for the 'nowait' inetd mode but not how things would be for the 'wait' mode).

This 4.3 BSD inetd model seems to have rapidly propagated into BSD-derived systems like SunOS and Ultrix. I found traces that relatively early on, both of them had inherited the 4.3 style non-daemonizing rlogind and associated programs, along with an inetd-based setup for them. This is especially interesting for SunOS, because it was initially derived from 4.2 BSD (I'm less sure of Ultrix's origins, although I suspect it too started out as 4.2 BSD derived).

PS: I haven't looked to see if the various BSDs ever changed this mode of operation for rlogind et al, or if they carried the 'per connection' inetd based model all through until each of them removed the r* commands entirely.

OpenBSD kernel messages about memory conflicts on x86 machines

By: cks

Suppose you boot up an OpenBSD machine that you think may be having problems, and as part of this boot you look at the kernel messages for the first time in a while (or perhaps ever), and when doing so you see messages that look like this:

3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000

Or maybe the messages are like this:

memory map conflict 0xe00fd000/0x1000
memory map conflict 0xfe000000/0x11000
[...]
3:0:0: mem address conflict 0xfffc0000/0x40000
3:0:1: mem address conflict 0xfffc0000/0x40000

This sounds alarming, but there's almost certainly no actual problem, and if you check logs you'll likely find that you've been getting messages like this for as long as you've had OpenBSD on the machine.

The short version is that both of these are reports from OpenBSD that it's finding conflicts in the memory map information it is getting from your BIOS. The messages that start with 'X:Y:Z' are about PCI(e) device memory specifically, while the 'memory map conflict' errors are about the general memory map the BIOS hands the system.

Generally, OpenBSD will report additional information immediately after about what the PCI(e) devices in question are. Here are the full kernel messages around the 'rom address conflict':

pci3 at ppb2 bus 3
3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000
bge0 at pci3 dev 0 function 0 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy0 at bge0 phy 1: BCM5720C 10/100/1000baseT PHY, rev. 0
bge1 at pci3 dev 0 function 1 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy1 at bge1 phy 2: BCM5720C 10/100/1000baseT PHY, rev. 0

Here these are two network ports on the same PCIe device (more or less), so it's not terribly surprising that the same ROM may be being reused for both. I believe the two messages mean that both ROMs (at the same address) are conflicting with another unmentioned allocation. I'm not sure how you find out what the original allocation and device is that they're both conflicting with.

The PCI related messages come from sys/dev/pci/pci.c and in current OpenBSD come in a number of variations, depending on what sort of PCI address space is detected as in conflict in pci_reserve_resources(). Right now, I see 'mem address conflict', 'io address conflict', the already mentioned 'rom address conflict', 'bridge io address conflict', 'bridge mem address conflict' (in several spots in the code), and 'bridge bus conflict'. Interested parties can read the source for more because this exhausts my knowledge on the subject.

The 'memory map conflict' message comes from a different place; for most people it will come from sys/arch/amd64/pci/pci_machdep.c, in pci_init_extents(). If I'm understanding the code correctly, this is creating an initial set of reserved physical address space that PCI devices should not be using. It registers each piece of bios_memmap, which according to comments in sys/arch/amd64/amd64/machdep.c is "the memory map as the bios has returned it to us". I believe that a memory map conflict at this point says that two pieces of the BIOS memory map overlap each other (or one is entirely contained in the other).

I'm not sure it's correct to describe these messages as harmless. However, it's likely that they've been there for as long as your system's BIOS has been setting up its general memory map and the PCI devices as it has been, and you'd likely see the same address conflicts with another system (although Linux doesn't seem to complain about it; I don't know about FreeBSD).

Daemonization in Unix programs is probably about restarting programs

By: cks

It's standard for Unix daemon programs to 'daemonize' themselves when they start, completely detaching from how they were run; this behavior is quite old and these days it's somewhat controversial and sometimes considered undesirable. At this point you might ask why programs even daemonize themselves in the first place, and while I don't know for sure, I do have an opinion. My belief is that daemonization is because of restarting daemon programs, not starting them at boot.

During system boot, programs don't need to daemonize in order to start properly. The general Unix boot time environment has long been able to detach programs into the background (although the V7 /etc/rc didn't bother to do this with /etc/update and /etc/cron, the 4.2BSD /etc/rc did do this for the new BSD network daemons). In general, programs started at boot time don't need to worry that they will be inheriting things like stray file descriptors or a controlling terminal. It's the job of the overall boot time environment to insure that they start in a clean environment, and if there's a problem there you should fix it centrally, not make it every program's job to deal with the failure of your init and boot sequence.

However, init is not a service manager (not historically), which meant that for a long time, starting or restarting daemons after boot was entirely in your hands with no assistance from the system. Even if you remembered to restart a program as 'daemon &' so that it was backgrounded, the newly started program could inherit all sorts of things from your login session. It might have some random current directory, it might have stray file descriptors that were inherited from your shell or login environment, its standard input, output, and error would be connected to your terminal, and it would have a controlling terminal, leaving it exposed to various bad things happening to it when, for example, you logged out (which often would deliver a SIGHUP to it).

This is the sort of thing that even very old daemonization code deals with, which is to say that it fixes. The 4.2BSD daemonization code closes (stray) file descriptors and removes any controlling terminal the process may have, in addition to detaching itself from your shell (in case you forgot or didn't use the '&' when starting it). It's also easy to see how people writing Unix daemons might drift into adding this sort of code to them as people restarted the daemons (by hand) and ran into the various problems (cf). In fact the 4.2BSD code for it is conditional on 'DEBUG' not being defined; presumably if you were debugging, say, rlogind, you'd build a version that didn't detach itself on you so you could easily run it under a debugger or whatever.

It's a bit of a pity that 4.2 BSD and its successors didn't create a general 'daemonize' program that did all of this for you and then told people to restart daemons with 'daemonize <program>' instead of '<program>'. But we got the Unix that we have, not the Unix that we'd like to have, and Unixes did eventually grow various forms of service management that tried to encapsulate all of the things required to restart daemons in one place.

(Even then, I'm not sure that old System V init systems would properly daemonize something that you restarted through '/etc/init.d/<whatever> restart', or if it was up to the program to do things like close extra file descriptors and get rid of any controlling terminal.)

PS: Much later, people did write tools for this, such as daemonize. It's surprisingly handy to have such a program lying around for when you want or need it.

Traditionally, init on Unix was not a service manager as such

By: cks

Init (the process) has historically had a number of roles but, perhaps surprisingly, being a 'service manager' (or a 'daemon manager') was not one of them in traditional init systems. In V7 Unix and continuing on into traditional 4.x BSD, init (sort of) started various daemons by running /etc/rc, but its only 'supervision' was of getty processes for the console and (other) serial lines. There was no supervision or management of daemons or services, even in the overall init system (stretching beyond PID 1, init itself). To restart a service, you killed its process and then re-ran it somehow; getting even the command line arguments right was up to you.

(It's conventional to say that init started daemons during boot, even though technically there are some intermediate processes involved since /etc/rc is a shell script.)

The System V init had a more general /etc/inittab that could in theory handle more than getty processes, but in practice it wasn't used for managing anything more than them. The System V init system as a whole did have a concept of managing daemons and services, in the form of its multi-file /etc/rc.d structure, but stopping and restarting services was handled outside of the PID 1 init itself. To stop a service you directly ran its init.d script with 'whatever stop', and the script used various approaches to find the processes and get them to stop. Similarly, (re)starting a daemon was done directly by its init.d script, without PID 1 being involved.

As a whole system the overall System V init system was a significant improvement on the more basic BSD approach, but it (still) didn't have init itself doing any service supervision. In fact there was nothing that actively did service supervision even in the System V model. I'm not sure what the first system to do active service supervision was, but it may have been daemontools. Extending the init process itself to do daemon supervision has a somewhat controversial history; there are Unix systems that don't do this through PID 1, although doing a good job of it has clearly become one of the major jobs of the init system as a whole.

That init itself didn't do service or daemon management is, in my view, connected to the history of (process) daemonization. But that's another entry.

(There's also my entry on how init (and the init system as a whole) wound up as Unix's daemon manager.)

(Unix) daemonization turns out to be quite old

By: cks

In the Unix context, 'daemonization' means a program that totally detaches itself from how it was started. It was once very common and popular, but with modern init systems it's often no longer considered to be all that good an idea. I have some views on the history here, but today I'm going to confine myself to a much smaller subject, which is that in Unix, daemonization goes back much further than I expected. Some form of daemonization dates to Research Unix V5 or earlier, and an almost complete version appears in network daemons in 4.2 BSD.

As far back as Research Unix V5 (from 1974), /etc/rc is starting /etc/update (which does a periodic sync()) without explicitly backgrounding it. This is the giveaway sign that 'update' itself forks and exits in the parent, the initial version of daemonization, and indeed that's what we find in update.s (it wasn't yet a C program). The V6 update is still in assembler, but now the V6 update.s is clearly not just forking but also closing file descriptors 0, 1, and 2.

In the V7 /etc/rc, the new /etc/cron is also started without being explicitly put into the background. The V7 update.c seems to be a straight translation into C, but the V7 cron.c has a more elaborate version of daemonization. V7 cron forks, chdir's to /, does some odd things with standard input, output, and error, ignores some signals, and then starts doing cron things. This is pretty close to what you'd do in modern daemonization.

The first 'network daemons' appeared around the time of 4.2 BSD. The 4.2BSD /etc/rc explicitly backgrounds all of the r* daemons when it starts them, which in theory means they could have skipped having any daemonization code. In practice, rlogind.c, rshd.c, rexecd.c, and rwhod.c all have essentially identical code to do daemonization. The rlogind.c version is:

#ifndef DEBUG
	if (fork())
		exit(0);
	for (f = 0; f < 10; f++)
		(void) close(f);
	(void) open("/", 0);
	(void) dup2(0, 1);
	(void) dup2(0, 2);
	{ int tt = open("/dev/tty", 2);
	  if (tt > 0) {
		ioctl(tt, TIOCNOTTY, 0);
		close(tt);
	  }
	}
#endif

This forks with the parent exiting (detaching the child from the process hierarchy), then the child closes any (low-numbered) file descriptors it may have inherited, sets up non-working standard input, output, and error, and detaches itself from any controlling terminal before starting to do rlogind's real work. This is pretty close to the modern version of daemonization.

(Today, the ioctl() stuff is done by calling setsid() and you'd probably want to close more than the first ten file descriptors, although that's still a non-trivial problem.)
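
For comparison, here's a rough modern sketch of the same daemonization steps; it arbitrarily closes descriptors 3 through 1023 and skips error checking:

/* A sketch of modern daemonization: detach from the process hierarchy,
   drop the controlling terminal via setsid(), and point the standard
   descriptors at /dev/null. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

void daemonize(void)
{
    int fd;

    if (fork() != 0)
        exit(0);              /* parent exits, child is inherited by init */
    setsid();                 /* new session, no controlling terminal */
    if (fork() != 0)
        exit(0);              /* and make sure we never reacquire one */
    for (fd = 3; fd < 1024; fd++)
        (void)close(fd);      /* close stray inherited descriptors */
    fd = open("/dev/null", O_RDWR);
    (void)dup2(fd, 0);
    (void)dup2(fd, 1);
    (void)dup2(fd, 2);
    if (fd > 2)
        close(fd);
    (void)chdir("/");
}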

Old (Unix) workstations and servers tended to boot in the same ways

By: cks

I somewhat recently read j. b. crawford's ipmi, where crawford talks in part about how old servers of the late 80s and 90s (Unix and otherwise) often had various management features like serial consoles. What makes something an old school 80s and 90s Unix server and why they died off is an interesting topic I have views on, but today I want to mention and cover a much smaller one, which is that this sort of early boot environment and low level management system was generally also found on Unix workstations.

By and large, the various companies making both Unix servers and Unix workstations, such as Sun, SGI, and DEC, all used the same boot time system firmware on both workstation models and server models (presumably partly because that was usually easier and cheaper). Since most workstations also had serial ports, the general consequence of this was that you could set up a 'workstation' with a serial console if you wanted to. Some companies even sold the same core hardware as either a server or workstation depending on what additional options you put in it (and with appropriate additional hardware you could convert an old server into a relatively powerful workstation).

(The line between 'workstation' and 'server' was especially fuzzy for SGI hardware, where high end systems could be physically big enough to be found in definite server-sized boxes. Whether you considered these 'servers with very expensive graphics boards' or 'big workstations' could be a matter of perspective and how they were used.)

As far as the firmware was concerned, generally what distinguished a 'server' that would talk to its serial port to control booting and so on from a 'workstation' that had a graphical console of some sort was the presence of (working) graphics hardware. If the firmware saw a graphics board and no PROM boot variables had been set, it would assume the machine was a workstation; if there was no graphics hardware, you were a server.

As a side note, back in those days 'server' models were not necessarily rack-mountable and weren't always designed with the 'must be in a machine room to not deafen you' level of fans that modern servers tend to be found with. The larger servers were physically large and could require special power (and generate enough noise that you didn't want them around you), but the smaller 'server' models could look just like a desktop workstation (at least until you counted up how many SCSI disks were cabled to them).

Sidebar: An example of repurposing older servers as workstations

At one point, I worked with an environment that used DEC's MIPS-based DECstations. DEC's 5000/2xx series were available either as a server, without any graphics hardware, or as a workstation, with graphics hardware. At one point we replaced some servers with better ones; I think they would have been 5000/200s being replaced with 5000/240s. At the time I was using a DECstation 3100 as my system administrator workstation, so I successfully proposed taking one of the old 5000/200s, adding the basic colour graphics module, and making it my new workstation. It was a very nice upgrade.

OpenBSD versus FreeBSD pf.conf syntax for address translation rules

By: cks

I mentioned recently that we're looking at FreeBSD as a potential replacement for OpenBSD for our PF-based firewalls (for the reasons, see that entry). One of the things that will determine how likely we are to try this is how similar the pf.conf configuration syntax and semantics are between OpenBSD pf.conf (which all of our current firewall rulesets are obviously written in) and FreeBSD pf.conf (which we'd have to move them to). I've only done preliminary exploration of this but the news has been relatively good so far.

I've already found one significant syntax (and to some extent semantics) difference between the two PF ruleset dialects, which is that OpenBSD does BINAT, redirection, and other such things by means of rule modifiers; you write a 'pass' or a 'match' rule and add 'binat-to', 'nat-to', 'rdr-to', and so on modifiers to it. In FreeBSD PF, this must be done as standalone translation rules that take effect before your filtering rules. In OpenBSD PF, strategically placed (ie early) 'match' BINAT, NAT, and RDR rules have much the same effect as FreeBSD translation rules, causing your later filtering rules to see the translated addresses; however, 'pass quick' rules with translation modifiers combine filtering and translation into one thing, and there's not quite a FreeBSD equivalent.

That sounds abstract, so let's look at a somewhat hypothetical OpenBSD RDR rule:

pass in quick on $INT_IF proto {udp tcp} \
     from any to <old-DNS-IP> port = 53 \
     rdr-to <new-DNS-IP>

Here we want to redirect traffic to our deprecated old DNS resolver IP to the new DNS IP, but only DNS traffic.

In FreeBSD PF, the straightforward way would be two rules:

rdr on $INT_IF proto {udp tcp} \
    from any to <old-DNS-IP> port = 53 \
    -> <new-DNS-IP> port 53

pass in quick on $INT_IF proto {udp tcp} \
     from any to <new-DNS-IP> port = 53

In practice we would most likely already have the 'pass in' rule, and also you can write 'rdr pass' to immediately pass things and skip the filtering rules. However, 'rdr pass' is potentially dangerous because it skips all filtering. Do you have a single machine that is just hammering your DNS server through this redirection and you want to cut it off? You can't add a useful 'block in quick' rule for it if you have a 'rdr pass', because the 'pass' portion takes effect immediately. There are ways to work around this but they're not quite as straightforward.

(Probably this alone would push us to not using 'rdr pass'; there's also the potential confusion of passing traffic in two different sections of the pf.conf ruleset.)

Fortunately we have very few non-'match' translation rules. Turning OpenBSD 'match ... <whatever>-to <ip>' pf.conf rules into the equivalent FreeBSD '<whatever> ...' rules seems relatively mechanical. We'd have to make sure that the IP addresses our filtering rules saw continued to be the internal ones, but I think this would work out naturally; our firewalls that do NAT and BINAT translation do it on their external interfaces, and we usually filter with 'pass in' rules.

(There may be more subtle semantic differences between OpenBSD and FreeBSD pf rules. A careful side by side reading of the two pf.conf manual pages might turn these up, but I'm not sure I can read the two manual pages that carefully.)

How to talk to a local IPMI under FreeBSD 14

By: cks

Much like Linux and OpenBSD, FreeBSD is able to talk to a local IPMI using the ipmi kernel driver (or device, if you prefer). This is imprecise although widely understood terminology; in more precise terms, FreeBSD can talk to a machine's BMC (Baseboard Management Controller) that implements the IPMI specification in various ways which you seem to normally not need to care about (for information on 'KCS' and 'SMIC', see the "System Interfaces" section of OpenBSD's ipmi(4)).

Unlike in OpenBSD (covered earlier), the stock FreeBSD 14 kernel appears to report no messages if your machine has an IPMI interface but the driver hasn't been enabled in the kernel. To see if your machine has an IPMI interface that FreeBSD can talk to, you can temporarily load the ipmi module with 'kldload ipmi'. If this succeeds, you will see kernel messages that might look like this:

ipmi0: <IPMI System Interface> port 0xca8,0xcac irq 10 on acpi0
ipmi0: KCS mode found at io 0xca8 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 7.10, version 2.0, device support mask 0xdf
ipmi0: Number of channels 2
ipmi0: Attached watchdog
ipmi0: Establishing power cycle handler

(On the one Dell server I've tried this on so far, the ipmi(4) driver found the IPMI without any special parameters.)

At this point you should have a /dev/ipmi0 device and you can 'pkg install ipmitool' and talk to your IPMI. To make this permanent, you edit /boot/loader.conf to load the driver on boot, by adding:

ipmi_load="YES"

While you're there, you may also want to load the coretemp(4) module or perhaps amdtemp(4). After updating loader.conf, you need to reboot to make it take full effect, although since you can kldload everything before then I don't think there's a rush.

In FreeBSD, IPMI sensor information isn't visible in sysctl (although information from coretemp or amdtemp is). You'll need ipmitool or another suitable program to query it. You can also use ipmitool to configure the basics of the IPMI's networking and set the IPMI administrator's password to something you know, as opposed to whatever unique value the machine's vendor set it to, which you may or may not have convenient access to.

(As far as I can tell, ipmitool works the same on FreeBSD as it does on Linux, so if you have existing scripts and so on that use it for collecting data on your Linux hosts (as we do), they will probably be easy to make work on any FreeBSD machines you add.)

What a POSIX shell has to do with $PWD

By: cks

It's reasonably well known among Unix people that '$PWD' is a shell variable with the name of the current working directory. Well, sort of, because sometimes $PWD isn't right or isn't even set (all of this is part of the broader subject of shells and the current directory). Until recently, I hadn't looked up what POSIX has to say about $PWD, and when I did I was surprised, partly because I didn't expect POSIX to say anything about it.

(Until I looked it up, I had the vague impression that $PWD was a common but non-POSIX Bourne shell thing.)

What POSIX has to say is in 2.5.3 Shell Variables part of the overall description of the POSIX shell. To put my own summary on what POSIX says, the shell creates and maintains $PWD in basically all circumstances, and is obliged to update $PWD when it does a 'cd', even in shell scripts. The only case where $PWD's value isn't specified in the shell environment is if you don't have access permissions for the current directory for some reason.

(As far as I can tell, the complicated POSIX wording boils down to that if you start the shell with a correct $PWD that uses symbolic links (eg '/u/cks' instead of '/h/281/cks'), the shell is allowed to update that to the post-symlink 'physical' version but doesn't have to. See how 'pwd -P' is described.)

However, $PWD is not necessarily correct when you're running a program written in C, because POSIX chdir() doesn't seem to be required to update $PWD for you (although it's a bit confusing, since Environment Variables seems to imply that POSIX utilities are entitled to believe $PWD is correct if it's in the environment). In fact I don't think that the POSIX shell is always obliged to export $PWD into the environment, which is why I called it a shell variable instead of an environment variable. I believe most actual Bourne shell implementations do always export $PWD, even if they're started in an environment with it undefined (where I believe POSIX allows it to not be exported).

(Bash, Dash, and FreeBSD's Almquist shell all allow $PWD to be unexported, although keeping it that way may be tricky in Dash and FreeBSD sh, which appear to re-export it any time you do a 'cd'.)
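
A quick way to see the C side of this is to compare getcwd() with the inherited $PWD after a chdir(); a small sketch:

/* Demonstrate that chdir() updates the kernel's idea of the current
   directory (getcwd) but not the inherited $PWD environment variable. */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[PATH_MAX];

    if (chdir("/tmp") != 0)
        return 1;
    if (getcwd(buf, sizeof(buf)) != NULL)
        printf("getcwd: %s\n", buf);
    printf("$PWD:   %s\n", getenv("PWD") ? getenv("PWD") : "(unset)");
    return 0;
}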

The upshot of this is that in a modern environment where /bin/sh is a POSIX shell, $PWD will almost always be correct. It pretty much has to be correct in your POSIX shell sessions and in your POSIX shell scripts. POSIX-compatible shells like Bash will keep it correct even in their more expansive modes, and non-Bourne shells have a strong motive to go along with this because people expect $PWD to work and be correct.

(However, this leaves me mystified about what the problem was in my specific circumstance this time around, since I'd expect $PWD to have gotten set correctly when my /bin/sh based script used 'cd'.)

FreeBSD's 'root on ZFS' default appeals to me for an odd reason

By: cks

For reasons beyond the scope of this entry, we're probably going to take a look at FreeBSD as an alternative to OpenBSD for some of our uses of the latter. This got me to grab a 14.1 ISO image and try a quick install on a spare virtual machine (I keep spare VMs around for just such occasions). This caused me to discover that modern FreeBSD defaults to using ZFS for its root filesystem (although I didn't do this on my VM test install, because my VM has less than the recommended RAM for ZFS). FreeBSD using ZFS for its root filesystem makes me happy, but probably not quite for the reasons you're expecting.

Certainly, I like ZFS in general and I think it has a bunch of nice properties, even for a root filesystem. You get checksums for reliability, compression, the ability to easily add sub-filesystems if you want to limit the amount of space something can use (we have usage cases for this, but that's another entry), and so on. But these aren't what make me happy for it as a root filesystem on FreeBSD. The really nice thing about root on ZFS on FreeBSD for me is the easy mirroring.

A traditional thing with all of our non-Linux installs is that they don't have mirrored system disks. We've made some stabs at it in the past but at the time we found it complex and not clearly compelling, perhaps partly because we didn't have experience with their software mirroring systems. Well, we have a lot of experience with mirroring ZFS vdevs and it's trivial to set ZFS mirroring up after the fact or to revert back from a mirrored setup to a single-disk setup. So while we might not bother going through the hassles of learning a FreeBSD-specific software mirroring system, we're pretty likely to use ZFS mirroring on any production FreeBSD machines. And that will be a good thing for our FreeBSD machines in general.

(Using ZFS for the root filesystem also eliminates any chance that the server will ever stall in boot asking us to approve a fsck, something that has happened to our OpenBSD machines under rare circumstances.)

I'm also personally pleased to see a fully supported 'root on ZFS' in anything. My impression is that FreeBSD is reasonably well used, so their choice of ZFS for the default root filesystem setup may even be exposing a reasonable number of people to (Open)ZFS and its collection of nice things.

PS: our OpenBSD machines come in pairs and we've had very good luck with their root drives, or we might have looked into the OpenBSD bioctl(8) software mirroring system and how you install to a mirror.

Host names in syslog messages may not be quite what you expect

By: cks

Over on the Fediverse, I said something:

It has been '0' days since I (re)discovered that the claimed hostname in syslog messages can be utter junk, and you may be going to live a fun life if you use it for anything much.

Suppose that on your central syslog server you see a syslog line of the form:

[...] alkyone exim[864974]: no host name found for IP address 115.187.17.119

You might reasonably assume that the host name 'alkyone' comes from the central syslog daemon knowing the host name of the host that sent the syslog message to it. Unfortunately, this is not what actually happens. As covered in places like RFC 5424 section 6.2.4 (or RFC 3164 section 4.1.2 for the nominal 'BSD' syslog format, which seems to not actually be what BSD used), syslog messages carry an embedded hostname in them. This hostname is generated by the machine that originated the message, and the machine can put anything it wants to in there. And generally, your syslog daemon (and the log format it's using) will write this hostname into the logs and otherwise use it if you ask for the message's 'hostname'.

(Rsyslog and probably other syslog daemons can create per-host message files on your central syslog server, which can cause you to want a hostname for each message.)

The intent of this embedded hostname is noble; it's there so you can have syslog relays (which may happen accidentally), where the originating system sends its messages to host A and host A relays them to host B, and B records the hostname as the originating system, not host A. Unfortunately, in practice all sorts of things can go wrong, including a quite fun one.

The first thing that can go wrong is systems that have a different view of their hostname than you do. On Unix systems, the normal syslog hostname traditionally comes from whatever the general host name is set to, which isn't necessarily a fully qualified domain name and doesn't necessarily match what its IP address is (you can change the IP address of a system but forget to update its hostname). Some embedded systems will have an internally set host name instead of trying to deduce it from DNS lookups of whatever IP they have, which can cause them to use syslog hostnames like 'idrac-<asset-tag>' (for the BMC of a Dell server with that particular asset tag).

The most fun case is an interaction with a long-standing syslog feature (that I think is often disabled today):

<host> /bsd: arp: attempt to overwrite entry for [...]
last message repeated 2 times

You'll notice that the second message doesn't say '<host> last message repeated ...'. This is achieved with the extremely brute force method of setting the hostname in the message to 'last'. If your central syslog server then attempts to set up per-host syslog logs, you will wind up with a 'last' host (with extremely uninteresting logs).

Also, if people send not quite random garbage to your syslog server's listening network ports (perhaps because they are a vulnerability scanner or nmap or the like), your syslog daemon and your logs can wind up seeing all sorts of weird junk as the nominal hostname. The syslog message format is deliberately relatively liberal and syslog servers have traditionally been even more liberal about interpreting things that arrived on it, on the sensible grounds that it's usually better to record everything you get just in case.

Sidebar: Hostnames in syslog messages appear to be new-ish

In 4.2 BSD, the syslog daemon was part of the sendmail source code, and sendmail/aux/syslog.c doesn't get the hostname from the message but instead from the IP address it came from. I think this continues right through 4.4 BSD if I'm reading the code right. RFC 3164 dates from 2001, so presumably people augmented the syslog format some time before then.

Interestingly, RFC 3164 specifically says that the host name in the message must not include the domain name. I suspect that even at the time this was widely ignored in practice for good operational reasons.

The uncertain possible futures of Unix graphical desktops

By: cks

Once upon a time, the future of Unix desktops looked fairly straightforward. Everyone ran on X, so the major threat to cross-Unix portability in major desktops was the use of Linux-only APIs, which increasingly meant D-Bus and systemd related things. Unix desktops that were less attached to tight integration with the Linux environment would probably stay easily available on FreeBSD, OpenBSD, and so on.

What happened to this nice simple vision was Wayland becoming the future of (Linux) graphics. Linux is the primary target of KDE and especially Gnome, so Wayland being the future on Linux has gotten developers for Gnome to start moving toward a Wayland-only vision. Wayland is unapologetically not cross-platform the way X was, which leaves other Unixes with a problem and creates a number of possible futures for Unix desktops.

In one future, other Unixes imitate Linux, implementing enough APIs to run Wayland and the other Linux things that in practice it depends on, and as a result they can probably continue to provide the big Linux-focused desktop environments like Gnome. I believe that FreeBSD is working on this approach, although I don't know if Gnome on Wayland on FreeBSD works yet. This allows the other Unix to mostly look like Linux, desktop-wise. As an additional benefit, it allows the other Unix to also use other, more minimal Wayland compositors (ie, window managers) that people may like, such as Sway (the one everyone mentions).

In another future, other Unixes don't attempt to chase Linux by implementing APIs to get Wayland and Gnome and so on to run, and instead stick with X. As desktops, major toolkits, and applications drop support for X or break working on it through lack of use and lack of caring, these Unixes are likely to increasingly be left with old-fashioned X environments that are a lot more 'window manager' than they are 'desktop'. There are people, me included, who would be more or less happy with this state of affairs (in my case, as long as Firefox and a few other applications keep working). I suspect that this is the path that OpenBSD will stick with, and my guess is that anyone using OpenBSD for their desktop or laptop environment will be happy with this.

An unpleasant variant of this future comes about if Firefox and other applications are aggressive about dropping support for X. This would leave X-only Unixes as a backwater, stuck with (at best) old versions of important tools such as web browsers. There are some people who would still be happy with this, but probably not many.

Broadly, I think there is going to be a split between what you could call the Linux desktop (Wayland based with a major desktop environment such as Gnome, even if it's on FreeBSD instead of Linux), perhaps the Wayland desktop (Wayland based with a compositor like Sway instead of a full blown desktop environment), and an increasingly limited Unix desktop that over time will find itself having to move from being a desktop environment to being a window manager environment (as the desktop environments stop working well on X).

PS: One big question about the future of the Unix desktop is how many desktop environments will get good Wayland support and then abandon X. Right now, there are a fair number of desktop environments that have little or no Wayland support and a reasonable user base. The existence and popularity of these environments helps drive demand for continued X support in toolkits and so on. Of course, major Linux distributions may throw X-only desktops overboard someday, regardless of usage.

Seeing and matching pf rules when using tcpdump on OpenBSD's pflog interface

By: cks

Last year I wrote about some special tcpdump filtering options for OpenBSD's pflog interface, including the 'rnr <number>' option for matching and showing only packets blocked by a specific rule. You might want to do this if, for example, you temporarily throw brute force attacker IPs into a table and want to take them out soon after they stop hitting you.

Assuming that you're watching live, the way you do this is to find the rule number with 'pfctl -vv -s rules | grep @ | grep <term>' for a suitable term, such as the table name (or look through the whole thing with a pager), and then run 'tcpdump -n -i pflog0 "rnr <number>"'. However, looking up rule numbers is annoying and a clever person might remember that the OpenBSD tcpdump can print the pf rule information for pflog packets, through the '-e' option (for pflog, this is considered the link-level header). So you might think that the easy way to achieve what you want is 'tcpdump -n -i pflog0 | grep <term>', which is to say you're dumping all pflog packets and then picking out the ones that matched your rule.

Unfortunately, the pflog 'link-level header' doesn't actually give you the text of the rule. What it has is the rule number, whether the packet was blocked or not (you can log without blocking), which direction the block was (in or out), and what interface (plus that the packet was blocked because it matched a rule):

21:20:43.525222 rule 231/(match) block in on ix1: [...]

Quite sensibly, you don't get the actual contents of the rule that blocked the packet, so you can't grep for it and my clever idea was not so clever. If you read all the way to the Link Level Headers section of the OpenBSD tcpdump manual page, it explicitly tells you this:

On the packet filter logging interface pflog(4), logging reason (rule match, bad-offset, fragment, bad-timestamp, short, normalize, memory), action taken (pass/block), direction (in/out) and interface information are printed out for each packet.

So don't be like me and waste your time with the 'grep the tcpdump output' approach. It isn't going to work and you're going to have to do it the hard way.
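
To spell out the hard way as a concrete sketch (using the rule number from the example above and a made up table name of 'brutes'):

pfctl -vv -s rules | grep @ | grep brutes
tcpdump -n -i pflog0 "rnr 231"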

As far as I know there's no way to attach some sort of marker to rules in your pf.conf that will make them easy to pick out in pflog(4) packets. Based on the pflog(4) manual page, the packet format just doesn't have room for that. If you absolutely need to know this sort of thing for sure, even over rule changes, I think your only option is to log the packets to a non-default pflog(4) interface and then arrange for something to receive and store stuff from that interface.

The NFS server 'subtree' export problem

By: cks

NFS servers have a variety of interesting problems that ultimately exist because NFS was first defined a long time ago in a world where (Unix) filesystems were simpler and security was perhaps less of a concern. One of their classical problems is that how NFS clients identify files is surprisingly limited. Another problem is what I will call the 'subtree export' issue.

Suppose that you have a filesystem called '/special', and this filesystem contains directory trees '/special/a' and '/special/b'. The first directory tree is exported only to one NFS client, A, and the second directory tree is exported only to another NFS client, B. Now suppose that client A presents the NFS server with a request to read some file, which it identifies by an NFS filehandle. How does the NFS server know that this file is located under /special/a, the only part of the /special filesystem that client A is supposed to have access to? This is the subtree export issue.

(This problem comes up because NFS clients can forge their own NFS filehandles or copy NFS filehandles from other clients, and NFS filehandles generally don't contain the path to the object being accessed. Normally all the NFS server can recover from an NFS filehandle is the filesystem and some non-hierarchical identifier for the object, such as its inode number.)

The very early NFS servers ignored the entire problem because they started out with no NFS filehandle access checks at all. Even when NFS servers started applying some access checks to NFS filehandles, they generally ignored the subtree issue because they had no way to deal with it. If you exported a subtree of a filesystem to some client, in practice the client could access the entire filesystem if it made up appropriate valid NFS filehandles. The manual pages sometimes warned you about this.

(One modern version of this warning appears in the FreeBSD exports manual page. The Illumos share_nfs manual page doesn't seem to discuss this subtree issue, so I don't know how Illumos handles it.)

Some modern NFS servers try to do better, and in particular the Linux kernel NFS server does. Linux does this by trying to work out the full path within the filesystem of everything you access, leveraging the kernel directory entry caches and perhaps filesystem specific information about parent directories. On Linux, and in general on any system where the NFS server attempts to do this, checking for this subtree export issue adds some overhead to NFS operations and may possibly reject some valid NFS operations because the NFS server can't be sure that the request is within an allowed subtree. Because of this, Linux's exports(5) NFS options support a 'no_subtree_check' option that disables this check and in fact any security check that requires working out the parent of something.
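
As an illustration (the paths and client names here are the made up ones from the example above, not anything real), the Linux /etc/exports entries for this might look like:

/special/a    clienta(rw,subtree_check)
/special/b    clientb(rw,no_subtree_check)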

Generally, the subtree export issue is only a problem if you think NFS clients can be compromised to present NFS filehandles that you didn't give them. If you only export a subtree of a filesystem to a NFS client, a properly operating NFS environment will deny the client's request to mount anything else in the filesystem, which will stop the client from getting NFS filehandles for anything outside its allowed subtree.

(This still leaves you with the corner case of moving a file or a directory tree from inside the client's allowed subtree to outside of it. If the NFS client is currently using the file or directory, it will still likely be able to keep accessing it until it stops and forgets the file's NFS filehandle.)

Obviously, life is simpler if you only export entire filesystems to NFS clients and don't try to restrict them to subtrees. If a NFS client only wants a subtree, it can do that itself.

Maybe understanding uname(1)'s platform and machine fields

By: cks

When I wrote about some history and limitations of uname(1) fields, I was puzzled by the differences between 'uname -m', 'uname -i', and 'uname -p' in the two variants of uname that have all three, Linux uname and Illumos uname. Illumos is descended from (Open)Solaris, and although I can't find manual pages for old Solaris versions of uname online, I suspect that Solaris is probably the origin of both '-i' and '-p' (the '-m' option comes from the original System V version that also led to POSIX uname). The Illumos manual page doesn't explain the difference, but it does refer to sysinfo(2), which has some quite helpful commentary if you read various bits and pieces. So here is my best guess at the original meanings of the three different options in Solaris.

Going from most general to most specific, it seems to be that on Solaris:

  • -p tells you the broad processor ISA or architecture, such as 'sparc', 'i386', or 'amd64' (or 'x86_64' if you like that label). This is what Illumos sysinfo(2) calls SI_ARCHITECTURE.

  • -m theoretically tells you a more specific processor and machine type. For SPARC specifically, you can get an idea of Solaris's list of these in the Debian Wiki SunSparc page (and also Wikipedia's Sun-4 architecture list).

  • -i theoretically tells you about the specific desktop or server platform you're on, potentially down to a relatively narrow model family; the Illumos sysinfo(2) section on SI_PLATFORM gives 'SUNW,Sun-Fire-T200' as one example.
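
To put these together with some illustrative output (I don't have a Solaris SPARC machine to check against, and the 'sun4v' value is from memory, so treat the details with a grain of salt), an old Solaris SPARC server might report something like:

$ uname -p
sparc
$ uname -m
sun4v
$ uname -i
SUNW,Sun-Fire-T200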

Of course, 'uname -m' came first, and '-p' and '-i' were added later. I believe that Solaris started out being relatively specific in 'uname -m', going along with the System V and POSIX definition of it as the 'hardware type' or machine type. Once Solaris had done that, it couldn't change the output of 'uname -m' even as people started to want a broader processor ISA label, hence '-p' being the more generic version despite -m being the more portable option.

(GNU Coreutils started with only -m, added '-p' in 1996, and added '-i' in 2001. The implementation of both -p and -i initially only used sysinfo() to obtain the information.)

On x86 hardware, it seems that Unixes chose to interpret 'uname -m' generically, instead of trying to be specific about things like processor families. Especially in the early days of x86 Unix, the information needed for 'uname -i' probably just wasn't available, and also wasn't necessarily particularly useful. The Illumos sysinfo(2) section on SI_PLATFORM suggests that it just returns 'i86pc' on all conventional x86 platforms.

(GNU Coreutils theoretically supports '-i' and '-p' on Linux, but in practice both will normally report "unknown".)

Of course, once x86 Unixes started reporting generic things for 'uname -m', they were stuck with it due to backward compatibility with build scripts and other things that people had based on the existing output (and it's not clear what more specific x86 information would be useful for 'uname -m', although for 32-bit x86, people have done variant names of 'i586' and 'i686'). While there was some reason to support 'uname -p' for compatibility, it is probably not surprising that on both FreeBSD and OpenBSD, the output of 'uname -m' is mostly the same as that of 'uname -p'.

(OpenBSD draws a distinction between the kernel architecture and the application architecture, per the OpenBSD uname(1). FreeBSD draws a distinction between the 'hardware platform' (uname -m) and the 'processor architecture' (uname -p), per the FreeBSD uname(1), but on the FreeBSD x86 machine I have access to, they produce the same output. However, see the FreeBSD arch(7) manual page and this FreeBSD bug from 2017.)

PS: In a comment on my first uname entry, Phil Pennock showed that 'uname -m' and 'uname -p' differed in some macOS environments. I suspect the difference is following the FreeBSD model but I'm not sure.

Sidebar: The details of uname -p and uname -i on Linux

Linux distributions normally use the GNU Coreutils version of 'uname'. In its normal state, Coreutils' uname.c gets this information from either sysinfo() or sysctl(), if either supports obtaining it (see the end of src/uname.c). On Linux, the normal C library sysinfo() and sysctl() don't support this, so normally 'uname -p' and 'uname -i' will both report 'unknown', since they're left with no code that can determine this information.

The Ubuntu (and perhaps Debian) package for coreutils carries a patch, originally from Fedora (but no longer used there), that uses the machine information that 'uname -m' would report to generate the information for -p and -i. The raw machine information can be modified a bit for both -i and -p. For -i, all 'i?86' results are turned into 'i386', and for -p, if the 'uname -m' result is 'i686', uname checks /proc/cpuinfo to see if you have an AMD and reports 'athlon' instead if you do (although this code may have decayed since it was written).

Some history and limitations of uname(1) fields

By: cks

Uname(1) is a command that hypothetically prints some potentially useful information about your system. In practice what information it prints, how useful that information is, and what command line options it supports varies widely between both different sorts of Unixes and between different versions of Linux (due to using different versions of GNU Coreutils, and different patches for it). I was asked recently if this situation ever made any sense and the general answer is 'maybe'.

In POSIX, uname(1) is more or less a program version of the uname() function. It supports only '-m', '-n', '-r', '-s', and '-v', and as a result of all of these arguments being required by POSIX, they are widely supported by the various versions of uname that are out there in various Unixes. All other arguments are non-standard and were added well after uname(1) initially came into being, which is one reason they are so divergent in presence and meaning; there is no enhanced ancestral 'uname' command for things to descend from.

The uname command itself comes from the System V side of Unix; it goes back at least as far as System III, where the System III uname.c accepts -n, -r, -s, and -v with the modern meanings. System III gets the information from the kernel, in a utssys() system call. I believe that System V added the 'machine' information ('-m'), which then was copied straight into POSIX. On the BSD side, a uname command first appeared in 4.4 BSD, and the 4.4 BSD uname(1) manual page says that it also had the POSIX arguments, including -m. The actual implementation didn't use a uname() system call but instead extracted the information with sysctl() calls.

The modern versions of uname that I can find manual pages for are rather divergent; contrast Linux uname(1) (also), FreeBSD uname(1), OpenBSD uname(1), NetBSD uname(1), and Illumos uname(1) (manual pages for other Unixes are left as an exercise). For instance, take the '-i' argument, supported on Linux and Illumos to print a theoretical hardware platform and on FreeBSD to print the 'kernel ident'. On top of that difference, on Linux distributions that use an unpatched build of Coreutils, I believe that 'uname -i' and 'uname -p' will both report 'unknown'.

(Based on how everyone has the -p argument for processor type, I suspect that it was one of the earliest additions on top of the POSIX uname options. How 'uname -m' differs from 'uname -p' in practice is something I don't know, but apparently people felt a need to distinguish the two at some point. Some Internet searches suggest that on Unixes such as Solaris, the processor type might be 'sparc' while the machine hardware name might be more specific, like 'sun4m'.)

On Linux and several other Unixes, much of the core information for uname comes from the kernel, which means that options like 'uname -r' and 'uname -v' have traditionally reported about the kernel version and build string, not anything to do with the general release of Linux (or Unix). On FreeBSD, the kernel release is usually fairly tightly connected to the userland version (although FreeBSD uname can tell you about the latter too), but on Linux it is not, and the Linux uname has no option to report a 'distribution' name or version.

In general, I suspect that the only useful fields you can count on from uname(1) are '-n' (some version of the hostname), '-s' (the broad operating system), and perhaps '-m', although you probably want to be wary about that. One of the cautions with 'uname -m' is that there is no agreement between Unixes about what the same hardware platform should be called; for example, OpenBSD uses 'amd64' for 64-bit x86 while Linux uses 'x86_64'. Illumos recommends using 'uname -p' instead of 'uname -m'.
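
One practical consequence of this lack of agreement is that portable shell scripts often normalize 'uname -m' output themselves. A minimal sketch of this (the canonical names you map to are an arbitrary choice):

case "$(uname -m)" in
    x86_64|amd64)  arch=amd64 ;;
    i?86)          arch=i386 ;;
    *)             arch="$(uname -m)" ;;
esac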

(This entry's topic was suggested to me by Hunter Matthews, although I suspect I haven't answered their questions about historical uname values.)

Sidebar: An options comparison for non-POSIX options

The common additional options are:

  • -i: semi-supported on Linux, supported on Illumos, and means something different on FreeBSD (it's the kernel identifier instead of the hardware platform).
  • -o: supported on Linux, where it is often different from 'uname -s', FreeBSD, where it is explicitly the same as 'uname -s', and Illumos, where I don't know how it relates to 'uname -s'.
  • -p: semi-supported on Linux and fully supported on FreeBSD, OpenBSD, NetBSD, and Illumos. Illumos specifically suggests using 'uname -p' instead of 'uname -m', which will generally make you sad on Linux.

FreeBSD has -b, -K, and -U as additional FreeBSD specific arguments.

How 'uname -i', 'uname -p', and 'uname -m' differ on Illumos is not something I know; they all report something about the hardware, but the Illumos uname manpage mostly doesn't illuminate the difference. It's possible that this is more or less covered in sysinfo(2).

(The moral is that you can't predict the result of these options without running uname on an applicable system, or at least having a lot of OS-specific knowledge.)

Turning off the X server's CapsLock modifier

By: cks

In the process of upgrading my office desktop to Fedora 40, I wound up needing to turn off the X server's CapsLock modifier. For people with a normal keyboard setup, this is simple; to turn off the CapsLock modifier, you tap the CapsLock key. However, I turn CapsLock into another Ctrl key (and then I make heavy use of tapping CapsLock to start dmenu (also)), which leaves the regular CapsLock functionality unavailable to me under normal circumstances. Since I don't have a CapsLock key, you might wonder how the CapsLock modifier got turned on in the first place.

The answer is that sometimes I have a CapsLock key after all. I turn CapsLock into Ctrl with setxkbmap settings, and apparently some Fedora packages clear these keyboard mapping settings when they're updated. Since upgrading to a new Fedora release updates all of these packages, my 'Ctrl' key resets to CapsLock during the process and I don't necessarily notice immediately. Because I expect my input settings to get cleared, I have a script to re-establish them, which I run when I notice my special Ctrl key handling isn't working. What happened this time around was that I noticed that my keyboard settings had been cleared when CapsLock didn't work as Ctrl, then reflexively invoked the script. Of course at this point I had tapped CapsLock, which turned on the CapsLock modifier, and then when the script reset CapsLock to be Ctrl, I no longer had a key that I could use to turn CapsLock off.
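
(For illustration, and not exactly my settings, the simplest stock XKB way to turn CapsLock into another Ctrl key is something like 'setxkbmap -option ctrl:nocaps'.)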

(Actually dealing with this situation was made more complicated by how I could now only type upper case letters in shells, browser windows, and so on. Fortunately I had a phone to do Internet searches on, and I could switch to another Linux virtual console, which had CapsLock off, and access the X server with 'export DISPLAY=:0' so I could run commands that talked to it.)

There are two solutions I wound up with, the narrow one and the general one. The narrow solution is to use xdotool to artificially send a CapsLock key down/up event with this:

xdotool key Caps_Lock

This will toggle the state of the CapsLock modifier in the X server, which will turn CapsLock off if it's currently on, as it was for me. This key down/up event works even if you have the CapsLock key remapped at the time, as I did, and you can run it from another virtual console with 'DISPLAY=:0 xdotool key Caps_Lock' (although you may need to vary the :0 bit). Or you can put it in a script called 'RESET-CAPSLOCK' so you can type its name with CapsLock active.
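
Such a script is almost nothing; a sketch (not exactly what I use) is:

#!/bin/sh
# toggle the X CapsLock modifier even if the CapsLock key is remapped
exec xdotool key Caps_Lock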

(Possibly I should give my 'reset-input' script an all-caps alias. It's also accessible from a window manager menu, but modifiers can make those inaccessible too.)

However, I'd like something to clear the CapsLock modifier that I can put in my 're-establish my keyboard settings' script, and since this xdotool trick only toggles the setting it's not suitable. Fortunately you can clear modifier states from an X client; unfortunately, as far as I know there's no canned 'capslockx' program the way there is a numlockx (which people have and use for good reasons). Fortunately, the same AskUbuntu question and answer that I got the xdotool invocation from also had a working Python program (you want the one from this answer by diegogs). For assorted reasons, I'm putting my current version of that Python program here:

#!/usr/bin/python
from ctypes import *

class Display(Structure):
  """ opaque struct """

X11 = cdll.LoadLibrary("libX11.so.6")
X11.XOpenDisplay.restype = POINTER(Display)

# None is passed as a NULL pointer, ie 'use $DISPLAY'.
display = X11.XOpenDisplay(None)
# XkbLockModifiers(dpy, device, affect, values): 0x0100 is
# XkbUseCoreKbd, 2 is the Lock (CapsLock) modifier bit, and a
# value of 0 forces that modifier off.
X11.XkbLockModifiers(display, c_uint(0x0100), c_uint(2), c_uint(0))
X11.XCloseDisplay(display)

(There is also a C version in the question and answers, but you obviously have to compile it.)

In theory there is probably some way to reset the setxkbmap settings state so that CapsLock is a CapsLock key again (after all, package updates do it), which would have let me directly turn off CapsLock. In practice I couldn't find out how to do this in my flailing Internet searches so I went with the answer I could find. In retrospect I might also have been able to reset settings by unplugging and replugging my USB keyboard or plugging in a second keyboard, and we do have random USB keyboards sitting around in the office.

The X Window System and the curse of NumLock

By: cks

In X, like probably any graphical environment, there are a variety of layers to keys and characters that you type. One of the layers is the input events that the X server sends to applications. As covered in the xlib manual, these contain a keycode, representing the nominal physical key, a keysym, representing what is nominally printed on the key, and a bitmap of the modifiers currently in effect, which are things like 'Shift' or 'Ctrl' (cf). The separation between keycodes and keysyms lets you do things like remap your QWERTY keyboard to Dvorak; you tell X to change what keysyms are generated for a bunch of the keycodes. Programs like GNU Emacs read the state of the modifiers to determine what you've typed (from their perspective), so they can distinguish 'Ctrl-Return' from plain 'Return'.

Ordinary modifiers are normally straightforward, in that they are additional keys that are held down as you type the main key. Control, Shift, and Alt all work this way (by default). However, some modifiers are 'sticky', where you tap their key once to turn them on and then tap their key again to turn them off. The obvious example of this is Caps Lock (unless you turn its effects off, remapping its physical key to be, say, another Ctrl key). Another example, one that many X users have historically wound up quietly cursing, is NumLock. The reason people wind up cursing NumLock, and the reason I have a program to control its state, is how X programs (such as window managers) often do their key and mouse button bindings.

(There are also things that will let you make non-sticky modifier keys into sticky keys.)

Suppose, for example, that you have a bunch of custom fvwm mouse bindings that are for things like 'middle mouse button plus Alt', 'middle mouse button plus Shift and Alt', 'plain right mouse button on the root', and so on. Fvwm and most other X programs will normally (have to) interpret this completely literally; when you create a binding for 'middle mouse plus Alt', the state of the current modifiers must be exactly 'Alt' and nothing else. If the X server has NumLock on for some reason (such as you hitting the key on the keyboard), the state of the current modifiers will actually be 'NumLock plus Alt', or 'NumLock plus Alt and Shift', or just 'NumLock' (instead of 'no modifiers in effect'). As a result, fvwm will not match any of your bindings and nothing will happen as you're poking away at your keyboard and your mouse.
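
To make this concrete, here is a sketch of what such fvwm bindings look like (the function and menu names are made up):

# middle mouse button plus Alt/Meta over an application window
Mouse 2 W M Move
# plain right mouse button on the root window, with no modifiers at all
Mouse 3 R N Menu RootMenu

With NumLock on, the actual modifier state for those clicks is 'Mod2 plus Meta' or just 'Mod2' (NumLock is usually attached to Mod2), so neither binding matches.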

Of course, this can also happen with CapsLock, which has the same sticky behavior. But CapsLock has extremely obvious effects when you type ordinary characters in terminal windows, editors, email, and so on, so it generally doesn't take very long before people realize they have CapsLock on. NumLock doesn't normally change the main letters or much of anything else; on some keyboard layouts, it may not change anything you can physically type. As a result, having NumLock on can be all but invisible (or completely invisible on keyboards with no NumLock LED). To make it worse, various things have historically liked 'helpfully' turning NumLock on for you, or starting in a mode with NumLock on.

(X programs can alter the current modifier status, so it's possible for NumLock to get turned on even if there is no NumLock key on your keyboard. The good news is that this also makes it possible to turn it off again. A program can also monitor the state of modifiers, so I believe there are ones that give you virtual LEDs for some combination of CapsLock, ScrollLock, and NumLock.)
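
For example, assuming you have them installed, either of these should deal with a stuck NumLock modifier (the first forces it off, the second toggles it):

numlockx off
xdotool key Num_Lock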

So the curse of NumLock in X is that having NumLock on can cause mysterious key binding failures in various programs, while often being more or less invisible. And for X protocol reasons, I believe it's hard for window managers to tell the X server 'ignore NumLock when considering my bindings' (see, for example, the discussion of IgnoreModifiers in the fvwm3 manual).

The importance of an ordinary space in a Unix shell command line

By: cks

In the sidebar to yesterday's entry I (originally) made a Unix command line mistake by unthinkingly leaving out an ordinary, innocent looking space (it's corrected in the current version of the entry after it was noted by Emilio in a comment). This innocent looking mistake and its consequences are an illustration of something in Unix shell command lines, although I'm not sure of just what, so I'm going to write it up.

The story starts with the general arguments of Bash's 'read' builtin:

read [-ers] [-a aname] [-d delim] [-i text] [-n nchars] [-N nchars] [-p prompt] [-t timeout] [-u fd] [name …]

The 'read' builtin follows the general standard behavior of Unix commands where '-d delim' and other options that take an argument can be shortened to omit the space, so '-ddelim'. So you can write, for example:

echo "a:b:c" | while IFS= read -r -d':' l; do echo "$l"; done

Bash also has a special feature for -d. Normally the first character of delim is taken as the 'line' terminator, but if delim is blank, read will terminate the line when it reads a NUL character (0 byte), which is just what you want to handle the output of, for example, 'find ... -print0'.
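
For example, a sketch of the usual way to consume 'find ... -print0' output in Bash (the empty '' argument is covered just below):

find /some/dir -type f -print0 |
  while IFS= read -r -d '' fname; do
    printf '%s\n' "$fname"
  done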

The way you create an empty string argument in a Bash command line is to use an empty pair of quotes:

read -r -d '' line

So when I was writing the original command line in yesterday's entry, I absently mashed these two things together in my mind and wrote:

read -r -d'' line

I've used '' to create an empty argument and then I've done the standard thing of removing the space between -d and its argument. So clearly I've given '-d' an empty argument, right? Nope.

In Bash and other conventional shells, '' is nothingness. It only means an argument that is an empty string if it occurs on its own; this is a special interpretation added by the shell, and programs don't actually see the ''s. If you put a '' next to other non-whitespace characters, it disappears in the command line that the program will see. So writing -d'' was the same as writing -d with no argument, and the command line as 'read' would see it was actually:

read -r -d line

Which would have caused 'read' to use 'l' as the line terminator.

In the process of writing this entry, I realized that there's a more interesting way to make what is fundamentally the same mistake, although it goes deeper into Unix arcana and doesn't look half as obvious. In many modern Bourne-compatible shells, Bash included, you can write a NUL character (0 byte) as $'\0'. So you will see people write a 'read with NUL terminated lines' command line as:

IFS= read -r -d $'\0' line

This works fine, and unlike the '' case we obviously have a real argument here, not just an empty argument, so clearly we can shorten this to:

IFS= read -r -d$'\0' line

If you try this you will discover it doesn't work. The fundamental problem is that Unix command line arguments can't include NUL characters, because the Unix command line API passes the arguments as an array of NUL-terminated (C) strings. No matter how you invoke a program, the first NUL character in an argument is the end of that argument from the program's perspective. So although it looked very different as typed, from read's perspective what we did was the same as:

IFS= read -r -d line

(And then it would have the same effect as my mistake.)

PS: This is a little tangled because 'read' is a Bash builtin so in theory Bash doesn't have to stick to the limits of the kernel API, but in practice I think Bash does do so.

What the original 4.2 BSD csh hashed (which is not what I thought)

By: cks

Recently, Unix shells keeping track of where they'd found commands came up on the Fediverse again, as it does every so often; for instance, last year I advocated for doing away with the whole thing. As far as I know, (Unix) shell command hashing originated with BSD Unix's csh, which added command hashing and a 'rehash' builtin. However, if you actually read the 4.2 BSD csh(1) manual page, it says something a bit odd (emphasis mine):

rehash: Causes the internal hash table of the contents of the directories in the path variable to be recomputed. This is needed if new commands are added to directories in the path while you are logged in. [...]

The way command hashing typically works in modern shells is that the shell remembers the specific full path to a given command (or sometimes that the command doesn't exist). This is explicitly described in the Bash manual, which says (for example) 'Bash uses a hash table to remember the full pathnames of executable files'. In this case, if you or someone else adds a new command to something in $PATH and you've never run that command before (because it didn't used to exist), you're fine and don't need to rehash; your shell will automatically go looking for a new command in $PATH.

It turns out that the 4.2 BSD csh did not hash commands this way. Instead, well, let's quote a comment from sh.exec.c:

Xhash is an array of HSHSIZ chars, which are used to hash execs. If it is allocated, then to tell whether ``name'' is (possibly) present in the i'th component of the variable path, you look at the i'th bit of xhash[hash("name")]. This is setup automatically after .login is executed, and recomputed whenever ``path'' is changed.

To translate that, csh does not 'hash' where commands are found the way modern shells do. Instead of looking up commands and then remembering where it found them, it scans all of the directories on your $PATH and remembers the hash values of the names it saw in each of them. When csh tries to run a command, it gets the hash value of the command name, looks it up in the hash table, and skips all $PATH entries that hash value definitely isn't in. If you run a newly added command, the odds are very low that its name will hash to a hash value that has the right bit set in its hash table entry.

There can be hash value collisions between different command names and if you have more than 8 $PATH entries, more than one entry can set the same bit, so finding a set bit merely means that potentially the command is there. So this is not as good as remembering exactly where the command is, but on the other hand it takes up a lot less memory; the default csh hash size is 511 bytes. It also means that you definitely want to do 'rehash' when you or someone else modifies any directory on your $PATH, because the odds are very high that any new additions won't be properly recognized.

(What 'rehash' does is that it re-runs the code that sets up this hash table, which is also run when $PATH is changed and so on.)

Bash's sadly flawed smart (programmable) completion

By: cks

Bash has an interesting and broadly useful feature called 'programmable completion' (this has sort of come up before). Programmable completion makes it possible for Bash to auto-complete things like command line options for the current program for you. Unfortunately one flaw in Bash's programmable completion is that it doesn't understand enough Bash command line syntax and so can get in your way.

Suppose, not hypothetically, that you are typing the following on the Bash command line on a Debian-based Linux system, where everything after the '#' prompt is what you've typed when you hit tab:

# apt-get install $(grep -v '^#' somefi<TAB>

When you hit TAB to complete the file name, nothing will happen. This is because Bash has been told what the arguments to apt-get are and so 'knows' that they don't include files (this is actually wrong these days, but never mind). Bash isn't smart enough to recognize that by typing '$(' you've started writing a command substitution and are now in a completely different completion context, that of grep, where you definitely should be allowed to complete file names.

Bash could in theory be this smart but there are probably a number of obstacles to doing that in practice. For example, we don't have a well-formed command substitution here, since we haven't typed the closing ')' yet; Bash would effectively have to do something like the sort of on the fly parsing of incomplete code that editor autocompletion does. It's possible that Bash could do better with some heuristics, but the current situation is broadly easy to explain and reason about, even if the result is sometimes frustrating.

There are at least two ways to disable these programmable completions in Bash. You can turn off the feature entirely with 'shopt -u progcomp' or you can flush all of the registered programmable completions with 'complete -r'. In theory you can see the list of current completions with 'complete -p' and then remove just one of them with 'complete -r <name>', but in practice 'complete -p' doesn't always list a program until I've started trying to do completions with it.
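
Put together, the commands look like this (using apt-get from the earlier example as the completion to remove):

shopt -u progcomp      # turn programmable completion off entirely
complete -r            # or flush every registered completion
complete -p            # list what's currently registered
complete -r apt-get    # or remove just apt-get's completion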

(Where the shell snippets that define completions go is system dependent, but Linux systems will often put them in '/usr/share/bash-completion/completions'.)

People with better memories than me can also use M-/ instead to force completion of a filename no matter what Bash's programmable completion thinks should go there. M-! will complete command names, M-$ will complete variable names, and M-~ will complete user names. You can find these in the Bash manual page as the various 'complete-*' readline (key binding) command names.

(Another general flaw in programmable completion is that it relies on the people who provide the completion definitions for commands getting it right and they don't always do that, as I've seen in the past.)

A peculiarity of the X Window System: Windows all the way down

By: cks

Every window system has windows, as an entity. Usually we think of these as being used for, well, windows and window like things; application windows, those extremely annoying pop-up modal dialogs that are always interrupting you at the wrong time, even perhaps things like pop-up menus. In its original state, X has more windows than that. Part of how and why it does this is that X allows windows to nest inside each other, in a window tree, which you can still see today with 'xwininfo -root -tree'.

One of the reasons that X has copious nested windows is that X was designed with a particular model of writing X programs in mind, and that model made everything into a (nested) window. Seriously, everything. In an old fashioned X application, windows are everywhere. Buttons are windows (or several windows if they're radio buttons or the like), text areas are windows, menu entries are each a window of their own within the window that is the menu, visible containers of things are windows (with more windows nested inside them), and so on.

This copious use of windows allows a lot of things to happen on the server side, because various things (like mouse cursors) are defined on a per-window basis, and also windows can be created with things like server-set borders. So the X server can render sub-window borders to give your buttons an outline and automatically change the cursor when the mouse moves into and out of a sub-window, all without the client having to do anything. And often input events like mouse clicks or keys can be specifically tied to some sub-window, so your program doesn't have to hunt through its widget geometry to figure out what was clicked. There are more tricks; for example, you can get 'enter' and 'leave' events when the mouse enters or leaves a (sub)window, which programs can use to highlight the current thing (ie, subwindow) under the cursor without the full cost of constantly tracking mouse motion and working out what widget is under the cursor every time.

The old, classical X toolkits like Xt and the Athena widget set (Xaw) heavily used this 'tree of nested windows' approach, and you can still see large window trees with 'xwininfo' when you apply it to old applications with lots of visible buttons; one example is 'xfontsel'. Even the venerable xterm normally contains a nested window (for the scrollbar, which I believe it uses partly to automatically change the X cursor when you move the mouse into the scrollbar). However, this doesn't seem to be universal; when I look at one Xaw-based application I have handy, it doesn't seem to use subwindows despite having a list widget of things to click on. Presumably in Xaw and perhaps Xt it depends on what sort of widget you're using, with some widgets using sub-windows and some not. Another program, written using Tk, does use subwindows for its buttons (with them clearly visible in 'xwininfo -tree').
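
For example, assuming you have an xfontsel running, something like this will show you its (large) window tree:

xwininfo -tree -name xfontsel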

This approach fell out of favour for various reasons, but certainly one significant one is that it's strongly tied to X's server side rendering. Because these subwindows are 'on top of' their parent (sub)windows, they have to be rendered individually; otherwise they'll cover what was rendered into the parent (and naturally they clip what is rendered to them to their visible boundaries). If you're sending rendering commands to the server, this is just a matter of what windows they're for and what coordinates you draw at, but if you render on the client, you have to ship over a ton of little buffers (one for each sub-window) instead of one big one for your whole window, and in fact you're probably sending extra data (the parts of all of the parent windows that gets covered up by child windows).

So in modern toolkits, the top level window and everything in it is generally only one X window with no nested subwindows, and all buttons and other UI elements are drawn by the client directly into that window (usually with client side drawing). The client itself tracks the mouse pointer and sends 'change the cursors to <X>' requests to the server as the pointer moves in and out of UI elements that should have different mouse cursors, and when it gets events, the client searches its own widget hierarchy to decide what should handle them (possibly including client side window decorations (CSD)).

(I think toolkits may create some invisible sub-windows for event handling reasons. Gnome-terminal and other Gnome applications appear to create a 1x1 sub-window, for example.)

As a side note, another place you can still find this many-window style is in some old fashioned X window managers, such as fvwm. When fvwm puts a frame around a window (such as the ones visible on windows on my desktop), the specific elements of the frame (the title bar, any buttons in the title bar, the side and corner drag-to-resize areas, and so on) are all separate X sub-windows. One thing I believe this is used for is to automatically show an appropriate mouse cursor when the mouse is over the right spot. For example, if your mouse is in the right side 'grab to resize right' border, the mouse cursor changes to show you this.

(The window managers for modern desktops, like Cinnamon, don't handle their window manager decorations like this; they draw everything as decorations and handle the 'widget' nature of title bar buttons and so on internally.)

An illustration of how much X cares about memory usage

By: cks

In a comment on yesterday's entry talking about X's server side graphics rendering, B.Preston mentioned that another reason for this was to conserve memory. This is very true. In general, X is extremely conservative about requiring memory, sometimes to what we now consider extreme lengths, and there are specific protocol features (or limitations) related to this.

The modern approach to multi-window graphics rendering is that each window renders into a buffer that it owns (often with hardware assistance) and then the server composites (appropriate parts of) all of these buffers together to make up the visible screen. Often this compositing is done in hardware, enabling you to spin a cube of desktops and their windows around in real time. One of the things that clients simply don't worry about (at least for their graphics) is what happens when someone else's window is partially or completely on top of their window. From the client's perspective, nothing happens; they keep drawing into their buffer and their buffer is just as it was before, and all of the occlusion and stacking and so on are handled by the composition process.

(In this model, a client program's buffer doesn't normally get changed or taken away behind the client's back, although the client may flip between multiple buffers, only displaying one while completely repainting another.)

The X protocol specifically does not require such memory consuming luxuries as a separate buffer for each window, and early X implementations did not have them. An X server might have only one significant-sized buffer, that being screen memory itself, and X clients drew right on to their portion of the screen (by sending the X server drawing commands, because they didn't have direct access to screen memory). The X server would carefully clip client draw operations to only touch the visible pixels of the client's window. When you moved a window to be on top of part of another window, the X server simply threw away (well, overwrote) the 'under' portion of the other window. When the window on top was moved back away again, the X server mostly dealt with this by sending your client a notification that parts of its window had become visible and the client should repaint them.

(X was far from alone with this model, since at the time almost everyone was facing similar or worse memory constraints.)

The problem with this 'damage and repaint' model is that it can be janky; when a window is moved away, you get an ugly result until the client has had the time to do a redraw, which may take a while. So the X server had some additional protocol level features, called 'backing store' and 'save-under(s)'. If a given X server supported these (and it didn't have to), the client could request (usually during window creation) that the server maintain a copy of the obscured bits of the new window when it was covered by something else ('backing store') and separately that when this window covered part of another window, the obscured parts of that window should be saved ('save-under', which you might set for a transient pop-up window). Even if the server supported these features in general it could specifically stop doing them for you at any time it felt like it, and your client had to cope.

(The X server can also give your window backing store whether or not you asked for it, at its own discretion.)

All of this was to allow an X server to flexibly manage the amount of memory it used on behalf of clients. If an X server had a lot of memory, it could give everything backing store; if it started running short, it could throw some or all of the backing store out and reduce things down to (almost) a model where the major memory use was the screen itself. Even today you can probably arrange to start an X server in a mode where it doesn't have backing store (the '-bs' command line option, cf Xserver(1), which you can try in Xnest or the like today, and also '-wm'). I have a vague memory that back in the day there were serious arguments about whether or not you should disable backing store in order to speed up your X server, although I no longer have any memory about why that would be so (but see).
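
If you want to experiment with this today, a sketch of doing it with Xnest (the ':1' display number is arbitrary):

Xnest :1 -bs &
DISPLAY=:1 xterm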

As far as I know all X servers normally operate with backing store these days. I wouldn't be surprised if some modern X clients would work rather badly if you ran them on an X server that had backing store forced off (much as I suspect that few modern programs will cope well with PseudoColor displays).

PS: Now that I look at 'xdpyinfo', my X server reports 'options: backing-store WHEN MAPPED, save-unders NO'. I suspect that this is a common default, since you don't really need save-unders if everything has backing store enabled when it's visible (well, in X mapped is not quite 'visible', cf, but close enough).

X graphics rendering as contrasted to Wayland rendering

By: cks

Recently, Thomas Adam (of fvwm fame) pointed out on the FVWM mailing list (here, also) a difference between X and Wayland that I'd been vaguely aware of before but hadn't actually thought much about. Today I feel like writing it down in my own words for various reasons.

X is a very old protocol (dating from the mid to late 1980s), and one aspect of that is that it contains things that modern graphics protocols don't. From a modern point of view, it isn't wrong to describe X as several protocols in a trenchcoat. Two of the largest such protocols are one for what you could call window management (including event handling) and a second one for graphics rendering. In the original vision of X, clients used the X server as their rendering engine, sending a series of 2D graphics commands to the server to draw things like lines, rectangles, arcs, and text. In the days of 10 Mbit/second local area networks and also slow inter-process communication on your local Unix machine, this was a relatively important part of both X's network transparency story and X's performance in general. We can call this server (side) rendering.

(If you look at the X server drawing APIs, you may notice that they're rather minimal and generally lack features that you'd like to do modern graphics. Some of this was semi-fixed in X protocol extensions, but in general the server side X rendering APIs are rather 1980s.)

However, X clients didn't have to do their rendering in the server. Right from the beginning they could render to a bitmap on the client side and then shove the bitmap over to the server somehow (the exact mechanisms depend on what X extensions are available). Over time, more and more clients started doing more and more client (side) rendering, where they rendered everything under their own control using their own code (well, realistically a library or a stack of them, especially for complex things like rendering fonts). Today, many clients and many common client libraries are entirely or almost entirely using client side rendering, in part to get modern graphics features that people want, and these days clients even do client side (window) decoration (CSD), where they draw 'standard' window buttons themselves.

(This tends to make window buttons not so standard any more, especially across libraries and toolkits.)

As a protocol designed relatively recently, Wayland is not several protocols in a trenchcoat. Instead, the (core) Wayland protocol is only for window management (including event handling), and it has no server side rendering. Wayland clients have to do client side rendering in order to display anything, using whatever libraries they find convenient for this. Of course this 'rendering' may be a series of OpenGL commands that are drawn on to a buffer that's shared with the Wayland server (what is called direct rendering (cf), which is also the common way to do client side rendering in X), but this is in some sense a detail. Wayland clients can simply render to bitmaps and then push those bitmaps to a server, and I believe this is part of how waypipe operates under the covers.

(Since Wayland was more or less targeted at environments with toolkits that already had their own graphics rendering APIs and were already generally doing client side rendering, this wasn't seen as a drawback. My impression is that these non-X graphics APIs were already in common use in many modern clients, since it includes things like Cairo. One reason that people switched to such libraries and their APIs even before Wayland is that the X drawing APIs are, well, very 1980s, and don't have a lot of features that modern graphics programming would like. And you can draw directly to a Wayland buffer if you want to, cf this example.)

One implication of this is that some current X programs are much easier to port (or migrate) to Wayland than others. The more an X program uses server side X rendering, the more it can't simply be re-targeted to Wayland, because it needs a client side library to substitute for the X server side rendering functionality. Generally such programs are either old or were deliberately written to be minimal X clients that didn't depend on toolkits like Gtk or even Cairo.

(Substituting in a stand alone client side drawing library is probably not a small job, since I don't think any of them so far are built to be API compatible with the relevant X APIs. It also means taking on additional dependencies for your program, although my impression is that some basic graphics libraries are essentially standards by now.)
