Suppose, not entirely hypothetically, that you've made local
changes to an Ubuntu package using dgit
and now Ubuntu has come out with an update to that package that you
want to switch to, with your local changes still on top. Back when
I wrote about moving local changes to a new Ubuntu release with
dgit, I wrote an appendix with a theory
of how to do this, based on a conversation. Now that I've
actually done this, I've discovered that there is a minor variation
and I'm going to write it down explicitly (with additional notes
because I forgot some things between then
and now).
I'll assume we're starting from an existing dgit based repository
with a full setup of local changes,
including an updated debian/changelog. Our first step, for safety,
is to make a branch to capture the current state of our repository.
I suggest you name this branch after the current upstream package
version that you're on top of, for example if the current upstream
version you're adding local changes to can be summarized as
'ubuntu2.6':
git branch cslab-2.6
Making a branch allows you to use 'git diff cslab-2.6..' later to
see exactly what changed between your versions. A useful thing to
do here is to exclude the 'debian/' directory from diffs, which can
be done with 'git diff cslab-2.6.. -- . :!debian', although
your shell may require you to quote the '!' (cf).
Then we need to use dgit to fetch
the upstream updates:
dgit fetch -d ubuntu
We need to use '-d ubuntu', at least in current versions of dgit,
or 'dgit fetch' gets confused and fails. At this point we have the
updated upstream in the remote tracking branch
'dgit/dgit/jammy,-security,-updates' but our local tree is still
not updated.
(All of dgit's remote tracking branches start with 'dgit/dgit/',
while all of its local branches start with just 'dgit/'. This is
less than optimal for my clarity.)
Normally you would now rebase to shift your local changes on top
of the new upstream, but we don't want to immediately do that. The
problem is that our top commit is our own dgit-based change to
debian/changelog, and we don't want to rebase that commit; instead
we'll make a new version of it after we rebase our real local
changes. So our first step is to discard our top commit:
git reset --hard HEAD~
(In my original theory I didn't realize
we had to drop this commit before the rebase, not after, because
otherwise things get confused. At a minimum, you wind up with
debian/changelog out of order, and I don't know if dropping your
HEAD commit after the rebase works right. It's possible you might
get debian/changelog rebase conflicts as well, so I feel dropping
your debian/changelog change before the rebase is cleaner.)
Now we can rebase, for which the simpler two-argument form does
work (but not plain rebasing,
or at least I didn't bother testing plain rebasing):
git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates
(If you are wondering how this command possibly works, as I was
part way through writing this entry, note that the first branch is
'dgit/dgit/...', ie our remote tracking branch, and then second
branch is 'dgit/...', our local branch with our changes on it.)
At this point we should have all of our local changes stacked on top
of the upstream changes, but no debian/changelog entry for them that
will bump the package version. We create that with:
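(The exact invocation depends on the local versioning convention you adopted originally; the '+cslab' suffix and the message here are just illustrative.)
dch --local +cslab "Rebuild with our local changes on top of the new upstream update"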
Then we can build with 'dpkg-buildpackage -uc -b', and afterward
do 'git clean -xdf; git reset --hard' to reset your tree back to
its pristine state.
(My view is that while you can prepare a source package for your
work if you want to, the 'source' artifact you really want to save
is your dgit VCS repository. This will be (much) less bulky when
you clean it up to get rid of all of the stuff (to be polite) that
dpkg-buildpackage leaves behind.)
Suppose, not entirely hypothetically, that you've traditionally
used /etc/resolv.conf on your Ubuntu servers but you're considering
switching to systemd-resolved, partly for fast failover if your
normal primary DNS server is unavailable
and partly because it feels increasingly dangerous not to, since
resolved is the normal configuration and what software is likely
to expect. One of the ways that resolv.conf is nice is that you can
set the configuration by simply copying a single file that isn't
used for anything else. On Ubuntu, this is unfortunately not the
case for systemd-resolved.
Canonical expects you to operate all of your Ubuntu server networking
through Canonical Netplan. In reality,
Netplan will render things down to a systemd-networkd configuration,
which has some important effects and
creates some limitations. Part of
that rendered networkd configuration is your DNS resolution settings,
and the natural effect of this is that they have to be associated
with some interface, because that's the resolved model of the
world. This means that Netplan specifically attaches DNS server information to specific network interfaces in your Netplan configuration. As a result, you must find the
specific device name and then modify settings within it, and those
settings are intermingled (in the same file) with settings you can't
touch.
Netplan does not give you a good way to do this from scripts; if anything, Netplan goes out of its way to not do so. For example, Netplan can dump its
full or partial configuration, but it does so in YAML form with no
option for JSON (which you could readily search through in a script
with jq). However, if you want to modify the Netplan YAML without
editing it by hand, 'netplan set' sometimes requires JSON as input.
Lack of any good way to search or query Netplan's YAML matters
because for things like DNS settings, you need to know the right
interface name. Without support for this in Netplan, you wind
up doing hacks to try to get the right interface name.
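For example, one hack is to ask iproute2 rather than Netplan which interface carries the default route, and feed that back into 'netplan set' (a sketch; the DNS server address here is made up):
IFACE="$(ip -j route show default | jq -r '.[0].dev')"
netplan set "network.ethernets.$IFACE.nameservers.addresses=[128.100.1.1]"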
Netplan also doesn't provide you any good way to remove settings.
The current Ubuntu 26.04 beta installer writes a Netplan configuration
that locks your interfaces to specific MAC addresses:
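The relevant piece of the installer's YAML looks something like this (the MAC address and interface name here are made up, and I've left out the addressing details):
network:
  version: 2
  ethernets:
    enp1s0:
      match:
        macaddress: "52:54:00:12:34:56"
      set-name: "enp1s0"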
This is rather undesirable if you may someday swap network cards
or transplant server disks from one chassis to another, so we would
like to automatically take it out. Netplan provides no support for
this; 'netplan set' can't be given a blank replacement, for example
(and 'netplan set "network.ethernets.enp1s0.match={}"' doesn't
do anything). If Netplan would give you all of the enp1s0 block in
JSON format, maybe you could edit the JSON and replace the whole
thing, but that's not available so far.
(For extra complication you also need to delete the set-name, which
is only valid with a 'match:'.)
Another effect of not being able to delete things in scripts is
that you can't write scripts that move things out to a different
Netplan .yaml file that has only your settings for what you care
about. If you could reliably get the right interface name and you
could delete DNS settings from the file the installer wrote, you
could fairly readily create a '/etc/netplan/60-resolv.yaml' file
that was something close to a drop-in /etc/resolv.conf. But as it
is, you can't readily do that.
There are all sorts of modifications you might want to make through
a script, such as automatically configuring a known set of VLANs to attach them to whatever the appropriate
host interface is. Scripts are good for automation and they're also
good for avoiding errors, especially if you're doing repetitive
things with slight differences (such as setting up a dozen VLANs
on your DHCP server). Netplan fights you almost all the way about
doing anything like this.
My best guess is that all of Canonical's uses of Netplan either use
internal tooling that reuses Netplan's (C) API or simply
re-write Netplan files from scratch (based on, for example, cloud
provider configuration information).
(To save other people the time, the netplan Python package on
PyPI seems to be a third party
package and was last updated in 2019. Which is a pity, because it
theoretically has a quite useful command line tool.)
One bleakly amusing thing I've found out through using 'netplan
set' on Ubuntu 26.04 is that the Ubuntu server installer and Netplan
itself have slightly different views on how Netplan files should
be written. The original installer version of the above didn't have
the quotes around the strings; 'netplan set' added them.
(All of this would be better if there was a widely agreed on,
generally shipped YAML equivalent of 'jq', or better yet something
that could also modify YAML in place as well as query it in forms
that were useful for automation. But the 'jq for YAML' ecosystem
appears to be fragmented at best.)
The other day I wrote about a brute force approach to mapping
IPv4 /24 subnets to Autonomous System Numbers (ASNs), where I built a big, somewhat
sparse file of four-byte records, with the record for each /24 at
a fixed byte position determined by its first three octets (so
0.0.0.0/24's ASN, if any, is at byte 0, 0.0.1.0/24 is at byte 4,
and so on). My initial approach was to open, lseek(), and read()
to access the data; in a comment, Aristotle Pagaltzis wondered if mmap() would perform better.
The short answer is that for my specific case I think it would be
worse, but the issue is interesting to talk about.
(In general, my view is that you should use mmap() primarily if it
makes the code cleaner and simpler. Using mmap() for performance
is a potentially fraught endeavour that you need to benchmark.)
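To make the comparison concrete, here is roughly what the two lookup approaches look like, sketched in modern Python rather than my actual Python 2 code, and assuming the records are big-endian 32-bit ASNs:
import mmap
import struct

def rec_offset(ip):
    # each /24 has a fixed 4-byte slot indexed by its first three octets
    a, b, c, _ = (int(x) for x in ip.split("."))
    return ((a << 16) | (b << 8) | c) * 4

def asn_read(path, ip):
    # lookup with open(), lseek(), and read()
    with open(path, "rb") as f:
        f.seek(rec_offset(ip))
        return struct.unpack(">I", f.read(4))[0]

def asn_mmap(path, ip):
    # lookup with open(), mmap(), and touching one page
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        try:
            off = rec_offset(ip)
            return struct.unpack(">I", m[off:off + 4])[0]
        finally:
            m.close()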
In my case I have two strikes against mmap() likely being a performance
advantage: I'm working in Python (and specifically Python 2) so I
can't really directly use the mmap()'d memory, and I'm normally
only making a single lookup in the typical case (because my program
is running as a CGI). In the non-mmap() case I expect to do an
open(), an lseek(), and a read() (which will trigger the kernel
possibly reading from disk and then definitely copying data to me).
In the mmap() case I would do open(), mmap(), and then access some
page, triggering possible kernel IO and then causing the kernel to
manipulate process memory mappings to map the page into my address
space. In general, it seems unlikely that mmap() plus the page
access handling will be cheaper than lseek() plus read().
(In both the mmap() and read() cases I expect two transitions into
and out of the kernel. As far as I know, lseek() is a cheap system
call (and certainly it seems unlikely to be more expensive than
mmap(), which has to do a bunch of internal kernel work), and the
extra work the read() does to copy data from the kernel to user
space is probably no more work than the kernel manipulating page
tables, and could be less.)
If I was doing more lookups in a single process, I could possibly
win with the mmap() approach but it's not certain. A lot depends
on how often I would be looking up something on an already mapped
page and how expensive mapping in a new page is compared to some
number of lseek() plus read() system calls (or pread() system calls
if I had access to that, which cuts the number of system calls in
half). In some scenarios, such as a burst of traffic from the same
network or a closely related set of networks, I could see a high
hit rate on already mapped pages. In others, the IPv4 addresses are
basically random and widely distributed, so many lookups would
require mapping new pages.
(Using mmap() makes it unnecessary to keep my own in-process cache,
but I don't think it really changes what the kernel will cache for
me. Both read()'ing from pages and accessing them through mmap()
keeps them recently used.)
Things would also be better in a language where I could easily make
zero-copy use of data right out of the mmap()'d pages themselves.
Python is not such a language, and I believe that basically any
access to the mmap()'d data is going to create new objects and copy
some bytes around. I expect that this results in as many intermediate
objects and so on as if I used Python's read() stuff.
(Of course if I really cared there's no substitute for actually
benchmarking some code. I don't care that much, and the code is
simpler with the regular IO approach because I have to use the
regular IO approach when writing the data file.)
As far as I know, virt-manager and virsh don't directly allow you
to switch a virtual machine between BIOS and UEFI after it's been
created, partly because the result is probably not going to boot
(unless you deliberately set up the OS inside the VM with both an
EFI boot and a BIOS MBR boot environment). Within virt-manager, you
can only select BIOS or UEFI at setup time, so you have to destroy
your virtual machine and recreate it. This works, but it's a bit
annoying.
(On the other hand, if you've had some virtual machines sitting
around for years and years, you might want to refresh all of their
settings anyway.)
It's possible to change between BIOS and UEFI by directly editing
the libvirt XML to transform the <os> node.
You may want to remove any old snapshots first because I don't know
what happens if you revert from a 'changed to UEFI' machine to a
snapshot where your virtual machine was a BIOS one. In my view, the
easiest way to get the necessary XML is to create (or recreate)
another virtual machine with UEFI, and then dump and copy its XML
with some minor alterations.
For me, on Fedora with the latest libvirt and company, the <os>
XML of a BIOS booting machine is:
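(The machine type here is just representative; yours will differ.)
<os>
  <type arch="x86_64" machine="pc-q35-9.2">hvm</type>
  <boot dev="hd"/>
</os>
and the UEFI version is something like this (the firmware paths and nvram details are a sketch and will vary with your libvirt and edk2 versions):
<os firmware="efi">
  <type arch="x86_64" machine="pc-q35-9.2">hvm</type>
  <firmware>
    <feature enabled="no" name="enrolled-keys"/>
    <feature enabled="no" name="secure-boot"/>
  </firmware>
  <loader readonly="yes" type="pflash">/usr/share/edk2/ovmf/OVMF_CODE.fd</loader>
  <nvram template="/usr/share/edk2/ovmf/OVMF_VARS.fd">/var/lib/libvirt/qemu/nvram/[machine-name]_VARS.fd</nvram>
  <boot dev="hd"/>
</os>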
Here the '[machine-name]' bit is the libvirt name of my virtual
machine, such as 'vmguest1'. This nvram file doesn't have to exist
in advance; libvirt will create it the first time you start up the
virtual machine. I believe it's used to provide snapshots of the
UEFI variables and so on to go with snapshots of your physical disks
and snapshots of the virtual machine configuration.
(This feature may have landed in libvirt 10.10.0, if I'm
reading release notes correctly. Certainly reading the release
notes suggests that I don't want to use anything before then
with UEFI snapshots.)
Manually changing the XML on one of my scratch machines has worked
fine to switch it from BIOS MBR to UEFI booting as far as I can
tell, but I carefully cleared all of its disk state and removed all
of its snapshots before I tried this. I suspect that I could switch
it back to BIOS if I wanted to. Over time, I'll probably change
over all of my as yet unchanged scratch virtual machines to UEFI
through direct XML editing, because it's the less annoying approach
for me. Now that I've looked this up, I'll probably do it through
'virsh edit ...' rather than virt-manager, because that way I get
my real editor.
(This is the kind of entry I write for my future use because I
don't want to have to re-derive this stuff.)
Today I made an unpleasant discovery about virt-manager on my (still)
Fedora 42 machines that I shared on the Fediverse:
This is my face that Fedora virt-manager appears to have been
defaulting to external snapshots for some time and SURPRISE, external
snapshots can't be reverted by virsh. This is my face, especially as
it seems to have completely screwed up even deleting snapshots on some
virtual machines.
(I only discovered this today because today is the first time I
tried to touch such a snapshot, either to revert to it or to clean
it up. It's possible that there is some hidden default for what
sort of snapshot to make and it's only been flipped for me.)
Neither virt-manager nor virsh will clearly tell you about this.
In virt-manager you need to click on each snapshot and if it says
'external disk only', congratulations, you're in trouble. In virsh,
'virsh snapshot-list --external <vm>' will list external snapshots,
and then 'virsh snapshot-list --tree <vm>' will tell you if they
depend on any internal snapshots.
My largest problems came from virtual machines where I had earlier
internal snapshots and then I took more snapshots, which became
external snapshots from Fedora 41 onward. You definitely can't
revert to an external snapshot in this situation, at least not
with virsh or virt-manager, and the error messages I got were
generic ones about not being able to revert external snapshots.
I haven't tested reverting external snapshots for a VM with no
internal ones.
Update: you can revert an external snapshot in the latest libvirt
if all of your snapshots are external. You can't revert them if
libvirt helpfully gave you external snapshots on top of internal
ones by switching the default type of snapshots (probably in Fedora
41).
If you have internal snapshots and you're willing to throw away the
external snapshot and what's built on it, you can use virsh or
virt-manager to revert to an internal snapshot and then delete the
external snapshot. This leaves the external snapshot's additional
disk file or files dangling around for you to delete by hand.
If you have only an external snapshot, it appears that libvirt will
let you delete the snapshot through 'virsh snapshot-delete <vm>
<external-snapshot>', which preserves the current state of the
machine's disks. This only helps if you don't want the snapshot any
more, but this is one of my common cases (where I take precautionary
snapshots before significant operations and then get rid of them
later when I'm satisfied, or at least committed).
The worst situation appears to be if you have an external snapshot
made after (and thus on top of) an earlier internal snapshot and
you want to keep the live state of things while getting rid of the
snapshots. As far as I can tell, it's impossible to do this through
libvirt, although some of the documentation suggests that you should
be able to. The process outlined in libvirt's Merging disk image
chains
didn't work for me (see also Disk image chains).
(If it worked, this operation would implicitly invalidate the
snapshots and I don't know how you get rid of them inside libvirt,
since you can't delete them normally. I suspect that to get rid of
them, you need to shut down all of the libvirt daemons and then
delete the XML files that (on Fedora) you'll find in
/var/lib/libvirt/qemu/snapshot/<domain>.)
One reason to delete external snapshots you don't need is if you
ever want to be able to easily revert snapshots in the future. I
wouldn't trust making internal snapshots on top of external ones,
if libvirt even lets you, so if you want to be able to easily revert,
it currently appears that you need to have and use only internal
snapshots. Certainly you can't mix new external snapshots with old
internal snapshots, as I've seen.
(The 5.1.0 virt-manager release
will warn you to not mix snapshot modes and defaults to whatever
snapshot mode you're already using. I don't know what it defaults
to if you don't have any snapshots; I haven't tried that yet.)
Sidebar: Cleaning this up on the most tangled virtual machine
$ virsh snapshot-delete hl-fedora-36 fedora41-preupgrade
error: Failed to delete snapshot fedora41-preupgrade
error: Operation not supported: deleting external snapshot that has internal snapshot as parent not supported
This VM has an internal snapshot as the parent because I didn't
clean up the first snapshot (taken before a Fedora 41 upgrade)
before making the second one (taken before a Fedora 42 upgrade).
In theory one can use 'virsh blockcommit' to reduce everything down
to a single file, per the knowledge base section on this.
In practice it doesn't work in this situation:
$ virsh blockcommit hl-fedora-36 vda --verbose --pivot --active
error: invalid argument: could not find base image in chain for 'vda'
(I tried with --base too and that didn't help.)
I was going to attribute this to the internal snapshot but then I
tried 'virsh blockcommit' on another virtual machine with only an
external snapshot and it failed too. So I have no idea how this is
supposed to work.
Since I could take a ZFS snapshot of the entire disk storage, I
chose violence, which is to say direct usage of qemu-img. First,
I determined that I couldn't trivially delete the internal snapshot
before I did anything else:
$ qemu-img snapshot -d fedora40-preupgrade fedora35.fedora41-preupgrade
qemu-img: Could not delete snapshot 'fedora40-preupgrade': snapshot not found
The internal snapshot is in the underlying file 'fedora35.qcow2'.
Maybe I could have deleted it safely even with an external thing
sitting on top of it, but I decided not to do that yet and proceed
to the main show:
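folding the external overlay back into its backing file (a sketch of the shape of the operation; the overlay file is the one from the qemu-img error above):
$ qemu-img commit fedora35.fedora41-preupgrade
The commit merges the overlay's changes down into fedora35.qcow2, after which the overlay file itself is no longer needed.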
Using 'qemu-img info fedora35.qcow2' showed that the internal snapshot
was still there, so I removed it with 'qemu-img snapshot -d' (this time
on fedora35.qcow2).
All of this left libvirt's XML drastically out of step with the
underlying disk situation. So I removed the XML for the snapshots
(after saving a copy), made sure all libvirt services weren't
running, and manually edited the VM's XML, where it turned out that
all I needed to change was the name of the disk file. This appears
to have worked fine.
I suspect that I could have skipped manually removing the internal
snapshot and its XML and libvirt would then have been happy to see
it and remove it.
(I'm writing all of the commands and results down partly for my
future reference.)
Traditionally, Wayland compositors have taken on the role of the
window manager as well, but this is not in fact a necessary step to
solve the architectural problems with X11. Although, I do not know
for sure why the original Wayland authors chose to combine the window
manager and Wayland compositor, I assume it was simply the path of
least resistance. [...]
Unfortunately, I believe that there are excellent reasons to put
the window manager into the display server the way Wayland has, and
the Wayland people (who were also X people) were quite familiar
with them and how X has had problems over the years because of its
split.
One large and more or less core problem is that event handling is
deeply entwined with window management. As an example, consider
this sequence of (input) events:
1. Your mouse starts out over one window. You type some characters.
2. You move your mouse over to a second window. You type some more characters.
3. You click a mouse button without moving the mouse.
4. You type more characters.
Your window manager is extremely involved in the decisions about
where all of those input events go and whether the second window
receives a mouse button click event in the third step. If the window
manager is separate from whatever is handling input events, then either some things trigger synchronous delays in further event handling, or sufficiently fast typeahead and actions are in a race with the window manager: either it updates where future events should go fast enough, or some of your typing and other actions are misdirected to the wrong place because the window manager is lagging.
Embedding the window manager in the display server is the simple
and obvious approach to ensuring that the window manager can see
and react to all events without lag, and can freely intercept and
modify all events as it wishes without clients having to care.
The window manager can even do this using extremely local knowledge
if it wants. Do you want your window manager to have key bindings
that only apply to browser windows, where the same keys are passed
through to other programs? An embedded window manager can easily
do that (let's assume it can reliably identify browser windows).
(An outdated example of how complicated you can make mouse
button bindings, never mind keyboard bindings, is my mouse
button bindings in fvwm.)
X has a collection of mechanisms that try to allow window managers
to manage 'focus' (which window receives keyboard input), intercept
(some) keys at a window manager level, and do other things that
modify or intercept events. The whole system is complex, imperfect,
and limited, and a variety of these mechanisms have weird side
effects on the X events that regular programs receive; you can often
see this with a program such as xev. Historically, not all X
programs have coped gracefully with all of the interceptions that
window managers like fvwm can do.
X's mechanisms also impose limits on what they'll allow a window
manager to do. One famous example is that in X, mouse scroll wheel
events always go to the X window under the mouse cursor. Even if your window manager uses
'click (a window) to make it take input', mouse scroll wheel input
is special and cannot be directed to a window this way. In Wayland,
a full server has no such limitations; its window manager portion
can direct all events, including mouse scroll wheels, to wherever
it feels like.
Approximately all RPM packages are signed by GPG keys (or maybe
they're supposed to be called PGP keys), which your system stores
in the RPM database as pseudo-packages (because why not). If your
Fedora install has been around long enough, as mine have, you will
have accumulated a drift of old keys and sometimes you either want
to clean them up or something unfortunate will happen to one of
those keys (I'll get to one such case).
One basic command to see your collection of GPG keys in the RPM
database is (taken from this gist):
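rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'
(Or some close variant of that query format; the important parts are the package name, which is what you'd give to 'rpm -e', and the summary, which tells you whose key it is.)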
On some systems this will give you a nice short list of keys. On others,
your list may be very long.
Since Fedora 42 (cf), DNF
has functionality (I believe more or less built in) that should
offer to remove old GPG keys that have actually expired. This is
in the 'expired PGP keys plugin'
which comes from the 'libdnf5-plugin-expired-pgp-keys' package if you don't
have it installed (with a brief manpage that's called
'libdnf5-expired-pgp-keys'). I believe there was a similar DNF4
plugin. However, there are two situations where this seems to not
work correctly.
The first situation is now-obsolete GPG keys that haven't expired
yet, for various reasons; these may be for past versions of Fedora,
for example. These days, the metadata for every DNF repository you
use should list a URL for its GPG keys (see the various .repo files
in /etc/yum.repos.d/ and look for the 'gpgkey=' lines). So one
way to clean up obsolete keys is to fetch all of the current keys
for all of your current repositories (or at least the enabled ones),
and then remove anything you have that isn't on that list. This
process is automated for you by the 'clean-rpm-gpg-pubkey' command
and package, which is mentioned in some Fedora upgrade instructions.
This will generally clean out most of your obsolete keys, although
rare people will have keys that are so old that it chokes on them.
The second situation is apparently a repository operator who is
sufficiently clever to have re-issued an expired key using the same
key ID and fingerprint but a new expiry date in the future; this
fools RPM and related tools and everything chokes. This is unfortunate,
since it will often stall all DNF updates unless you disable the
repo. One repository operator who has done this is Google, for their
Fedora Chrome repository. To fix this you'll have to manually
remove the relevant GPG key or keys. Once you've used
clean-rpm-gpg-pubkey to reduce your list of GPG keys to a reasonable
level, you can use the RPM command I showed above to list all your
remaining keys, spot the likely key or keys (based on who owns it,
for example), and then use 'rpm -e --allmatches
gpg-pubkey-d38b4796-570c8cd3' (or some other appropriate gpg-pubkey
name) to manually scrub out the GPG key. Doing a DNF operation
such as installing or upgrading a package from the repository should
then re-import the current key.
(This also means that it's theoretically harmless to overshoot and
remove the wrong key, because it will be fetched back the next time
you need it.)
(When I wrote my Fediverse post about discovering clean-rpm-gpg-pubkey, I apparently
thought I would remember it without further prompting. This was
wrong, and in fact I didn't even remember to use it when I upgraded
my home desktop. This time it will hopefully stick, and if not, I
have it written down here where it will probably be easier to find.)
We've been operating Ubuntu servers for a long time and for most
of that time we've booted them through traditional MBR BIOS boots.
Initially it was entirely through MBR and then later it was still
mostly through MBR (somewhat depending on who installed a particular
server; my co-workers are more tolerant of UEFI than I am). But
when we built the 24.04 version of our customized install media,
my co-worker wound up making it UEFI only, and so for the past two
years all of our 24.04 machines have been UEFI (with us switching
BIOSes on old servers into UEFI mode as we updated them). The
headline news is that it's gone okay, more or less as you'd expect
and hope by now.
All of our servers have mirrored system disks, and the one UEFI
thing we haven't really had to deal with so far is fixing Ubuntu's
UEFI boot disk redundancy stuff after one disk fails. I think we know how to do it in
theory but we haven't had to go through it in practice. It will
probably work out okay but it does make me a bit nervous, along
with the related issue that the Ubuntu installer makes it hard to
be consistent about which disk your '/boot/efi' filesystem comes
from.
(In the installer, /boot/efi winds up on the first disk that you
set as the boot device, but the disks aren't always presented in
order so you can do this on 'the first disk' in the installer and
discover that the first disk it listed was /dev/sdb.)
The Ubuntu 24.04 default bootloader is GRUB, so that's what we've
wound up with even though as a UEFI-only environment we could in
theory use simpler ones, such as systemd-boot.
I'm not particularly enthused about GRUB but in practice it does
what we want, which is to reliably boot our servers, and it has the
huge benefit that it's actively supported by Ubuntu (okay, Canonical)
so they're going to make sure it works right, including with their
UEFI disk redundancy stuff. If Ubuntu
switches default UEFI bootloaders in their server installs, I expect
we'll follow along.
(I don't know if Canonical has any plans to switch away from GRUB
to something else. I suspect that they'll stick with GRUB for as
long as they support MBR booting, which I suspect will be a while,
especially as people look more and more likely to hold on to old
hardware for much longer than normally expected.)
PS: One reason I'm writing this down is that I've been unenthused
about UEFI for a long time, so I'm not sure I would have predicted
our lack of troubles in advance. So I'm going to admit it, UEFI has
been actually okay. And in its favour, UEFI has regularized some
things that used to be pretty odd in the MBR BIOS era.
(I'm still not happy about the UEFI non-story around redundant
system disks, but I've accepted that hacks like the Ubuntu approach are the best we're going to get. I
don't know what distributions such as Fedora are doing here; my
Fedora machines are MBR based and staying that way until the hardware
gets replaced, which on current trends won't be any time soon.)
The other day I covered how I think systemd's IPAddressAllow and
IPAddressDeny restrictions work, which unfortunately can only be limited to specific (local) ports if you set up the sockets for those ports in a separate systemd.socket
unit. Naturally this raises the question of whether there is a good,
scalable way to restrict access to specific ports in eBPF that
systemd (or other interested parties) could use. I think the answer
is yes, so here is a sketch of how I think you'd do this.
We care about a 'scalable' way to do this because systemd generates and installs its eBPF programs on the fly. Since tcpdump
can do this sort of cross-port matching, we could write an eBPF
program that did it directly. But such a program could get complex
if we were matching a bunch of things, and that complexity might
make it hard to generate on the fly (or at least make it complex
enough that systemd and other programs didn't want to). So we'd
like a way that still allows you to generate a simple eBPF program.
Systemd uses cgroup socket SKB eBPF programs, which
attach to a cgroup and filter all network packets on ingress or
egress. As far as I can understand from staring at code, these are
implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls an LPM (Longest
Prefix Match) map. The
normal way to use an LPM map is to use the CIDR
prefix length and the start of the CIDR network as the key (for
individual IPv4 addresses, the prefix length is 32), and then match
against them, so this is what systemd's cgroup program does. This
is a nicely scalable way to handle the problem; the eBPF program
itself is basically constant, and you have a couple of eBPF maps
(for the allow and deny sides) that systemd populates with the
relevant information from IPAddressAllow and IPAddressDeny.
However, there's nothing in eBPF that requires the keys to be just
CIDR prefixes plus IP addresses. An LPM map key
has to start with a 32-bit prefix, but the size of the rest of the
key can vary. This means that we can make our keys be 16 bits longer
and stick the port number in front of the IP address (and increase
the CIDR prefix size appropriately). So to match packets to port
22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix
length then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm
doing the math and understanding the structure right). When you
query this LPM map, you put the appropriate port number in front
of the IP address.
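As a concrete illustration of the key layout, here's the idea sketched in Python (this is just my sketch, not anything systemd does, and the byte ordering of the prefix length field is something you'd have to check against the kernel's actual LPM key structure):
import struct

def port_lpm_key(port, cidr):
    # key data: a 16-bit port in front of the IPv4 address bytes,
    # with the u32 prefix length extended to cover the port bits
    net, _, plen = cidr.partition("/")
    a, b, c, d = (int(x) for x in net.split("."))
    prefixlen = 16 + int(plen)
    return struct.pack("=I", prefixlen) + struct.pack(">H4B", port, a, b, c, d)

# port 22, 128.100.0.0/16: prefix length 32, data 00 16 80 64 00 00
print(port_lpm_key(22, "128.100.0.0/16").hex())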
This does mean that each separate port with a separate set of IP
address restrictions needs its own set of map entries. If you wanted
a set of ports to all have a common set of restrictions, you could
use a normally structured LPM map and a second plain hash map where the
keys are port numbers. Then you check the port and the IP address
separately, rather than trying to combine them in one lookup. And
there are more complex schemes if you need them.
Which scheme you'd use depends on how you expect port based access
restrictions to be used. Do you expect several different ports,
each with its own set of IP access restrictions (or only one port)?
Then my first scheme is only a minor change from systemd's current
setup, and it's easy to extend it to general IP address controls
as well (just use a port number of zero to mean 'this applies to
all ports'). If you expect sets of ports to all use a common set
of IP access controls, or several sets of ports with different
restrictions for each set, then you might want a scheme with more
maps.
(In theory you could write this eBPF program and set up these maps
yourself, then use systemd resource control features to attach
them to your .service unit.
In practice, at that point you probably should write host firewall
rules instead, it's likely to be simpler. But see this blog post and the
related VCS repository,
although that uses a more hard-coded approach.)
I recently wrote about things that make me so attached to xterm. One of those things is xterm's ziconbeep
feature, which causes xterm to visibly
and perhaps audibly react when it's iconified or minimized and gets
output. A commentator suggested that this feature should ideally
be done in the window manager, where it could be more general.
Unfortunately we can't do the equivalent of ziconbeep in the window
manager, or at least we can't do all of it.
A window manager can sound an audible alert when a specific type
of window changes its title in a certain way. This would give us
the 'beep' part of ziconbeep in a general way, although we're
treading toward a programmable window manager. But then, Gnome Shell
now does a lot of stuff in JavaScript and its extensions are written
in JS and the whole thing doesn't usually blow up. So we've got
prior art for writing an extension that reacts to window title
changes and does stuff.
What the window manager can't really do is reliably detect when the
window has new output, in order to trigger any beeping and change
the visible window title. As far as I know, neither X nor Wayland
give you particularly good visibility into whether the program is
rendering things, and in some ways of building GUIs, you're always
drawing things. In theory, a program might
opt to detect that it's been minimized and isn't visible and so not
render any updates at all (although it will be tracking what to
draw for when it's no longer minimized), but in practice I think this is
unfashionable because it gets in the way of various sorts of live
previews of minimized windows (where you want the window's drawing
surface to reflect its current state).
Another limitation of this as a general window manager feature is
that the window manager doesn't know what changes in the appearance
of a window are semantically meaningful and which ones are happening
because, for example, you just changed some font preference and the
program is picking up on that. Only the program itself knows what's
semantically meaningful enough to signal for people's attention.
A terminal program can have a simple definition but other programs
don't necessarily; your mail client might decide that only certain
sorts of new email should trigger a discreet 'pay attention to me'
marker.
(Even in a terminal program you might want more control over this
than xterm gives you. For example, you might want the terminal
program to not trigger 'zicon' stuff for text output but instead
to do it when the running program finishes and you return to the
shell prompt. This is best done by being able to signal the terminal
program through escape sequences.)
Among the systemd resource controls
are IPAddressAllow= and IPAddressDeny=,
which allow you to limit what IP addresses your systemd thing can
interact with. This is implemented with eBPF.
A limitation of these as applied to systemd .service units is that
they restrict all traffic, both inbound connections and things your
service initiates (like, say, DNS lookups), while you may want
only a simple inbound connection filter.
However, you can also set these on systemd.socket
units. If you do, your IP address restrictions apply only to the socket (or
sockets), not to the service unit that it starts. To quote the
documentation:
Note that for socket-activated services, the IP access list configured
on the socket unit applies to all sockets associated with it directly,
but not to any sockets created by the ultimately activated services
for it.
So if you have a systemd socket activated service, you can control
who can access the socket without restricting who the service itself
can talk to.
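For example, a drop-in for ssh.socket could look like this (the file name is just my choice and the network is made up):
# /etc/systemd/system/ssh.socket.d/access.conf
[Socket]
IPAddressDeny=any
IPAddressAllow=128.100.0.0/16 localhost
With this, only the listed addresses can connect to the SSH listening socket, while sshd itself can still talk to anything.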
In general, systemd IP access controls are done through eBPF programs
set up on cgroups. If you set up IP access controls on a socket,
such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs
attached to the ssh.socket cgroup (and there is a ssh.socket cgroup,
perhaps because of the eBPF programs):
# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID AttachType AttachFlags Name
12 cgroup_inet_ingress multi sd_fw_ingress
11 cgroup_inet_egress multi sd_fw_egress
However, if you look there are no processes or threads in the
ssh.socket cgroup, which is not really surprising but also means
there is nothing there for these eBPF programs to apply to. And if
you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.
What I think must be going on is that the eBPF filtering program
is connected to the SSH socket itself. Since I can't find any
relevant looking uses in the systemd code of the `SO_ATTACH_*'
BPF related options from socket(7) (which
would be used with setsockopt(2) to
directly attach programs to a socket), I assume that what happens
is that if you create or perhaps start using a socket within a
cgroup, that socket gets tied to the cgroup and its eBPF programs,
and this attachment stays when the socket is passed to another
program in a different cgroup.
(I don't know if there's any way to see what eBPF programs are
attached to a socket or a file descriptor for a socket.)
If this is what's going on, it unfortunately means that there's no
way to extend this feature of socket units to get per-port IP
access control in .service units. Systemd
isn't writing special eBPF filter programs for socket units that
only apply to those exact ports, which you could in theory reuse
for a service unit; instead, it's arranging to connect (only)
specific sockets to its general, broad IP access control eBPF
programs. Programs that make their own listening sockets won't be
doing anything to get eBPF programs attached to them (and only
them), so we're out of luck.
(One could experiment with relocating programs between cgroups,
with the initial cgroup in which the program creates its listening
sockets restricted and the other not, but I will leave that up to
interested parties.)
I've said before in various contexts (eg)
that I'm very attached to the venerable xterm as my terminal
(emulator) program, and I'm not looking forward to the day that I
may have to migrate away from it due to Wayland (although I probably
can keep running it under XWayland, now that I think about it). But
I've never tried to write down a list of the things that make me
so attached to it over other alternatives like urxvt, much less
more standard ones like gnome-terminal. Today I'm going to try to
do that, although my list is probably going to be incomplete.
The ability to turn off all terminal colours, because they
often don't work in my preferred terminal colours. Other terminal programs have somewhat
different and sometimes less annoying colours, but it's still far too easy for programs to display things in unreadable colours.
Yes, I can set my shell environment and many programs to not use
colours, but I can't set all of them; some modern programs simply
always use colours on terminals. Xterm can be set to completely
ignore them.
I'm very used to xterm's specific behavior when it comes to what
is a 'word' for double-click selection. You can read the full
details in the xterm manual page's section on character classes.
I'm not sure if it's possible to fully emulate this behavior in other
terminal programs; I once made an incomplete attempt in urxvt, while gnome-terminal is quite different and has little
or no options for customizing that behavior (in the Gnome way).
Generally the modern double click selection behavior is too broad for
me.
(For instance, I'm extremely attached to double-click selecting
only individual directories in full paths, rather than the entire
thing. I can always swipe to select an entire path, but if I
can't pick out individual path elements with a double click my
only choice is character by character selection, which is a giant
pain.)
Based on a quick experiment, I think I can make KDE's konsole
behave more or less the way I want by clearing out its entire set
of "Word characters" in profiles. I think this isn't quite how
xterm behaves but it's probably close enough for my reflexes.
Xterm doesn't treat text specially because of its contents, for
example by underlining URLs or worse, hijacking clicks on them to
do things. I already have well evolved systems for dealing with
things like URLs and I don't want my terminal emulator to provide
any 'help'. I believe that KDE's konsole can turn this off, but
gnome-terminal doesn't seem to have any option for it.
Many of xterm's behaviors can be controlled from command line
switches. Some other terminal emulators (like gnome-terminal)
force you to bundle these behaviors together as 'profiles' and
only let you select a profile. Similarly, a lot of xterm's behavior
can be temporarily changed on the fly through its context menus,
without having to change the profile's settings (and then change
them back).
Every xterm window is a completely separate program that starts
from scratch, and xterm is happy to run on remote servers without
complications; this isn't something I can say for all other
competitors. Starting from scratch also means things like not
deciding to place yourself where your last window was, which is
konsole's behavior (and infuriates me).
Of these, the hardest two to duplicate are probably xterm's double
click selection behavior of what is a word and xterm's large selection
behavior. The latter is hard because it requires the terminal program
to not use mouse button 3 for a popup menu.
I use some other xterm features, like key binding,
including duplicating windows, but I
could live without them, especially if the alternate terminal program
directly supports modern cut and paste
in addition to xterm's traditional style. And I'm accustomed to a
few of xterm's special control characters,
especially Ctrl-space, but I think this may be pretty universally
supported by now (Ctrl-space is in gnome-terminal).
There are probably things that other terminal programs like konsole,
gnome-terminal and so on do that I don't want them to (and that
xterm doesn't). But since I don't use anything other than xterm
(and a bit of gnome-terminal and once in a while a bit of urxvt),
I don't know what those undesired features are. Experimenting with
konsole for this entry taught me some things I definitely don't
want, such as it automatically placing itself where it was before
(including placing a new konsole window on top of one of the existing
ones, if you have multiple ones).
One of the famous things that people run into with the Bourne shell
is that it draws a distinction between plain shell variables and
special exported shell variables, which are put into the environment
of processes started by the shell. This distinction is a source of
frustration when you set a variable, run a program, and the program
doesn't have the variable available to it:
$ GODEBUG=...
$ go-program
[doesn't see your $GODEBUG setting]
It's also a source of mysterious failures, because more or less all
of the environment variables that are present automatically become
exported shell variables. So whether or not 'GODEBUG=..; echo
running program; go-program' works can depend on whether $GODEBUG
was already set when your shell started. The environment variables
of regular shell sessions are usually fairly predictable, but the
environment variables present when shell scripts get run can be
much more varied. This makes it easy to write a shell script that
only works right for you, because in your environment it runs with
certain environment variables set and so they automatically become
exported shell variables.
I've told you all of that because despite these pains, I believe
that the Bourne shell made the right choice here, in addition to a
pragmatically necessary choice at the time it was created, in V7
(Research) Unix. So let's start with the pragmatics.
The Bourne shell was created alongside environment variables
themselves, and on the comparatively
small machines that V7 ran on, you didn't have much room for the
combination of program arguments and the new environment. If either
grew too big, you got 'argument list too long'
when you tried to run programs. This made it important to minimize
and control the size of the environment that the shell gave to new
processes. If you want to do that without limiting the use of shell
variables so much, a split between plain shell variables and exported
ones makes sense and requires only a minor bit of syntax (in the
form of 'export').
Both machines and exec() size limits are much larger now, so you
might think that getting rid of the distinction is a good thing.
The Bell Labs Research Unix people thought so, so they did do this
in Tom Duff's rc shell for V10 Unix and Plan 9. Having used both
the Bourne shell and a version of rc for
many years, I both agree and disagree with them.
For interactive use, having no distinction between shell variables
and exported shell variables is generally great. If I set $GODEBUG,
$PYTHONPATH, or any number of any other environment variables that
I want to affect programs I run, I don't have to remember to do a
special 'export' dance; it just works. This is a sufficiently
nice (and obvious) thing that it's an option for the POSIX 'sh',
in the form of 'set -a'
(and this set option is present in more or less all modern Bourne
shells, including Bash).
('Set -a' wasn't in the V7 sh, but
I haven't looked to see where it came from. I suspect that it may
have come from ksh, since POSIX took a lot of the specification for
their 'sh' from ksh.)
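For example, with 'set -a' in effect, the failing sequence from the start of this entry works:
$ set -a
$ GODEBUG=...
$ go-program
[now sees your $GODEBUG setting]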
For shell scripting, however, not having a distinction is messy and
sometimes painful. If I write an rc script, every shell variable
that I use to keep track of something will leak into the environment
of programs that I run. The shell variables for intermediate results,
the shell variables for command line options, the shell variables
used for for loops, you name it, it all winds up in the environment
unless I go well out of my way to painfully scrub them all out. For
shell scripts, it's quite useful to have the Bourne shell's strong
distinction between ordinary shell variables, which are local to
your script, and exported shell variables, which you deliberately
act to make available to programs.
(This comes up for shell scripts and not for interactive use because
you commonly use a lot more shell variables in shell scripts than
you do in interactive sessions.)
For a new Unix shell today that's made primarily or almost entirely
for interactive use, automatically exporting shell variables into
the environment is probably the right choice. If you wanted to be
slightly more selective, you could make it so that shell variables
with upper case names are automatically exported and everything
else can be manually exported. But for a shell that's aimed at
scripting, you want to be able to control and limit variable scope,
only exporting things that you explicitly want to.
Well, it does (as far as I can tell, without deep testing). If you
want to limit how much of the system's memory people who log in can
use so that system services don't explode, you can set MemoryMin=
on system.slice to guarantee some amount of memory to it and all
things under it. Alternately, you can set MemoryMax=
on user.slice, collectively limiting all user sessions to that
amount of memory. In either case my view is that you might want to
set MemorySwapMax=
on user.slice so that user sessions don't spend all of their time
swapping. Which one you set things on depends on which is easier
and you trust more; my inclination is MemoryMax, although that
means you need to dynamically size it depending on this machine's
total memory.
(If you want to limit user memory use you'll need to make sure that
things like user cron jobs are forced into user sessions, rather than running under cron.service in
system.slice.)
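As a sketch of the second approach, a drop-in for user.slice might look like this (the file name is my own choice and the sizes are made up; you'd size them to the machine):
# /etc/systemd/system/user.slice.d/90-memory-limits.conf
[Slice]
MemoryMax=24G
MemorySwapMax=2G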
Of course this is what you should expect, given systemd's documentation
and the kernel documentation.
On the other hand, the Linux kernel cgroup and memory system is
sufficiently opaque and ever changing that I feel the need to verify
that things actually do work (in our environment) as I expect them
to. Sometimes there are surprises, or settings that nominally work
but don't really affect things the way I expect.
This does raise the question of how much memory you want to reserve
for the system. It would be nice if you could use systemd-cgtop
to see how much memory your system.slice is currently using, but
unfortunately the number it will show is potentially misleadingly
high. This is because the memory attributed to any cgroup includes
(much) more than program RAM usage.
For example, on our servers it seems
typical for system.slice to be using under a gigabyte of 'user' RAM
but also several gigabytes of filesystem cache and other kernel
memory. You probably want to allow for some of that in what memory
you reserve for system.slice, but maybe not all of the current
usage.
(You can get the current version of the 'memdu' program I use
as memdu.py.)
Ah yes, GNOME, it is of course my mistake that I used gconf-editor
instead of dconf-editor. But at least now Gnome-Terminal no longer
intercepts F11, so I can possibly use g-t to enter F11 into serial
consoles to get the attention of a BIOS. If everything works in UEFI
land.
Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern
Gnome program, especially a standard Gnome program like gnome-terminal,
it will use GSettings and you will want to use dconf-editor
to modify its settings outside of whatever Preferences dialogs it
gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command
line.
If the program you're using hasn't been updated to the latest things
that Gnome is doing, for example Thunderbird (at least as of 2024), then it will
still be using GConf. You need to edit its settings using
gconf-editor or gconftool-2, or possibly you'll need to look
at the GConf version of general Gnome settings. I don't know if
there's anything in Gnome that synchronizes general Gnome GSettings
settings into GConf settings for programs that haven't yet been
updated.
(This is relevant for programs, like Thunderbird, that use
general Gnome settings for things like 'how to open a particular
sort of thing'. Although I think modern Gnome may not have very
many settings for this because it always goes to the GTK GIO
system, based on the Arch Wiki's page
on Default Applications.)
Because I've made this mistake between gconf-editor and dconf-editor
more than once, I've now created a personal gconf-editor cover script
that prints an explanation of the situation when I run it without a
special --really argument. Hopefully this will keep me sorted out the
next time I run gconf-editor instead of dconf-editor.
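(The cover script itself is nothing special; it's roughly the following, sitting earlier in my $PATH than the real binary, whose location I'm assuming here:)
#!/bin/sh
# remind me that gconf-editor is the old GConf system, not GSettings/dconf
if [ "$1" != "--really" ]; then
    echo "gconf-editor edits the old GConf settings; you probably want dconf-editor." 1>&2
    echo "Use 'gconf-editor --really' if you actually mean GConf." 1>&2
    exit 1
fi
shift
exec /usr/bin/gconf-editor "$@"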
PS: Probably I want to use gsettings instead of dconf-editor and
dconf as much as possible, since gsettings works through the
GSettings layer and so apparently has more safety checks than
dconf-editor and dconf do.
For reasons outside of the scope of this entry, I want to test how
various systemd memory resource limits
work and interact with each other (which means that I'm really
digging into cgroup v2 memory controls).
When I started trying to do this, it turned out that I had no good
test program (or programs), although I had some that gave me
partial answers.
There are two complexities in memory usage testing programs in a
cgroups environment. First, you may be able to allocate more memory
than you can actually use, depending on your system's settings for
strict overcommit. So it's not enough to see
how much memory you can allocate using the mechanism of your choice
(I tend to use mmap() rather than
go through language allocators). After you've either determined how
much memory you can allocate or allocated your target amount, you
have to at least force the kernel to materialize your memory by
writing something to every page of it. Since the kernel can probably
swap out some amount of your memory, you may need to keep repeatedly
reading all of it.
The second issue is that if you're not in strict overcommit (and
sometimes even if you are), the kernel
can let you allocate more memory than you can actually use and then, when you try to use it, hit you with the OOM killer. For my testing, I
care about the actual usable amount of memory, not how much memory
I can allocate, so I need to deal with this somehow (and this is
where my current test programs are inadequate). Since the OOM killer
can't be caught by a process (that's sort of the point), the simple
approach is probably to have my test program progressively report
on how much memory it's touched so far, so I can see how far it got
before it was OOM-killed. A more complex approach would be to do
the testing in a child process with progress reports back to the
parent so it could try to narrow in on how much it could use rather
than me guessing that I wanted progress reports every, say, 16
MBytes or 32 MBytes of memory touching.
(Hopefully the OOM killer would only kill the child and not the
parent, but with the OOM killer you can never be sure.)
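A minimal sketch of the simple version, in Python (the use of an anonymous mmap(), the reporting interval, and the progress format are all just my choices here):
import mmap
import sys

CHUNK = 32 * 1024 * 1024          # report every 32 MBytes touched

def touch_memory(total_bytes):
    # allocate anonymous memory, then force the kernel to materialize it
    # by writing to every page; print progress as we go so we can see
    # how far we got if the OOM killer steps in.
    mem = mmap.mmap(-1, total_bytes)
    page = mmap.PAGESIZE
    touched = 0
    for off in range(0, total_bytes, page):
        mem[off] = 1
        touched += page
        if touched % CHUNK == 0:
            print("touched %d MBytes" % (touched // (1024 * 1024)), flush=True)
    print("touched all %d bytes" % total_bytes, flush=True)

if __name__ == "__main__":
    # argument is how many MBytes to try to use
    touch_memory(int(sys.argv[1]) * 1024 * 1024)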
I'm probably not the first person to have this sort of need, so I
suspect that other people have written test programs and maybe even
put them up somewhere. I don't expect to be able to find them in
today's ambient Internet search noise, plus this is very close to
the much more popular issue of testing your RAM.
(Will I put up my little test program when I hack it up? Probably
not, it's too much work to do it properly, with actual documentation
and so on. And these days I'm not very enthused about putting more
repositories on Github, so I'd need to find some alternate place.)
The original Bill Joy vi famously only had a single level of undo
(which is part of what makes it a product of its time). The 'u' command either undid your latest
change or it redid the change, undo'ing your undo. When POSIX and
the Single Unix Specification wrote vi into the standard, they
required this behavior; the vi
specification requires 'u' to work the same as it does in ex, where
it is specified as:
Reverse the changes made by the last command that modified the
contents of the edit buffer, including undo.
This is one particular piece of POSIX compliance that I think
everyone should ignore.
Vim and its
derivatives ignore the POSIX requirement and implement multi-level
undo and redo in the usual and relatively obvious way. The vim
'u' command only undoes changes but it can undo lots of them, and
to redo changes you use Ctrl-r ('r' and 'R' were already taken).
Because 'u' (and Ctrl-r) are regular commands they can be used with
counts, so you can undo the last 10 changes (or redo the last 10
undos). Vim can be set to vi compatible behavior if you want.
I believe that vim's multi-level undo and redo is the default
even when it's invoked as 'vi' in an unconfigured environment,
but I can't fully test that.
Nvi has opted to remain POSIX
compliant and operate in the traditional vi way, while still
supporting multi-level undo. To get multi-level undo in nvi, you
extend the first 'u' with '.' commands, so 'u..' undoes the most
recent three changes. The 'u' command can be extended with '.'
in either of its modes (undo'ing or redo'ing), so 'u..u..' is
a no-op. The '.' operation doesn't appear to take a count in nvi,
so there is no way to do multiple undos (or redos) in one action;
you have to step through them by hand. I'm not sure how nvi reacts
if you want to do things like move your cursor position during an undo
or redo sequence (my limited testing suggests that it can perturb
the sequence, so that '.' now doesn't continue undoing or redoing
the way vim will continue if you use 'u' or Ctrl-r again).
The vi emulation package evil
for GNU Emacs inherits GNU Emacs' multi-level undo and nominally
binds undo and redo to 'u' and Ctrl-r respectively. However, I don't
understand its actual stock undo behavior. It appears to do multi-level
undo if you enter a sequence of 'u' commands and accepts a count
for that, but it doesn't feel vi or vim compatible if you intersperse
'u' commands with things like cursor movement, and I don't understand
redo at all (evil has some customization settings for undo behavior,
especially evil-undo-system).
I haven't investigated Evil extensively and this undo and redo stuff
makes me less likely to try using it in the future.
The BusyBox implementation
of vi is minimal but it can be built with support for 'u' and
multi-level undo, which is done by repeatedly invoking 'u'. It
doesn't appear to have any redo support, which makes a certain
amount of sense in an environment where your biggest concern may be
reverting things so they're no worse than they started out. The
Ubuntu and Fedora versions of busybox appear to be built this way,
but your distance may vary on other Linuxes.
My personal view is that the vim undo and redo behavior is the best
and most human friendly option. Undo and redo are predictable and
you can predictably intersperse undo and redo operations with other
operations that don't modify the buffer, such as moving the cursor,
searching, and yanking portions of text. The nvi behavior essentially
creates a special additional undo mode, where you have to remember
that you're in a sequence of undo or redo operations and you can't
necessarily do other vi operations in the middle (such as cursor
movement, searches, or yanks). This matters a lot to me because I
routinely use multi-level undo when I'm writing text to rewind my
buffer to a previous state and yank out some wording that I've
decided I like better than its replacement.
(For additional vi versions, on the Fediverse, I was also
pointed to nextvi, which appears to use
vim's approach to undo and redo; I believe neatvi also does this but I can't
spot any obvious documentation on it. There are vi-inspired editors
such as vile and vis, but they're not things people
would normally use as a direct replacement for vi. I believe that
vile follows the nvi approach of 'u.' while vis follows the vim
model of 'uu' and Ctrl-r.)
I recently discovered a surprising path to accessing localhost
URLs and services, where
instead of connecting to 127.0.0.1 or the IPv6 equivalent, you
connect to 0.0.0.0 (or the IPv6 equivalent). In that entry I
mentioned that I didn't know if systemd's IPAddressDeny
would block this. I've now tested this, and the answer is that
systemd's restrictions do block this. If you set
'IPAddressDeny=localhost', the service or whatever is blocked from
the 0.0.0.0 variation as well (for both outbound and inbound
connections). This is exactly the way it should be, so you might
wonder why I was uncertain and felt I needed to test it.
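For concreteness, the setting in question is just a line in the unit file or in a drop-in override, along these lines (the unit name here is made up):
# /etc/systemd/system/example.service.d/ipfilter.conf
[Service]
IPAddressDeny=localhost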
There are a variety of ways at different levels that you might
implement access controls on a process (or a group of processes)
in Linux, for IP addresses or anything else. For example, you might
create an eBPF program that filtered the system calls and system
call arguments allowed and attach it to a process and all of its
children using seccomp(2).
Alternately, for filtering IP connections specifically, you might
use a cgroup socket address eBPF program
(also), which is among
the cgroup program types
that are available. Or perhaps you'd prefer to use a cgroup socket
buffer program.
How a program such as systemd implements filtering has implications
for what sort of things it has to consider and know about when doing
the filtering. For example, if we reasonably conclude that the
kernel will have mapped 0.0.0.0 to 127.0.0.1 by the time it invokes
cgroup socket address eBPF programs, such a program doesn't need
to have any special handling to block access to localhost by people
using '0.0.0.0' as the target address to connect to. On the other
hand, if you're filtering at the system call level, the kernel has
almost certainly not done such mapping at the time it invokes you,
so your connect() filter had better know that '0.0.0.0' is equivalent
to 127.0.0.1 and it should block both.
This diversity is why I felt I couldn't be completely sure about
systemd's behavior without actually testing it. To be honest, I
didn't know what the specific options were until I researched them
for this entry. I knew systemd used eBPF for IPAddressDeny
(because it mentions that in the manual page in passing), but I vaguely
knew there were a lot of ways and places to use eBPF, and I didn't know
whether systemd's way needed to specifically know about 0.0.0.0 or, if
so, whether systemd actually did.
Sidebar: What systemd uses
As I found out through use of 'bpftool cgroup list
/sys/fs/cgroup/<relevant thing>' on a systemd service that I knew
uses systemd IP address filtering, systemd uses cgroup socket
buffer programs, and
is presumably looking for good and bad IP addresses and netblocks
in those programs. This unfortunately means that it would be hard for
systemd to have different filtering for inbound connections as opposed
to outgoing connections, because at the socket buffer level it's all
packets.
Recently I saw another discussion of how some people are very
attached to the original, classical vi and its behaviors (cf).
I'm quite sympathetic to this view, since I too am very attached
to the idiosyncratic behavior of various programs I've gotten used
to (such as xterm's very specific behavior in various areas), but
at the same time I had a hot take over on the Fediverse:
Hot take: basic vim (without plugins) is mostly what vi should have
been in the first place, and much of the differences between vi
and vim are improvements. Multi-level undo and redo in an obvious
way? Windows for easier multi-file, cross-file operations? Yes please,
sign me up.
Basic vi is a product of its time, namely the early 1980s, and the
rather limited Unix machines of the time (yes a VAX 11/780 was
limited).
(The touches of vim superintelligence, not so much, and I turn them
off.)
For me, vim is a combination of genuine improvements in vi's core
editing behavior (cf), frustrating (to
me) bits of trying too hard to be smart (which I mostly disable
when I run across them), and an extension mechanism I ignore but
people use to make vim into a superintelligent editor with things
like LSP
integrations.
Some of the improvements and additions to vi's core editing may be
things that Bill Joy either didn't think of or didn't think were
important enough. However, I feel strongly that some or even many
of the omitted features and differences are a product of the limited
environments vi had to operate in. The poster child for this is
vi's support of only a single level of undo, which drastically
constrains the potential memory requirements (and implementation
complexity) of undo, especially since a single editing operation
in vi can make sweeping changes across a large file (consider a
whole-file ':...s/../../' substitution, for example).
(The lack of split windows might be one part memory limitations and
one part that splitting an 80 by 24 serial terminal screen is much
less useful than splitting, say, an 80 by 50 terminal window.)
Vim isn't the only improved version of vi that has added features
like multi-level undo and split windows so you can see multiple
files at once (or several parts of the same file); there's also at
least nvi. I'm used to vim so I'm biased, but I happen to think
that a lot of vim's choices for things like multi-level undo are
good ones, ones that will be relatively obvious and natural to new
people and avoid various sorts of errors and accidents. But other
people like nvi and I'm not going to say they're wrong.
I do feel strongly that giving stock vi to anyone who doesn't
specifically ask for it is doing them a disservice, and this includes
installing stock vi as 'vi' on new Unix installs. At this point,
what new people are introduced to and what is the default on systems
should be something better and less limited than stock vi. Time has
moved on and Unix systems should move on with it.
(I have similar feelings about the default shell for new accounts
for people, as opposed to system accounts. Giving people bare Bourne
shell is not doing them any favours and is not likely to make a
good first impression. I don't care what you give them but it should
at least support cursor editing, file completion, and history, and
those should be on by default.)
PS: I have complicated feelings about Unixes that install stock vi
as 'vi' and something else under its full name, because on the one
hand that sounds okay but on the other hand there is so much stuff
out there that says to use 'vi' because that's the one name that's
universal. And if you then make 'vi' the name of the default (visual)
editor, well, it certainly feels like you're steering new people
into it and doing them a disservice.
(I don't expect to change the mind of any Unix that is still shipping
stock vi as 'vi'. They've made their cultural decisions a long time
ago and they're likely happy with the results.)
One of the important roles of Linux system package managers like
dpkg and RPM is providing a single
interface to building programs from source even though the programs
may use a wide assortment of build processes. One of the source
building features that both dpkg and RPM included (I believe from
the start) is patching the upstream source code, as well as providing
additional files along with it. My impression is that today this is
considered much less important in package managers, and some may make
it at least somewhat awkward to patch the source release on the fly.
Recently I realized that there may be a reason for this potential
oddity in dpkg and RPM.
Both dpkg and RPM are very old (by Linux standards). As covered in
Andrew Nesbitt's Package Manager Timeline, both
date from the mid-1990s (dpkg in January 1994, RPM in September
1995). Linux itself was quite new at the time and the Unix world
was still dominated by commercial Unixes (partly because the march
of x86 PCs was only just starting). As a
result, Linux was a minority target for a lot of general Unix free
software (although obviously not for Linux specific software). I
suspect that this was compounded by limitations in early Linux libc,
where apparently it had some issues with standards (see eg this,
also,
also,
also).
As a minority target, I suspect that Linux regularly had problems
compiling upstream software, and for various reasons not all upstreams
were interested in fixing (or changing) that (especially if it
involved accepting patches to cope with a non standards compliant
environment; one reply was to tell Linux to get standards compliant).
This probably left early Linux distributions regularly patching
software in order to make it build on (their) Linux, leading to
first class support for patching upstream source code in early
package managers.
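As an illustrative sketch of what this first class support looks like in RPM terms, a spec file can declare and apply a patch with not much more than the following (the patch name is invented, and %autosetup applies all declared patches):
Patch0: fix-linux-build.patch
%prep
%autosetup -p1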
(I don't know for sure because at that time I
wasn't using Linux or x86 PCs, and I might have been vaguely in the
incorrect 'Linux isn't Unix' camp. My first Linux came somewhat later.)
These days things have changed drastically. Linux is much more
standards compliant and of course it's a major platform. Free
software that works on non-Linux Unixes but doesn't build cleanly
on Linux is a rarity, so it's much easier to imagine (or have) a
package manager that is focused on building upstream source code
unaltered and where patching is uncommon and not as easy (or trivial)
as dpkg and RPM make it.
(You still need to be able to patch upstream releases to handle
security patches and so on, since projects don't necessarily publish
new releases for them. I believe some projects simply issue patches
and tell you to apply them to their current release. And you may
have to backport a patch yourself if you're sticking on an older
release of the project that they no longer do patches for.)
Today's other work achievement: getting a UEFI booted FreeBSD 15
machine to use a serial console on its second serial port, not its
first one. Why? Because the BMC's Serial over Lan stuff appears to be
hardwired to the second serial port, and life is too short to wire up
physical serial cables to test servers.
The basics of serial console support for your FreeBSD machine are
covered in the loader.conf manual page,
under the 'console' setting (in the 'Default Settings' section).
But between UEFI and FreeBSD's various consoles, things get
complicated, and for me the manual pages didn't do a great job of
putting the pieces together clearly. So I'll start with my descriptions
of all of the loader.conf variables that are relevant:
console="efi,comconsole"
Sets both the bootloader console and
the kernel console to both the EFI console and the serial port,
by default COM1 (ttyu0, Linux ttyS0). This is somewhat harmful if
your UEFI BIOS is already echoing console output to the serial
port (or at least to the serial port you want); you'll get doubled
serial output from the FreeBSD bootloader, but not doubled output
from the kernel.
boot_multicons="YES"
As covered in loader_simp(8),
this establishes multiple low level consoles for kernel messages.
It's not necessary if your UEFI BIOS is already echoing console
output to the serial port (and the bootloader and kernel can
recognize this), but it's harmless to set it just in case.
comconsole_speed="115200"
Sets the serial console speed
(and in theory 115200 is the default). It's not necessary if the
UEFI BIOS has set things up but it's harmless. See loader_simp(8)
again.
comconsole_port="0x2f8"
Sets the serial port used to COM2.
It's not necessary if the UEFI BIOS has set things up, but again
it's harmless. You can use 0x3f8 to specify COM1, although it's
the default. See loader_simp(8).
hw.uart.console="io:0x2f8,br:115200"
This tells the kernel
where the serial console is and what baud rate it's at, here COM2
and 115200 baud. The loader will automatically set it for you if
you set the comconsole_* variables, either because you also
need a 'console=' setting or because you're being redundant.
See loader.efi(8) (and
then loader_simp(8) and uart(4)).
(That the loader does this even without a 'comconsole' in your
nonexistent 'console=' line may some day be considered a bug and
fixed.)
If they agree with each other, you can safely set both hw.uart.console
and the comconsole_* variables.
On a system where the UEFI BIOS isn't echoing the UEFI console
output to a serial port, the basic way to have FreeBSD use both
the video console (settings for which are in vt(4)) and the
serial console (on the default of COM1), with the video console
as the primary, is a loader.conf setting of:
console="efi,comconsole"
boot_multicons="YES"
This will change both the bootloader console and the kernel console
after boot. If your UEFI BIOS is already echoing 'console' output
to the serial port, bootloader output will be doubled and you'll
get to see fun bootloader output like:
If you see this (or already know that your UEFI BIOS is doing this),
the minimal alternate loader.conf settings (for COM1) are:
# for COM1 / ttyu0
hw.uart.console="io:0x3f8,br:115200"
(The details are covered in loader.efi(8)'s
discussion of console considerations.)
If you don't need a 'console=' setting because of your UEFI BIOS,
you must set either hw.uart.console or the comconsole_*
settings. Technically, setting hw.uart.console is the correct
approach; the fact that setting only comconsole_* still works may be a
bug.
If you don't explicitly set a serial port to use, FreeBSD will use
COM1 (ttyu0, Linux ttyS0) for the bootloader and kernel. This is
only possible if you're using 'console=', because otherwise you
have to directly or indirectly set 'hw.uart.console', which directly
tells the kernel which serial port to use (and the bootloader will
use whatever UEFI tells it to). To change the serial port to COM2,
you need to set the appropriate one of 'comconsole_port' and
'hw.uart.console' from 0x3f8 (COM1) to the right PC port value
of 0x2f8.
So our more or less final COM2 /boot/loader.conf for a case where
you can turn off or ignore the BIOS echoing to the serial console
is:
console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# For the COM2 case
comconsole_port="0x2f8"
If your UEFI BIOS is already echoing 'console' output to the serial
port, the minimal version of the above (again for COM2) is:
# For the COM2 case
hw.uart.console="io:0x2f8,br:115200"
(As with Linux, the FreeBSD kernel will only use one serial port
as the serial console; you can't send kernel messages to two serial
ports. FreeBSD at least makes this explicit in its settings.)
As covered in conscontrol and elsewhere,
FreeBSD has a high level console, represented by /dev/console,
and a low level console, used directly by the kernel for things
like kernel messages. The high level console can only go to one
device, normally the first one; this is either the first one in
your 'console=' line or whatever UEFI considers the primary
console. The low level console can go to multiple devices. Unlike
Linux, this can be changed on the fly once the system is up through
conscontrol (and its state can also be checked).
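For the record, running conscontrol with no arguments reports the current state, and as I understand it 'conscontrol add' and 'conscontrol delete' change the low level console list on the fly (the device name here is illustrative):
conscontrol
conscontrol add ttyu1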
Conveniently, you don't need to do anything to start a serial login
on your chosen console serial port. All four possible (PC) serial
ports, /dev/ttyu0 through /dev/ttyu3, come pre-set in /etc/ttys
with 'onifconsole' (and 'secure'), so that if the kernel is using
one of them, there's a getty started on it. I haven't tested what
happens if you use conscontrol to change the console on the
fly.
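(The stock /etc/ttys entries look roughly like this, although the getty class and terminal type may differ between FreeBSD versions:
ttyu0	"/usr/libexec/getty 3wire"	vt100	onifconsole secure
with ttyu1 through ttyu3 following the same pattern.)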
Booting FreeBSD on a UEFI based system is covered through the manual
page series of uefi(8), boot(8),
loader.efi(8), and loader(8). It's
not clear to me if loader.efi is the EFI specific version of
loader(8), or if the one loads and starts the other in a multi-stage
boot process. I suspect it's the former.
Sidebar: What we may wind up with in loader.conf
Here's what I think is a generic commented block for serial console
support:
# Uncomment if the UEFI BIOS does not echo to serial port
#console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# Uncomment for COM2
#comconsole_port="0x2f8"
# change 0x3f8 (COM1) to 0x2f8 for COM2
hw.uart.console="io:0x3f8,br:115200"
All of this works for me on FreeBSD 15, but your distance may vary.
The abstract way to describe why
is to say that Linux distributions had to assemble a whole thing
from separate pieces; the kernel came from one place, libc from
another, coreutils from a third, and so on. The concrete version
is to think about what problems you'd have without a package manager.
Suppose that you assembled a directory tree of all of the source
code of the kernel, libc, coreutils, GCC, and so on. Now you need
to build all of these things (or rebuild, let's ignore bootstrapping
for the moment).
Building everything is complicated partly because everything goes
about it differently. The kernel has its own configuration and build
system, a variety of things use autoconf but not necessarily with
the same set of options to control things like features, GCC has a
multi-stage build process, Perl has its own configuration and
bootstrapping process, X is frankly weird and vaguely terrifying,
and so on. Then not everyone uses 'make install' to actually install
their software, so you have another set of variations for all of
this.
(The less said about the build processes for either TeX or GNU
Emacs in the early to mid 1990s, the better.)
If you do this at any scale, you need to keep track of all of this
information (cf) and you want
a uniform interface for 'turn this piece into a compiled and ready
to unpack blob'. That is, you want a source package (which encapsulates all of the 'how to do
it' knowledge) and a command that takes a source package and does
a build with it. Once you're building things that you can turn into
blobs, it's simpler to always ship a new version of the blob
whenever you change anything.
(You want the 'install' part of 'build and install' to result in a
blob rather than directly installing things on your running system
because until it finishes, you're not entirely sure the build and
install has fully worked. Also, this gives you an easy way to split
the overall system up into multiple pieces, some of which people don't
have to install. And in the very early days, to split them across
multiple floppy disks, as SLS did.)
Now you almost have a system package manager with source packages
and binary packages. You're building all of the pieces of your Linux
distribution in a standard way from something that looks a lot like
source packages, and you pretty much want to create binary blobs
from them rather than dump everything into a filesystem. People
will obviously want a command that takes a binary blob and 'installs'
it by unpacking it on their system (and possibly extra stuff),
rather than having to run 'tar whatever' all the time themselves,
and they'll also want to automatically keep track of which of your
packages they've installed rather than having to keep their own
records. Now you have all of the essential parts of a system package
manager.
(Both dpkg and RPM also keep track of which package installed what
files, which is important for upgrading and removing packages, along
with things having versions.)
We have a collection of VPN
servers, some OpenVPN based and some L2TP based. They used to be
based on OpenBSD, but we're moving from OpenBSD to FreeBSD and the VPN servers recently
moved too. We also have a system for collecting Prometheus metrics
on VPN usage, which worked by
parsing the output of things.
For OpenVPN, our scripts just kept working when we switched to
FreeBSD because the two OSes use basically the same OpenVPN setup.
This was not the case for our L2TP VPN server.
OpenBSD does L2TP using npppd,
which supports a handy command line control program, npppctl, that can readily extract and
report status information. On FreeBSD, we wound up using mpd5. Unfortunately,
mpd5 has no equivalent of npppctl. Instead, as covered (sort of)
in its user manual,
you get your choice of a TCP based console that's clearly intended
for interactive use and a web interface that is also sort of
intended for interactive use (and isn't all that well documented).
Fortunately, one convenient thing about the web interface is that
it uses HTTP Basic authentication, which means that you can easily
talk to it through tools like curl. To do status
scraping through the web interface, first you need to turn it on
and then you need an unprivileged mpd5 user you'll use for this:
set web self 127.0.0.1 5006
set web open
set user metrics <some-password> user
At this point you can use curl to get responses from the mpd5
web server (from the local host, ie your VPN server itself):
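For example, fetching the JSON status dump described below looks like this, using the 'metrics' user and whatever password you set above:
curl -s -u metrics:<some-password> http://localhost:5006/json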
There are two useful things you can ask the web server interface
for. First, you can ask it for a complete dump of its status in
JSON format, by asking for 'http://localhost:5006/json' (although
the documentation claims that the information returned is what 'show
summary' in the console would give you, it is more than that). If
you understand mpd5 and like parsing and processing JSON, this is
probably a good option. We did not opt to do this.
The other option is that you can ask the web interface to run console
(interface) commands for you, and then give you the output in either
a 'pleasant' HTML page or in a basic plain text version. This is
done by requesting either '/cmd?<command>' or '/bincmd?<command>'
respectively. For statistics scraping, the most useful version is
the 'bincmd' one, and the command we used is 'show session':
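Via curl that request looks something like this; I believe the space in the command has to be URL-encoded, so the '%20' is my assumption:
curl -s -u metrics:<some-password> 'http://localhost:5006/bincmd?show%20session'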
(I assume 'RESULT: 0' would be something else if there was some
sort of problem.)
Of the fields in its output, the useful ones for us are the first, which gives the
local network device, the second, which gives the internal VPN IP
of this connection, and the last two, which give us the VPN user
and their remote IP. The others are internal MPD things that we
(hopefully) don't have to care about. The internal VPN IP isn't
necessary for (our) metrics but may be useful for log correlation.
To get traffic volume information, you need to extract the usage
information from each local network device that an L2TP session is
using (ie, 'ng1' and its friends). As far as I know, the only tool
for this in (base) FreeBSD is netstat. Although you can
invoke it interface by interface, probably the better thing to do
(and what we did) is to use 'netstat -ibn -f link' to dump
everything at once and then pick through the output to get the lines
that give you packet and byte counts for each L2TP interface, such
as ng1.
(I'm not sure if dropped packets is relevant for these interfaces;
if you think it might be, you want 'netstat -ibnd -f link'.)
FreeBSD has a general system, 'libxo', for producing output from
many commands in a variety of handy formats. As covered in
xo_options,
this can be used to get this netstat output in JSON if you find
that more convenient. I opted to get the plain text format and use
field numbers for the information I wanted for our VPN traffic
metrics.
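As a sketch of the sort of field-based extraction I mean, where the specific field numbers for the input and output byte counts are my assumption and you should check them against your own netstat header line:
netstat -ibn -f link | awk '$1 ~ /^ng[0-9]+$/ {print $1, $8, $11}'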
(Partly this was because I could ultimately reuse a lot of my metrics
generation tools from the OpenBSD npppctl parsing. Both environments
generated two sets of line and field based information, so a
significant amount of the work was merely shuffling around which
field was used for what.)
PS: Because of how mpd5 behaves, my view is that you don't want to
let anyone but system staff log on to the server where you're using
it. It is an old C code base and I would not trust it if people can
hammer on its TCP console or its web server. I certainly wouldn't
expose the web server to a non-localhost network, even apart from
the bit where it definitely doesn't support HTTPS.
The news of the time interval is that Linux's usual telnetd has
had a giant security vulnerability for a decade. As
people on the Fediverse observed, we've been here before; Solaris
apparently had a similar bug 20 or so years ago (which was
CVE-2007-0882, cf,
via), and AIX in
the mid 1990s (CVE-1999-0113, source, also), and also apparently
SGI Irix, and no doubt many others (eg). It's not necessarily
telnetd at fault, either, as I believe it's sometimes been rlogind.
All of these bugs have a simple underlying cause; in a way that
root cause is people using Unix correctly and according to its
virtue of modularity, where each program does one thing and you
string programs together to achieve your goal. Telnetd and rlogind
have the already complicated job of talking a protocol to the
network, setting up ptys, and so on, so obviously they should leave
the also complex job of logging the user in to login, which already
exists to do that. In theory this should work fine.
The problem with this is that from more or less the beginning, login
has had several versions of its job. From no later than V3 in 1972,
login
could also be used to switch from one user to another, not just log
in initially. In 4.2 BSD, login
was modified and reused to become part of rlogind's authentication
mechanism (really; .rhosts is checked in the 4.2BSD login.c,
not in rlogind). Later, various versions of login were modified to
support 'automatic' logins, without challenging for a password (see
eg FreeBSD login(1),
OpenBSD login(1), and Linux
login(1);
use of -f for this appears to date back to around 4.3 Tahoe).
Sometimes this was explicitly for the use of things that were running
as root and had already authenticated the login.
In theory this is all perfectly Unixy. In practice, login figured
out which of these variations of its basic job it was being used
for based on a combination of command line arguments and what UID
it was running as, which made it absolutely critical that programs
running as root that reused login never allowed login to be invoked
with arguments that would shift it to a different mode than they
expected. Telnetd and rlogind have traditionally run as root,
creating this exposure.
People are fallible, programmers included, and attackers are very
ingenious. Over the years any number of people have found any number
of ways to trick network daemons running as root into running login
with 'bad' arguments.
The one daemon I don't think has ever been tricked this way is
OpenSSH, because from very early on sshd refused to delegate logging
people in to login. Instead, sshd has its own code to log people
in to the system. This has had its complexities but has also shielded
sshd from all of these (login) context problems.
In my view, this is one of the unfortunate times when the ideals
of Unix run up against the uncomfortable realities of the world.
Network daemons delegating logging people in to login is the
correct Unix answer, but in practice it has repeatedly gone wrong
and the best answer is OpenSSH's.
Except, if you look at an actual VLAN configuration as materialized
by Netplan (or written out by hand), you'll discover a problem.
Your VLANs don't normally have .link files, only .netdev
and .network
files (and even your normal Ethernet links may not have .link files).
The AlternativeName= setting is only valid in .link files, because
networkd is like that.
(The AlternativeName= is a '[Link]' section setting and
.network files also have a '[Link]' section, but they allow
completely different sets of '[Link]' settings. The .netdev file,
which is where you define virtual interfaces, doesn't have a '[Link]'
section at all, although settings like AlternativeName= apply
to them just as much as to regular devices. Alternately, .netdev
files could support setting altnames for virtual devices in the
'[NetDev]' section alongside the mandatory 'Name=' setting.)
You can work around this indirectly, because you can create a .link
file for a virtual network device and have it work:
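(What follows is my reconstruction of such a .link file; the interface name and the altname are illustrative.)
# /etc/systemd/network/10-vlan22-mlab.link
[Match]
OriginalName=vlan22-mlab
[Link]
AlternativeName=vlan22-matterlab
AlternativeNamesPolicy=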
Networkd does the right thing here even though 'vlan22-mlab' doesn't
exist when it starts up; when vlan22-mlab comes into existence, it
matches the .link file and has the altname stapled on.
Given how awkward this is (and that not everything accepts or sees
altnames), I think it's probably not worth bothering with unless
you have a very compelling reason to give an altname to a virtual
interface. In my case, this is clearly too much work simply to give
a VLAN interface its 'proper' name.
Since I tested, I can also say that this works on a Netplan-based
Ubuntu server where the underlying VLAN is specified in Netplan.
You have to hand write the .link file and stick it in /etc/systemd/network,
but after that it cooperates reasonably well with a Netplan VLAN
setup.
This is my (sad) face that Linux interfaces have a maximum
name length. What do you mean I can't call this VLAN interface
'vlan22-matterlab'?
Also, this is my annoyed face that Canonical Netplan doesn't check
or report this problem/restriction. Instead your VLAN interface just
doesn't get created, and you have to go look at system logs to find
systemd-networkd telling you about it.
(This is my face about Netplan in general, of course. The sooner it
gets yeeted the better.)
Based on both some Internet searches and looking at kernel headers,
I believe the limit is 15 characters for the primary name of an
interface. In headers, you will find this called IFNAMSIZ (the
kernel) or IF_NAMESIZE (glibc), and it's defined to be 16 but
that includes the trailing zero byte for C strings.
(I can be confident that the limit is 15, not 16, because
'vlan22-matterlab' is exactly 16 characters long without a trailing
zero byte. Take one character off and it works.)
At the level of ip
commands, the error message you get is on the unhelpful side:
# ip link add dev vlan22-matterlab type wireguard
Error: Attribute failed policy validation.
(I picked the type for illustration purposes.)
Systemd-networkd gives you a much better error message:
/run/systemd/network/10-netplan-vlan22-matterlab.netdev:2: Interface name is not valid or too long, ignoring assignment: vlan22-matterlab
(Then you get some additional errors because there's no name.)
As mentioned in my Fediverse post, Netplan tells
you nothing. One direct consequence of this is that in any context
where you're writing down your own network interface names, such
as VLANs or WireGuard interfaces, simply having 'netplan try' or 'netplan
apply' succeed without errors does not mean that your configuration
actually works. You'll need to look at error logs and perhaps
inventory all your network devices.
As covered in the ip link manual
page, network interfaces can have either or both of aliases and
'altname' properties. These alternate names can be (much) longer
than 16 characters, and the 'ip link property' altname property can
be used in various contexts to make things convenient (I'm not sure
what good aliases are, though). However this is somewhat irrelevant
for people using Netplan, because the current Netplan YAML doesn't
allow you to set interface altnames.
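For the record, you can add an altname to an interface by hand with 'ip link property'; the device name here is illustrative:
ip link property add dev enp5s0 altname vlan22-matterlab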
You can set altnames in networkd .link files, as covered in the
systemd.link
manual page. The direct thing you want is AlternativeName=,
but apparently you may also want to set a blank alternative names
policy, AlternativeNamesPolicy=.
Of course this probably only helps if you're using systemd-networkd
directly, instead of through Netplan.
PS: Netplan itself has the notion of Ethernet interfaces having
symbolic names, such as 'vlanif0', but this is purely internal to
Netplan; it's not manifested as an actual interface altname in the
'rendered' systemd-networkd control files that Netplan writes out.
Netplan is Canonical's more or less mandatory
method of specifying networking on Ubuntu. Netplan has a collection of
limitations and irritations, and recently I ran into a new one, which
is how VLANs can and can't be specified. To explain this, I can start
with the YAML configuration language. To quote
the top level version, it looks like:
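(Paraphrasing roughly rather than quoting exactly:)
network:
  version: 2
  vlans:
    <vlan name>:
      id: <VLAN id>
      link: <underlying interface>
      # plus the usual interface properties (addresses and so on)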
To translate this, you specify VLANs separately from your Ethernet or
other networking devices. On the one hand, this is nicely flexible. On
the other hand it creates a problem, because here is what you have
to write for VLAN properties:
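Concretely, a couple of VLANs come out something like this (all of the names are invented):
network:
  version: 2
  ethernets:
    enp5s0:
      dhcp4: false
  vlans:
    vlan10:
      id: 10
      link: enp5s0
    vlan20:
      id: 20
      link: enp5s0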
Every VLAN is on top of some networking device, and because VLANs
are specified as a separate category of top level devices, you have
to name the underlying device in every VLAN (which gets very annoying
and old very fast if you have ten or twenty VLANs to specify). Did
you decide to switch from a 1G network port to a 10G network port
for the link with all of your VLANs on it? Congratulations, you get
to go through every 'vlans:' entry and change its 'link:' value.
We hope you don't overlook one.
(Or perhaps you had to move the system disks from one model of 1U
server to another model of 1U server because the hardware failed.
Or you would just like to write generic install instructions with
a generic block of YAML that people can insert directly.)
The best way for Netplan to deal with this would be to allow you
to also specify VLANs as part of other devices, especially Ethernet
devices. Then you could write:
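(This nested syntax doesn't exist today; it's a sketch of what I mean:)
network:
  version: 2
  ethernets:
    enp5s0:
      dhcp4: false
      vlans:
        vlan10:
          id: 10
        vlan20:
          id: 20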
Every VLAN specified in enp5s0's configuration would implicitly use
enp5s0 as its underlying link device, and you could rename all of
them trivially. This also matches how I think most people think of
and deal with VLANs, which is that (obviously) they're tied to some
underlying device, and you want to think of them as 'children' of
the other device.
(You can have an approach to VLANs where they're more free-floating
and the interface that delivers any specific VLAN to your server
can change, for load balancing or whatever. But you could still do
this, since Netplan will need to keep supporting the separate
'vlans:' section.)
If you want to work around this today, you have to go for the far
less convenient approach of artificial network names.
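Reconstructing the sort of thing I mean (the real device name and the VLAN are illustrative):
network:
  version: 2
  ethernets:
    vlanif0:
      match:
        name: enp5s0
  vlans:
    vlan10:
      id: 10
      link: vlanif0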
This way you only need to change one thing if your VLAN network
interface changes, but at the cost of doing a non-standard way of
setting up the base interface. (Yes, Netplan accepts it, but it's
not how the Ubuntu installer will create your netplan files and who
knows what other Canonical tools will have a problem with it as a
result.)
We have one future Ubuntu server where we're going to need to set
up a lot of VLANs on one underlying physical interface. I'm not
sure which option we're going to pick, but the 'vlanif0' option
is certainly tempting. If nothing else, it probably means we can
put all of the VLANs into a separate, generic Netplan file.
Current status: doing extremely "I don't know what I'm really doing,
I'm copying from a website¹" things with Linux tc to see if I can
improve my home Internet latency under load without doing too much
damage to bandwidth or breaking my firewall rules. So far, it seems to
work and things² claim to like the result.
What started this was running into a Fediverse post about the
bufferbloat test,
trying it, and discovering that (as expected) my home DSL link
performed badly, with significant increased latency during downloads,
uploads, or both. My memory is that reported figures went up to the
area of 400 milliseconds.
Conveniently for me, my Linux home desktop is also my DSL router;
it speaks PPPoE directly through my DSL modem. This means that doing
traffic shaping on my Linux desktop should cover everything, without
any need to wrestle with a limited router OS environment. And there
were some more or less cut and paste directions on the site.
So my outbound configuration was simple and obviously not harmful:
tc qdisc add root dev ppp0 cake bandwidth 7.6Mbit
The bandwidth is a guess, although one informed by checking both
my raw DSL line rate and what testing sites told me.
The inbound configuration was copied from the documentation and
it's where I don't understand what I'm doing:
ip link add name ifb4ppp0 type ifb
tc qdisc add dev ppp0 handle ffff: ingress
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
ip link set ifb4ppp0 up
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0
Here is what I understand about this. As covered in the tc manual page,
traffic shaping and scheduling happens only on 'egress', which is
to say for outbound traffic. To handle inbound traffic, we need a
level of indirection to a special ifb (Intermediate Functional
Block)
(also) device,
which is apparently used only for our (inbound) tc qdisc.
So we have two pieces. The first is the actual traffic shaping on
the IFB link, ifb4ppp0, and setting the link 'up' so that it will
actually handle traffic instead of throwing it away. The second is
that we have to push inbound traffic on ppp0 through ifb4ppp0 to
get its traffic shaping. To do this we add a special 'ingress' qdisc
to ppp0, which applies to inbound traffic, and then we use a tc
filter that matches all (ingress) traffic and
redirects it to ifb4ppp0 as 'egress' traffic. Since
it's now egress traffic, the tc shaping on ifb4ppp0 will now apply
to it and do things.
When I set this up I wasn't certain if it was going to break my
non-trivial firewall rules on the ppp0 interface. However, everything
seems to be fine, and the only thing the tc redirect is affecting is
traffic shaping. My firewall blocks and NAT rules are still working.
Applying these tc rules definitely improved my latency scores on
the test site; my link went
from an F rating to an A rating (and a C rating for downloads and
uploads happening at once). Does this improve my latency in practice
for things like interactive SSH connections while downloads and
uploads are happening? It's hard for me to tell, partly because I
don't do such downloads and uploads very often, especially while I'm
doing interactive stuff over SSH.
(Of course partly this is because I've sort of conditioned myself
out of trying to do interactive SSH while other things are happening
on my DSL link.)
The most I can say is that this probably improves things, and that
since my DSL connection has drifted into having relatively bad
latency to start with (by my standards), it probably helps to
minimize how much worse it gets under load.
I do seem to get slightly less bandwidth for transfers than I did
before; experimentation says that how much less can be fiddled with
by adjusting the tc 'bandwidth' settings, although that also changes
latency (more bandwidth creates worse latency). Given that I rarely
do large downloads or uploads, I'm willing to trade off slightly
lower bandwidth for (much) less of a latency hit. One reason that
my bandwidth numbers are approximate anyway is that I'm not sure
how much PPPoE DSL framing compensation I need.
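As I understand it, cake has keywords to account for link framing overhead, so the fix would be something like the following, where the specific overhead figure is only a guess on my part:
tc qdisc change root dev ppp0 cake bandwidth 7.6Mbit overhead 34
(Cake also has compound keywords for common DSL encapsulations, but I haven't worked out which one applies to my link.)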
Sidebar: A rewritten command order for ingress traffic
If my understanding is correct, we can rewrite the commands to set
up inbound traffic shaping to be more clearly ordered:
# Create and enable ifb link
ip link add name ifb4ppp0 type ifb
ip link set ifb4ppp0 up
# Set CAKE with bandwidth limits for
# our actual shaping, on ifb link.
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
# Wire ifb link (with tc shaping) to inbound
# ppp0 traffic.
tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0
The 'ifb4ppp0' name is arbitrary but conventional, set up as
'ifb4<whatever>'.
When I described my current ideal Linux source package format, I said that it should be embedded in
the source code of the software being packaged. In a comment,
bitprophet had a perfectly reasonable
and good preference the other way:
Re: other points: all else equal I think I vaguely prefer the Arch
"repo contains just the extras/instructions + a reference to the
upstream source" approach as it's cleaner overall, and makes it easier
to do "more often than it ought to be" cursed things like "apply
some form of newer packaging instructions against an older upstream
version" (or vice versa).
The Arch approach is isomorphic to the source RPM format, which has
various extras and instructions plus a pre-downloaded set of upstream
sources. It's not really isomorphic to the Debian source format
because you don't normally work with the split up version; the split
up version is just a package distribution thing (as dgit shows).
(I believe the Arch approach is also how the FreeBSD and OpenBSD
ports trees work. Also, the source package format you work in is
not necessarily how you bundle up and distribute source packages,
again as shown by Debian.)
Let's call these two packaging options the inline approach (Debian)
and the out of line approach (Arch, RPM). My view is that which
one you want depends on what you want to do with software and
packages. The out of line approach makes it easier to build unmodified
packages, and as bitprophet comments it's easy to do weird build
things. If you start from a standard template for the type of build
and install the software uses, you can practically write the packaging
instructions yourself. And the files you need to keep are quite
compact (and if you want, it's relatively easy to put a bunch of
them into a single VCS repository, each in its own subdirectory).
However, the out of line approach makes modifying upstream software
much more difficult than a good version of the inline approach (such
as, for example, dgit). To modify upstream
software in the out of line approach you have to go through some
process similar to what you'd do in the inline approach, and then
turn your modifications into patches that your packaging instructions
apply on top of the pristine upstream. Moving changes from version
to version may be painful in various ways, and in addition to those
nice compact out of line 'extras/instructions' package repos, you
may want to keep around your full VCS work tree that you built the
patches from.
(Out of line versus inline is a separate issue from whether or not
the upstream source code should include packaging instructions in
any form; I think that generally the upstream should not.)
As a system administrator, I'm biased toward easy modification
of upstream packages and thus upstream source
because that's most of why I need to build my own packages. However,
these days I'm not sure if that's what a Linux distribution should
be focusing on. This is especially true for 'rolling' distributions
that mostly deal with security issues and bugs not by patching their
own version of the software but by moving to a new upstream version
that has the security fix or bug fix. If most of what a distribution
packages is unmodified from the upstream version, optimizing for
that in your (working) source package format is perfectly sensible.
A related thing I've taken to doing before potential lurching changes
(like Linux distribution upgrades) is to take screenshots and window
images. Because comparing a now and then image is a heck of a lot
easier than restoring backups, and I can look at it repeatedly as I
fix things on the new setup.
Linux distributions and the software they package have a long history
of deciding to change things for your own good. They will tinker
with font choices, font sizes, default DPI determinations, the size
of UI elements, and so on, not quite at the drop of a hat but
definitely when you do something like upgrade your distribution and
bring in a bunch of significant package version changes (and new
programs to replace old programs).
Some people are perfectly okay with these changes. Other people,
like me, are quite attached to the specifics of how their current
desktop environment looks and will notice and be unhappy about even
relatively small changes (eg,
also). However, because we're fallible
humans, people like me can't always recognize exactly what changed
and remember exactly what the old version looked like (these two
are related); instead, sometimes all we have is the sense that
something changed but we're not quite sure exactly what or exactly
how.
Screenshots and window images are the fix for that unspecific
feeling. Has something changed? You can call up an old screenshot
to check, and to examine what changed (and then maybe work out how to reverse
it, or decide to live with the change). Screenshots aren't perfect;
for example, they won't necessarily tell you what the old fonts
were called or what sizes were being used. But they're a lot better
than trying to rely on memory or other options.
It would probably also do me good to get into the habit of taking
screenshots periodically, even outside of distribution upgrades.
Looking back over time every so often is potentially useful to see
more subtle, more long term changes, and perhaps ask myself either
why I'm not doing something any more or why I'm still doing it.
(Currently I'm somewhat lackadaisical about taking screenshots even
before distribution upgrades. I have a distribution upgrade process
but I haven't made screenshots part of it, and I don't have an
explicit checklist for the process. Which I definitely should
create. Possibly I should also
try to capture font information in text form, to the extent that
I can find it.)
I've written recently on why source packages are complicated and why packages should be declarative (in contrast to Arch style shell
scripts), but I haven't said anything about what I'd like in a
source package format, which will mostly be from the perspective
of a system administrator who sometimes needs to modify upstream
packages or package things myself.
A source package format is a compromise. After my recent experiences
with dgit, I now feel that the best
option is that a source package is a VCS repository directory tree
(Git by default) with special control files in a subdirectory.
Normally this will be the upstream VCS repository with packaging
control files and any local changes merged in as VCS commits. You
perform normal builds in this checked out repository, which has the
advantage of convenience and the disadvantage that you have to clean
up the result, possibly with liberal use of 'git clean' and 'git
reset'. Hermetic builds are done by some tool that copies the checked
out files to a build area, or clones the repository, or some other
option. If a binary package is built in an environment where this
information is available, its metadata should include the exact
current VCS commit it was built from, and I would make binary
packages not build if there were uncommitted changes.
(Making the native source package a VCS tree with all of the source
code makes it easy to work on but mingles package control files
with the program source. In today's environment with good distributed
VCSes I think this is the right tradeoff.)
The control files should be as declarative as possible, and they
should directly express major package metadata such as version
numbers (unlike the Debian package format, where the version number
is derived from debian/changelog). There should be a changelog but
it should be relatively free-form, like RPM changelogs. Changelogs
are especially useful for local modifications because they go along
with the installed binary package, which means that you can get an
answer to 'what did we change in this locally modified package'
without having to find your source. The main metadata file that
controls everything should be kept simple; I would go as far as to
say it should have a format that doesn't allow for multi-line
strings, and anything that requires multi-line strings should go
in additional separate files (including the package description).
You could make it TOML but I don't think you should make it YAML.
Both the build time actions, such as configuring and compiling
the source, and the binary package install time actions should by
default be declarative; you should be able to say 'this is an
autoconf based program and it should have the following additional
options', and the build system will take care of everything else.
Similarly you should be able to directly express that the binary
package needs certain standard things done when it's installed,
like adding system users and enabling services. However, this will
never be enough so you should also be able to express additional
shell script level things that are done to prepare, build, install,
upgrade, and so on the package. Unlike RPM and Debian source packages
but somewhat like Arch packages, these should be separate files in
the control directory, eg 'pkgmeta/build.sh'. Making these separate
files makes it much easier to do things like run shellcheck on them
or edit them in syntax-aware editor environments.
(It should be possible to combine standard declarative prepare and
build actions with additional shell or other language scripting.
We want people to be able to do as much as possible with standard,
declarative things. Also, although I used '.sh', you should be able
to write these actions in other languages too, such as Python or
Perl.)
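To make this concrete, a hypothetical control directory in this scheme might look something like the following, where all of the names are invented:
pkgmeta/
  package.toml    # simple single-line metadata: name, version, build dependencies
  description     # free-form multi-line package description
  changelog       # relatively free-form changelog
  build.sh        # extra build steps beyond the declarative ones
  install.sh      # extra install time actions
  files.main      # file list for the main binary package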
I feel that, like RPMs, you should at least by default have to
explicitly declare what files and directories are included in the
binary package. Like RPMs, these installed files should be analyzed
to determine the binary package dependencies rather than force you
to try to declare them in the (source) package metadata (although
you'll always have to declare build dependencies in the source
package metadata). Like build and install scripts, these file lists
should be in separate files, not in the main package metadata file.
The RPM collection of magic ways to declare file locations is complex
but useful so that, for example, you don't have to keep editing
your file lists when the Python version changes. I also feel that
you should have to specifically mark files in the file lists with
unusual permissions, such as setuid or setgid bits.
The natural way to start packaging something new in this system would
be to clone its repository and then start adding the package control
files. The packaging system could make this easier by having
additional tools that you ran in the root of your just-cloned
repository and looked around to find indications of things like the
name, the version (based on repository tags), the build system in
use, and so on, and then wrote out preliminary versions of the
control files. More tools could be used incrementally for things
like generating the file lists; you'd run the build and 'install'
process, then have a tool inventory the installed files for you
(and in the process it could recognize places where it should change
absolute paths into specially encoded ones for things like 'the
current Python package location').
This sketch leaves a lot of questions open, such as what 'source
packages' should look like when published by distributions. One
answer is to publish the VCS repository but that's potentially quite
heavyweight, so you might want a more minimal form. However, once
you create a 'source only' minimal form without the VCS history,
you're going to want a way to disentangle your local changes from
the upstream source.
A commentator on my entry on why Debian and RPM (source) packages
are complicated suggested looking at Arch
Linux packaging, where most of the information is in a single file
as more or less a shell script (example).
Unfortunately, I'm not a fan of this sort of shell script or shell
script like format, ultimately because it's only declarative by
convention (although I suspect Arch enforces some of those conventions).
One reason that declarative formats are important is that you can
analyze and understand what they do without having to execute code.
Another reason is that such formats naturally standardize things,
which makes it much more likely that any divergence from the standard
approach is something that matters, instead of a style difference.
Being able to analyze and manipulate declarative (source) packaging
is useful for large scale changes within a distribution. The RPM
source package format uses standard,
more or less declarative macros to build most software, which I
understand has made it relatively easy to build a lot of software
with special C and C++ hardening options. You can inject similar
things into a shell script based environment, but then you wind up
with ad-hoc looking modifications in some circumstances, as we
see in the Dovecot example.
Some things about declarative source packages versus Arch style
minimalism are issues of what could be called 'hygiene'. RPM packages
push you to list and categorize what files will be included in the
built binary package, rather than simply assuming that everything
installed into a scratch hierarchy should be packaged. This can be
frustrating (and there are shortcuts), but it does give you a chance
to avoid accidentally shipping unintended files. You could do this
with shell script style minimal packaging if you wanted to, of
course. Both RPM and Debian packages have standard and relatively
declarative ways to modify a pristine upstream package, and while
you can do that in Arch packages, it's not declarative, which hampers
various sorts of things.
Basically my feeling is that at scale, you're likely to wind up
with something that's essentially as formulaic as a declarative
source package format without having its assured benefits. There
will be standard templates that everyone is supposed to follow and
they mostly will, and you'll be able to mostly analyze the result,
and that 'mostly' qualification will be quietly annoying.
(On the positive side, the Arch package format does let you run
shellcheck on your shell stanzas, which isn't straightforward to
do in the RPM source format.)
A commentator on my early notes on dgit
mentioned that they found packaging in Debian overly complicated
(and I think perhaps RPMs as well) and would rather build and ship
a container. On the one hand, this is in a way fair; my impression
is that the process of specifying and building a container is rather
easier than for source packages. On the other hand, Debian and RPM
source packages are complicated for good reasons.
Any reasonably capable source package format needs to contain a
number of things. A source package needs to supply the original
upstream source code, some amount of distribution changes, instructions
for building and 'installing' the source, a list of (some) dependencies
(for either or both build time and install time), a list of files
and directories it packages, and possibly additional instructions
for things to do when the binary package is installed (such as
creating users, enabling services, and so on). Then generally you
need some system for 'hermetic' builds,
ones that don't depend on things in your local (Linux) login
environment. You'll also want some amount of metadata to go with
the package, like a name, a version number, and a description. Good
source package formats also support building multiple binary packages
from a single source package, because sometimes you want to split
up the built binary files to reduce the amount of stuff some people
have to install. A built binary package contains a subset of this;
it has (at least) the metadata, the dependencies, a file list, all
of the files in the file list, and those install and upgrade time
instructions.
Built containers are a self contained blob plus some metadata. You
don't need file lists or dependencies or install and removal actions
because all of those are about interaction with the rest of the
system and by design containers don't interact with the rest of the
system. To build a container you still need some of the same
information that a source package has, but you need less and it's
deliberately more self-contained and freeform. Since the built
container is a self contained artifact you don't need a file list,
I believe it's uncommon to modify upstream source code as part of
the container build process (instead you patch it in advance in
your local repository), and your addition of users, activation of
services, and so on is mostly free form and at container build time;
once built the container is supposed to be ready to go. And my
impression is that in practice people mostly don't try to do things
like multiple UIDs in a single container.
(You may still want or need to understand what things you install
where in the container image, but that's your problem to keep track
of; the container format itself only needs a little bit of information
from you.)
Containers have also learned from source packages in that they can
be layered, which is to say that you can build your container by
starting from some other container, either literally or by sticking
another level of build instructions on the end. Layered source
packages don't make any sense when you're thinking like a distribution,
but they make a lot of sense for people who need to modify the
distribution's source packages (this is what dgit makes much
easier, partly because Git is effectively
a layering system; that's one way to look at a sequence of Git
commits).
(My impression of container building is that it's a lot more ad-hoc
than package building. Both Debian and RPM have tried to standardize
and automate a lot of the standard source code building steps, like
running autoconf, but the cost of this is that each of them has a
bespoke set of 'convenient' automation to learn if you want to build
a package from scratch. With containers, you can probably mostly
copy the upstream's shell-based build instructions (or these days,
their Dockerfile).)
Dgit based building of (potentially modified) Debian packages can
be surprisingly close to the container building experience. Like
containers, you first prepare your modifications in a repository
and then you run some relatively simple commands to build the
artifacts you'll actually use. Provided that your modifications
don't change the dependencies, files to be packaged, and so on, you
don't have to care about how Debian defines and manipulates those,
plus you don't even need to know exactly how to build the software
(the Debian stuff takes care of that for you, which is to say that
the Debian package builders have already worked it out).
In general I don't think you can get much closer to the container
build experience than with the dgit build experience or the general
RPM experience (if you're starting from scratch). Packaging takes
work because packages aren't isolated, self
contained objects; they're objects that need to be integrated into
a whole system in a reversible way (ie, you can uninstall them, or
upgrade them even though the upgraded version has a somewhat different
set of files). You need more information, more understanding, and
a more complicated build process.
(Well, I suppose there are flatpaks (and snaps). But these
mostly don't integrate with the rest of your system; they're
explicitly designed to be self-contained, standalone artifacts that
run in a somewhat less isolated environment than containers.)
Suppose, not entirely hypothetically, that you've made local changes
to an Ubuntu package on one Ubuntu release, such as 22.04 ('jammy'),
and now you want to move to another Ubuntu release such as 24.04
('noble'). If you're working with straight 'apt-get source' Ubuntu
source packages, this is done by tediously copying all of your
patches over (hopefully the package uses quilt)
to duplicate and recreate your 22.04 work.
If you're using dgit, this is
much easier. Partly this is because dgit is based on Git, but partly
this is because dgit has an extremely convenient feature where it
can have several different releases in the same Git repository. So
here's what we want to do, assuming you have a dgit repository for
your package already.
(For safety you may want to do this in a copy of your repository.
I make rsync'd copies of Git repositories all the time for stuff
like this.)
Our first step is to fetch the new 24.04 ('noble') version of the
package into our dgit repository as a new dgit branch, and then
check out the branch:
dgit fetch -d ubuntu noble,-security,-updates
dgit checkout noble,-security,-updates
We could do this in one operation but I'd rather do it in two, in
case there are problems with the fetch.
The Git operation we want to do now is to cherry-pick (also) our changes to the 22.04
version of the package onto the 24.04 version of the package. If
this goes well the changes will apply cleanly and we're done.
However, there is a complication. If we've followed the usual
process for making dgit-based local changes,
the last commit on our 22.04 version is an update to debian/changelog.
We don't want that change, because we need to do our own 'gbp dch'
on the 24.04 version after we've moved our own changes over to make
our own 24.04 change to debian/changelog (among other things, the
22.04 changelog change has the wrong version number for the 24.04
package).
In general, cherry-picking all our local changes is 'git cherry-pick
old-upstream..old-local'. To get all but the last change, we want
'old-local~' instead. Dgit has long and somewhat obscure branch
names; its upstream for our 22.04 changes is
'dgit/dgit/jammy,-security,-updates' (ie, the full 'suite' name we
had to use with 'dgit clone' and 'dgit fetch'), while our local
branch is 'dgit/jammy,-security,-updates'. So our full command,
with a 'git log' beforehand to be sure we're getting what we want,
is:
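Substituting those branch names into the 'old-upstream..old-local~'
form gives something like this (a sketch; look over the 'git log'
output first to make sure it covers only your local changes):
git log --oneline dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~
git cherry-pick dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~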
(We've seen this dgit/dgit/... stuff before when doing 'gbp dch'.)
Then we need to make our debian/changelog update. Here, as an
important safety tip, don't blindly copy the command you used while
building the 22.04 package, using 'jammy,...' in the --since argument,
because that will try to create a very confused changelog of
everything between the 22.04 version of the package and the 24.04
version. Instead, you obviously need to update it to your new 'noble'
24.04 upstream, making it:
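A sketch of the updated command (here '+cslab' is just an example
local version suffix; use whatever gbp dch options you used for your
22.04 build):
gbp dch --since=dgit/dgit/noble,-security,-updates --local=+cslab --commit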
('git reset --hard HEAD~' may be useful if you make a mistake here.
As they say, ask me how I know.)
If the cherry-pick doesn't apply cleanly, you'll have to resolve
that yourself. If the cherry-pick applies cleanly but the result
doesn't build or perhaps doesn't work because the code has changed
too much, you'll be using various ways to modify and update your
changes. But at
least this is a bunch easier than trying to sort out and update a
quilt-based patch series.
Appendix: Dealing with Ubuntu package updates
Based on this conversation,
if Ubuntu releases a new version of the package, what I think I need
to do is to use 'dgit fetch' and then explicitly rebase:
dgit fetch -d ubuntu
You have to use '-d ubuntu' here or 'dgit fetch' gets confused and
fails. There may be ways to fix this with git config settings, but
setting them all is exhausting and if you miss one it explodes, so
I'm going to have to use '-d ubuntu' all the time (unless dgit fixes
this someday).
Dgit repositories don't have an explicit Git upstream set, so I
don't think we can use plain rebase. Instead I think we need
the more complicated form:
git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates
(Until I do it for real, these arguments are speculative. I believe
they should work if I understand 'git rebase' correctly, but I'm not
completely sure. I might need the full three argument form and to make
the 'upstream' a commit hash.)
Then, as above, we need to drop our debian/changelog change and redo
it:
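As the appendix itself says, this is speculative, but a minimal
sketch of dropping the old changelog commit and redoing it would be
something like:
git reset --hard HEAD~
gbp dch --since=dgit/dgit/jammy,-security,-updates --local=+cslab --commit
(with '+cslab' again standing in for whatever local suffix you use).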
(There may be a clever way to tell 'git rebase' to skip the last
change, or you can do an interactive rebase (with '-i') instead of
a non-interactive one and delete it yourself.)
I would really like to be able to patch and rebuild Ubuntu packages
from a git repository with our local changes (re)based on top of
upstream git. It would be much better than quilt'ing and debuild'ing
.dsc packages (I have non-complimentary opinions on the Debian source
package format). This news gives me hope that it'll be possible
someday, but especially for Ubuntu I have no idea how soon or how well
documented it will be.
(It could even be better than RPMs.)
The subsequent discussion got me to try out dgit, especially
since it had an attractive dgit-user(7)
manual page that gave very simple directions on how to make a local
change to an upstream package. It turns out that things aren't
entirely smooth on Ubuntu, but they're workable.
The starting point is 'dgit clone', but on Ubuntu you currently get
to use special arguments that aren't necessary on Debian:
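The shape of the command is something like this (a sketch, with
<package> standing in for the actual source package name):
dgit clone -d ubuntu <package> jammy,-security,-updates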
(You don't have to do this on a machine running 'jammy' (Ubuntu
22.04); it may be more convenient to do it from another one, perhaps
with a more up to date dgit.)
The latest Ubuntu package for something may be in either their
<release>-security or their <release>-updates 'suite', so you need
both. I think this is equivalent to what 'apt-get source' gets you,
but you might want to double check. Once you've gotten the source
in a Git repository, you can modify it and commit those modifications
as usual, for example through Magit.
If you have an existing locally patched version of the package that
you did with quilt, you can import all
of the quilt patches, either one by one or all at once and then
using Magit's selective commits to sort things out.
Having made your modifications, whether tentative or otherwise,
you can now automatically modify debian/changelog:
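Something along these lines (a sketch; '+cslab' is just an example
local version suffix):
gbp dch --since=dgit/dgit/jammy,-security,-updates --local=+cslab --commit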
(You might want to use -S for snapshots when testing modifications
and builds, I don't know. Our
practice is to use --local to add a local suffix on the upstream
package number, so we can keep our packages straight.)
The special bit is the 'dgit/dgit/<whatever you used in dgit clone>',
which tells gbp-dch
(part of the gbp suite
of stuff) where to start the changelog from. Using --commit is
optional; what I did was to first run 'gbp dch' without it, then
use 'git diff' to inspect the resulting debian/changelog changes,
and then 'git restore debian/changelog' and re-run it with a better
set of options until eventually I added the '--commit'.
You can then install build-deps (if necessary) and build the binary
packages with the dgit-user(7) recommended 'dpkg-buildpackage
-uc -b'. Normally I'd say that you absolutely want to build source
packages too, but since you have a Git repository with the state
frozen that you can rebuild from, I don't think it's necessary here.
(After the build finishes you can admire 'git status' output that
will tell you just how many files in your source tree the Debian
or Ubuntu package building process modified. One of the nice things
about using Git and building from a Git repository is that you can
trivially fix them all, rather than the usual set of painful
workarounds.)
The dgit-user(7) manual page suggests but doesn't confirm that
if you're bold, you can build from a tree with uncommitted changes.
Personally, even if I was in the process of developing changes I'd
commit them and then make liberal use of rebasing, git-absorb, and so on to keep updating
my (committed) changes.
It's not clear to me how to integrate upstream updates (for example,
a new Ubuntu update to the Dovecot package) with your local changes.
It's possible that 'dgit pull' will automatically rebase your
changes, or give you the opportunity to do that. If not, you can
always do another 'dgit clone' and then manually import your Git
changes as patches.
(A disclaimer: at this point I've only cloned, modified, and built
one package, although it's a real one we use. Still, I'm sold; the
ability to reset the tree after a build is valuable all by itself,
never mind having a better way than quilt to handle making changes.)
Over the years this difference between OpenBSD and FreeBSD was
a common point of discussion, often in overly generalised (and
as a result, deeply inaccurate) terms. Thanks to recent efforts
by Kristof Provost and Kajetan
Staszkiewicz focused on aligning FreeBSD's pf with the one in
OpenBSD, that discussion can be put to rest.
A change that's important for us in FreeBSD 15.0 is that OpenBSD
style integrated NAT rules are now supported in the FreeBSD PF.
Last year as we were exploring FreeBSD, I wrote about OpenBSD
versus FreeBSD syntax for NAT,
where a single OpenBSD rule that both passed traffic and NAT'd it
had to be split into two FreeBSD rules in the basic version. With
FreeBSD 15, we can write NAT rules using the OpenBSD version of
syntax.
(I'm talking about syntax here because I don't care about how it's
implemented behind the scenes. PF already performs some degree of
ruleset transformations, so if the syntax works and the semantics
don't change, we're happy even if a peek under the hood would show
two rules. But I believe that the FreeBSD 15 changes mean that
FreeBSD now has the OpenBSD implementation of this too.)
So far we've converted two
firewall rulesets to the old PF NAT syntax, one a simple case that's
now in production and a second, more complex one that's not yet in
production. We were holding off on our most complex PF NAT firewall,
which is complex partly because it uses some stuff that's close to
policy based routing. The release
of FreeBSD 15 will make it easier to migrate this firewall (in the
new year, we don't make big firewall changes shortly before our
winter break).
In general, I'm quite happy that FreeBSD and OpenBSD have reached
close to parity in their PF as of FreeBSD 15, because that makes
it easier to choose between them based on what other aspects of them
you like.
(I say 'close to' based on Kristof Provost's comment about the
situation on this entry. The
situation will get even better (ie, closer) in future FreeBSD
versions.)
If you use systemd units or
systemd-run to conveniently capture
output from scripts and programs into the systemd journal, one of
the things that it looks like you don't get is message priorities
and (syslog) facilities. Fortunately, systemd's journal support is
a bit more sophisticated than that.
When you print out regular output and systemd captures it into the
journal, systemd assigns it a default priority that's set with
SyslogLevel=;
this is normally 'info', which is a good default choice. Similarly,
you can pick the syslog facility associated with your unit or your
systemd-run
invocation with SyslogFacility=.
Systemd defaults to 'daemon', which may not entirely be what you
want. On the other hand, the choice of syslog facility matters less
if you're primarily working with journalctl,
where what you usually care about is the systemd unit name.
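In a unit file this looks something like the following; 'local0' is
just an illustrative facility choice:
[Service]
SyslogFacility=local0
SyslogLevel=notice
I believe systemd-run can set the same properties on the fly with
'-p', as in 'systemd-run -p SyslogFacility=local0 -p SyslogLevel=notice
/some/command'.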
(You can use journalctl to select messages by priority
or syslog facility
with the -p and --facility options. You can also select by syslog
identifier
with the -t option. This is probably going to be handy for searching
the journal for messages from some of our programs that use syslog
to report things.)
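For example (the unit name and identifier here are made up):
journalctl -u local-whatever.service -p warning
journalctl -t myscript --since today
journalctl --facility auth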
If you know that you're logging to systemd (or you don't care that
your regular output looks a bit weird in spots), you can also print
messages with special priority markers, as covered in sd-daemon(3).
Now that I know about this, I may put it to use in some of our
scripts and programs. Sadly, unlike the normal Linux logger and its
--prio-prefix option, you can't change the syslog facility this
way, but if you're doing pure journald logging you probably don't
care about that.
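The markers are just a '<N>' priority-number prefix on each line of
output, so in a script whose output is going to the journal you can
do things like:
echo '<3>something went badly wrong'
echo '<7>debugging detail that normally stays hidden'
(3 is 'err' and 7 is 'debug' in the usual syslog numbering.)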
(It's possible that sd-daemon(3) actually supports the logger
behavior of changing the syslog facility too, but if so it's not
documented and you shouldn't count on it. Instead you should assume
that you have to control the syslog facility through setting
SyslogFacility=, which unfortunately means you can't log just
authentication things to 'auth' and everything else to 'daemon' or
some other appropriate facility.)
PS: Unfortunately, as far as I know journalctl has no way to augment
its normal syslog-like output with some additional fields, such as
the priority or the syslog facility. Instead you have to go all the
way to a verbose dump of information in one of the supported
formats for field selection.
The venerable 'logger' command has been around so long it's part
of the Single Unix Specification (really, logger - log messages).
Although syslog(3)
is in 4.2 BSD (along with syslog(8),
the daemon), it doesn't seem to have been until 4.3 BSD that we got
logger(1),
with more or less the same arguments as the POSIX version.
Unfortunately, if you want to do more than throw messages into your
syslog and actually create well-formed, useful syslog messages, 'logger' has some annoyances and
flaws.
The flaw is front and center in the manual page and the POSIX
specification,
if you read the description of the -i option carefully:
-i: Log the process ID of the logger process with each
message.
(Emphasis mine.)
In shell scripts where you want to report the script's activities
to syslog, it's not unusual to want to report more than one thing.
In well-formed syslog messages,
these would all have the same PID, so that you can tell that they
all came from the same invocation of your script. Logger doesn't
support this; if you run logger several times over the course of
your script and use '-i', every log message will have a different
PID. In some environments (such as FreeBSD and Linux with systemd),
logger usually puts in its own PID whether you like it or not.
(The traditional fake for this was to not use '-i' and then embed
your script's PID into your syslog identifier (FreeBSD even recommends
this in their logger(1)
manual page). This worked okay when syslog identifiers were nothing
more than what got stuck on the front of the message in your log
files, but these days it's not necessarily ideal even if your
'logger' environment doesn't add a PID itself. If you're sending
syslog to a log aggregation system, the identifier can be meaningful
and important and you want it to be a constant for a given message
source so you can search on it.)
Since it's a front end to syslog, logger inherits the traditional
syslog issues that you have to select a meaningful syslog facility,
priority, and identifier (traditionally, the basename of your script).
On the positive side, you can easily vary these from message to
message; on the not so great side, you have to supply them for every
logger invocation and it's on you to make sure all of your uses of
logger use the same ones. Logger doesn't insist that you provide
these and it doesn't have any mechanism (such as a set of environment
variables) for you to provide defaults. This was a bigger issue in
the days before shell functions, since these days you can write a
'logit' function for your shell script that invokes logger correctly
(for your environment). This function is also a good place to
automatically embed your script's PID in the logged message (perhaps
as 'pid=... <supplied message>').
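A sketch of such a function, where the 'myscript' identifier and the
'daemon.info' facility and priority are things you'd pick for your
own script:
logit() {
    logger -t myscript -p daemon.info "pid=$$ $*"
}
logit "starting nightly cleanup"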
Out of the three of these, the syslog identifier is the easiest to
do a good job of (since you should be picking a meaningful name for
your script anyway) but the traditional syslog environment makes
the identifier relatively meaningless.
It's possible to send all of the output of your script to syslog,
or with a bunch of work you can send just
standard error to syslog (and perhaps repeat it again). But doing
either of these requires wrapping the body of your script up and
feeding all of it to logger:
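A sketch of the 'wrap the whole body up' approach, with made-up
identifier, facility, and priority:
{
    echo "starting work"
    some-command --some-flag
    echo "all done"
} 2>&1 | logger -t myscript -p daemon.info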
(Everything will have the same facility and priority, but if it's
really important to log things at a different priority you can put
in direct 'logger' invocations in the body of the script.)
I suspect that people who used logger a lot probably wrote a
wrapper script (you could call it 'stderr-to-syslog') and ran all
of the real scripts under it.
All of this adds up to a collection of small annoyances. It's not
impossible to use logger in scripts to push things into syslog, but
generally it has to be relatively important to capture the information.
There's nothing off the shelf that makes it easy. And if you want
to have portable logging for your scripts, this basic logger use
is all you get.
In a recent entry, I said
in passing that the venerable logger utility had some
amount of annoyances associated with it. In order to explain those
annoyances, I need to first talk about what goes into a well-formed,
useful Unix syslog entry in a traditional Unix syslog environment.
(This is 'well-formed' in a social sense, not in a technical sense
of simply conforming to the syslog message format. There are a lot
of ways to produce technically 'correct' syslog messages that are
neither well formed nor useful.)
A well-formed syslog entry is made up from a number of pieces:
A timestamp, the one thing that you don't have to worry about because
your syslog environment should automatically generate it for you.
(Your syslog environment will also assign a hostname, which you
also don't worry about.)
An appropriate syslog facility, chosen from the assorted options
that you generally find listed in your local syslog(3) (the available
facilities vary from Unix to Unix). Your program may
need to log to multiple different facilities depending on what
the messages are about; for example, a network daemon that does
authentication should probably send authentication related messages
to 'auth' or 'authpriv' and general things to 'daemon'.
An appropriate syslog level (aka priority), where you need to at
least distinguish between informational reports ('info'), things
only of interest during debugging problems ('debug', and probably
normally not logged), and active errors that need attention
('error'). Using more levels is useful if they make sense in your
program.
A meaningful and unique identifier ('tag' in logger) that
identifies your program as the source of the syslog entry and
groups all of its syslog entries together. This is normally
expected to be the name of your program or perhaps your system.
All syslog entries from your program should have this identifier.
Your process ID (PID), to uniquely identify this instance of your
program. Your syslog entries should include a PID even if only
one instance of your program is ever running at a time, because
that lets system administrators match your syslog messages up
with other PID-based information and also tell if and when your
program was restarted.
(Under normal circumstances, all messages logged by a single
instance of your program should use the same PID, because that's
how people match up messages to get all of the ones this particular
instance generated.)
The content and importance of your message text should match the
syslog level of the syslog entry; if your text says 'ERROR' but you logged
at level 'info', this isn't really a well-formed syslog entry. This
goes double if you're using a semi-structured message text format,
so that you actually logged 'level=error ...' at level 'info' (or
the other way around).
All of this is in service to letting people find your program's
syslog entries, pick out the important ones, understand them, and
categorize both your syslog entries and syslog entries from other
programs. If a busy sysadmin wants to see an overview of all
authentication activity, they should be able to look at where they're
sending 'auth' logs. If they want to look for problems, they can
look for 'error' or higher priority logs. And the syslog facility
your program uses should be sensible in general, although there
aren't many options these days (and you should probably allow the
local system administrators to pick what facility you normally use,
so they can assign you a unique local one to collect just your logs
somewhere).
A good library or tool for making syslog entries should make it as
easy as possible to create well-formed, useful syslog entries. I
will note in passing that the traditional syslog(3) API is not
ideal for this, because it assumes that your program will log all
entries in a single facility, which is not necessarily true for
programs that do authentication and something else.
(The short summary is that you probably want to use systemd-run with
a specific unit name that you pick.)
Systemd-cat is
very roughly the systemd equivalent of logger. As you'd
expect, things that it puts in the systemd journal flow through to
anywhere that regular journal entries would, including things
that directly get fed from the journal and syslog (including
remote syslog destinations). The
most convenient way to use systemd-cat is to just have it run a
command, at which point it will capture all of the output from the
command and put it in the journal. However, there is a little issue
with using just 'systemd-cat /some/command', which is that the
journal log identifiers that systemd-cat generates in this case
will be the direct name of whatever program produced the output.
If /some/command is a script that runs a variety of programs that
produce output (perhaps it echos some status information itself
then runs a program, which produces output on its own), you'll
get a mixture of identifier names in the resulting log:
Journal logs written by systemd-cat also inherit whatever unit it
was in (a session unit, cron.service, etc), and the combination can
make it hard to clearly see all of the logs from running your script.
To do better you need to give systemd-cat an explicit identifier,
'systemd-cat -t <something> /some/command', at which point everything
is logged with that name, but still in whatever systemd unit
systemd-cat ran in.
Generally you want your script to report all its logs under a single
unit name, so you can find them and sort them out from all of the
other things your system is logging. To do this you need to use
systemd-run with an explicit unit name:
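A minimal sketch, with a made-up unit name:
systemd-run --unit=local-myscript /some/script
Afterward 'journalctl -u local-myscript.service' will find all of its
output in one place.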
I believe you can then hook this into any systemd service unit
infrastructure you want, such as sending email if the unit fails (if you do, you probably want to add
'--service-type=oneshot'). Using systemd-run this way gets you the
best of both systemd-cat worlds; all of the output from /some/script
will be directly labeled with what program produced it, but you can
find it all using the unit name.
Systemd-run will refuse to activate a unit with a name that duplicates
an existing unit, including existing systemd-run units. In many
cases this is a feature for script use, since you basically get
'run only one copy' locking for free (although the error message
is noisy, so you may want to do your own quiet locking). If you
want to always run your program even if another instance is running,
you'll have to generate non-constant unit names (or let systemd-run
do it for you).
Systemd-cat has some features that systemd-run doesn't offer, such
as setting the priority of messages (and setting a different priority
for standard error output). If these features are important to you,
I'd suggest nesting systemd-cat (with no '-t' argument) inside
systemd-run, so you get both the searchable unit name and the
systemd-cat features. If you're already in an environment with a
useful unit name and you just need to divert log messages from
wherever else the environment wants to send them into the system
journal, bare systemd-cat will do the job.
(Arguably this is the case for things run from cron, if you're
content to look for all of them under cron.service (or crond.service,
depending on your Linux distribution). Running things under systemd-cat
puts their output in the journal instead of having them send you
email, which may be good enough and saves you having to invent and
then remember a bunch of unit names.)
Suppose, not hypothetically,
that you have a third party tool that you need to run periodically.
This tool prints things to standard output (or standard error) that
are potentially useful to capture somehow. You want this captured
output to be associated with the program (or your general system
for running the program) and timestamped, and it would be handy if
the log output wound up in all of the usual places in your systems
for output. Unix has traditionally had some solutions for this, such
as logger
for sending things to syslog, but they all have a certain amount of
annoyances associated with them.
(If you directly run your script or program from cron, you will
automatically capture the output in a nice dated form, but you'll
also get email all the time. Let's assume we want a quieter experience
than email from cron, because you don't need to regularly see the
output, you just want it to be available if you go looking.)
On modern Linux systems, the easy and lazy thing to do is to run
your script or program from a systemd service unit, because systemd
will automatically do this for you and send the result into the
systemd journal (and anything that pulls data from that) and, if
configured, into whatever overall systems you have for handling
syslog logs. You want a unit like this:
[Unit]
Description=Local: Do whatever
ConditionFileIsExecutable=/root/do-whatever
[Service]
Type=oneshot
ExecStart=/root/do-whatever
Unlike the usual setup for running scripts as systemd services, we
don't set 'RemainAfterExit=True' because we want to be able to
repeatedly trigger our script with, for example, 'systemctl start
local-whatever.service'. You can even arrange to get email if
this unit (ie, your script) fails.
You can run this directly from cron through suitable /etc/cron.d
files that use 'systemctl start', or set up a systemd timer unit
(possibly with a randomized start time).
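A sketch of such a timer unit (the schedule and delay here are just
examples); by default a timer named local-whatever.timer starts the
matching local-whatever.service:
[Unit]
Description=Local: periodically do whatever
[Timer]
OnCalendar=daily
RandomizedDelaySec=30m
[Install]
WantedBy=timers.target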
The advantage of a systemd timer unit is that you definitely won't
ever get email about this unless you specifically configure it. If
you're setting up a relatively unimportant and throwaway thing,
having it be reliably silent is probably a feature.
(Setting up a systemd timer unit also keeps everything within the
systemd ecosystem rather than worrying about various aspects of
running 'systemctl start' from scripts or crontabs or etc.)
On the one hand, it feels awkward to go all the way to a systemd
service unit simply to get easy to handle logs; it feels like there
should be a better solution somewhere. On the other hand, it works
and it only needs one extra file over what you'd already need (the
.service).
A while back I wrote about how in POSIX you could theoretically
use inode (number) zero. Not all
Unixes consider inode zero to be valid; prominently, OpenBSD's
getdents(2) doesn't return
valid entries with an inode number of 0, and by extension, OpenBSD's
filesystems won't have anything that uses inode zero. However, Linux
is a different beast.
Some Linux filesystems have been known to return valid directory
entries with an inode number of zero, and Go now accepts such entries
when reading directories. This new behavior also puts Go in agreement
with recent glibc.
This fixes issue #76428,
and the issue has a simple reproduction to create something with inode
numbers of zero. According to the bug report:
[...] On a Linux system with libfuse 3.17.1 or later, you can do this
easily with GVFS:
# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big
The resulting filesystem mount is in /run/user/$UID/gvfs (see the
issue for the exact
long path) and can be experimentally verified to have entries with
inode numbers of zero (well, as reported by reading the directory).
On systems using glibc 2.37 and later, you can look at this directory
with 'ls' and see the zero inode numbers.
(Interested parties can try their favorite non-C or non-glibc
bindings to see if those environments correctly handle this case.)
That this requires glibc 2.37 is due to this glibc bug, first
opened in 2010 (but rejected at the time for reasons you can read
in the glibc bug) and then resurfaced in 2016 and eventually
fixed in 2022 (and then again in 2024 for the thread safe version
of readdir). The 2016 glibc issue has a bit
of a discussion about the kernel side. As covered in the Go issue,
libfuse returning a zero inode number may be a bug itself, but there are
(many) versions of libfuse out in the wild that actually do this
today.
Of course, libfuse (and gvfs) may not be the only Linux filesystems
and filesystem environments that can create this effect. I believe
there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might
have the same bug as libfuse does.
(Both Go and Rust have at least one native binding to the kernel
FUSE driver. I haven't looked at either to see what they do about
inode numbers.)
PS: My understanding of the Linux (kernel) situation is that if you
have something inside the kernel that needs an inode number and you
ask the kernel to give you one (through get_next_ino(), an
internal function for this), the kernel will carefully avoid giving
you inode number 0. A lot of things get inode numbers this way, so
this makes life easier for everyone. However, a filesystem can
decide on inode numbers itself, and when it does it can use inode
number 0 (either explicitly or by zeroing out the d_ino field
in the getdents(2) dirent
structs that it returns, which I believe is what's happening in the
libfuse situation).
The X Window System
has a long standing concept called 'visuals'; to simplify, an X
visual determines how your pixel values are turned into colors. As I
wrote about a number of years ago, these days X11 mostly uses
'TrueColor' visuals, which directly supply
8-bit values for red, green, and blue ('24-bit color'). However X11
has a number of visual types, such as
the straightforward PseudoColor indirect colormap (where every
pixel value is an index into an RGB colormap; typically you'd get
8-bit pixels and 24-bit colormaps, so you could have 256 colors out
of a full 24-bit gamut). One of the (now) obscure visual types is
DirectColor. To quote:
For DirectColor, a pixel value is decomposed into separate RGB
subfields, and each subfield separately indexes the colormap for the
corresponding value. The RGB values can be changed dynamically.
In a PseudoColor visual, each pixel's value is taken as a whole and
used as an index into a colormap that gives the RGB values for that
entry. In DirectColor, the pixel value is split apart into three
values, one each for red, green, and blue, and each value indexes
a separate colormap for that color component. Compared to a PseudoColor
visual of the same pixel depth (size, eg each pixel is an 8-bit
byte), you get less possible variety within a single color component
and (I believe) no more colors in total.
[...] maybe it can be implemented as three LUTs in front of a DAC's
inputs or something where the performance impact is minimal? (I'm
not a hardware person.) [...]
I was recently reminded of this old entry and when I reread that
comment, an obvious realization struck me about why DirectColor
might make hardware sense. Back in the days of analog video,
essentially every serious sort of video connection between your
computer and your display carried the red, green, and blue components
separately; you can see this in the VGA connector pinouts, and on old
Unix workstations these might literally be separate wires connected
to separate BNC connectors
on your CRT display.
If you're sending the red, green, and blue signals separately you
might also be generating them separately, with one DAC per color
channel. If you have separate DACs, it might be easier to feed them
from separate LUTs
and separate pixel data, especially back in the days when much of
a Unix workstation's graphics system was implemented in relatively
basic, non-custom chips and components. You can split off the bits
from the raw pixel value with basic hardware and then route each
color channel to its own LUT, DAC, and associated circuits (although
presumably you need to drive them with a common clock).
The other way to look at DirectColor is that it's a more flexible
version of TrueColor. A TrueColor visual is effectively a 24-bit
DirectColor visual where the color mappings for red, green, and
blue are fixed rather than variable (this is in fact how it's
described in the X documentation). Making
these mappings variable costs you only a tiny bit of extra memory
(you need 256 bytes for each color) and might require only a bit
of extra hardware in the color generation process, and it enables
the program using the display to change colors on the fly with small
writes to the colormap rather than large writes to the framebuffer
(which, back in the days, were not necessarily very fast). For
instance, if you're looking at a full screen image and you want to
brighten it, you could simply shift the color values in the colormaps
to raise the low values, rather than recompute and redraw all the
pixels.
These days this is mostly irrelevant and the basic simplicity of
the TrueColor visual has won out. Well, what won out is PC graphics
systems that followed the same basic approach of fixed 24-bit RGB
color, and then X went along with it on PC hardware, which became
more or less the only hardware.
(There probably was hardware with DirectColor support. While X on PC Unixes will probably still
claim to support DirectColor visuals, as reported in things like
xdpyinfo, I suspect that it involves software emulation. Although
these days you could probably implement DirectColor with GPU
shaders at basically no cost.)
Polkit is how a lot
of things on modern Linux systems decide whether or not to let
people do privileged operations, including systemd's run0,
which effectively functions as another su or sudo. Polkit normally
has a significantly different authentication model than su or sudo,
where an arbitrary login can authenticate for privileged operations by
giving the password of any 'administrator' account (accounts in group
wheel or group admin, depending on your Linux distribution).
Suppose, not hypothetically, that you want a su like model in Polkit,
one where people in group 'wheel' can authenticate by providing the root
password, while people not in group 'wheel' cannot authenticate for
privileged operations at all. In my earlier entry on learning about
Polkit and adjusting it I put forward an
untested Polkit stanza to do this. Now I've tested it and I can provide
an actual working version.
polkit.addAdminRule(function(action, subject) {
if (subject.isInGroup("wheel")) {
return ["unix-user:0"];
} else {
// must exist but have a locked password
return ["unix-user:nobody"];
}
});
(This goes in /etc/polkit-1/rules.d/50-default.rules, and the
filename is important because it has to replace the standard version
in /usr/share/polkit-1/rules.d.)
This doesn't quite work the way 'su' does, where it will just refuse
to work for people not in group wheel. Instead, if you're not in
group wheel you'll be prompted for the password of 'nobody' (or
whatever other login you're using), which you can never successfully
supply because the password is locked.
As I've experimentally determined, it doesn't work to return an
empty list ('[]'), or a Unix group that doesn't exist
('unix-group:nosuchgroup'), or a Unix group that exists but has no
members. In all cases my Fedora 42 system falls back to asking for
the root password, which I assume is a built-in default for privileged
authentication. Instead you apparently have to return something that
Polkit thinks it can plausibly use to authenticate the person, even if
that authentication can't succeed. Hopefully Polkit will never get
smart enough to work that out and stop accepting accounts with locked
passwords.
(If you want to be friendly and you expect people on your servers
to run into this a lot, you should probably create a login with a
more useful name and GECOS field, perhaps 'not-allowed' and 'You
cannot authenticate for this operation', that has a locked password.
People may or may not realize what's going on, but at least they
have a chance.)
PS: This is with the Fedora 42 version of Polkit, which is version
126. This appears to be the most recent version from the upstream
project.
Sidebar: Disabling Polkit entirely
Initially I assumed that Polkit had explicit rules somewhere that
authorized the 'root' user. However, as far as I can tell this isn't
true; there's no normal rules that specifically authorize root or
any other UID 0 login name, and despite that root can perform actions
that are restricted to groups that root isn't in. I believe this
means that you can explicitly disable all discretionary Polkit
authorization with an '00-disable.rules' file that contains:
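Something along these lines, using Polkit's standard 'deny' result
(a sketch):
polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});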
Based on experimentation, this disables absolutely everything, even
actions that are considered generally harmless (like libvirt's
'virsh list', which I think normally anyone can do).
A slightly more friendly version can be had by creating a situation
where there are no allowed administrative users. I think this would
be done with a 50-default.rules file that contained:
polkit.addAdminRule(function(action, subject) {
// must exist but have a locked password
return ["unix-user:nobody"];
});
You'd also want to make sure that nobody is in any special groups
that rules in /usr/share/polkit-1/rules.d use to allow automatic
access. You can look for these by grep'ing for 'isInGroup'.
At a high level, Polkit
is how a lot of things on modern Linux systems decide whether or
not to let you do privileged operations. After looking into it a
bit, I've wound up feeling that Polkit
has both good and bad aspects from the perspective of a system
administrator (especially a system administrator with multi-user
Linux systems, where most of the people using them aren't supposed
to have any special privileges). While I've used (desktop) Linuxes
with Polkit for a while and relied on it for a certain amount of
what I was doing, I've done so blindly, effectively as a normal
person. This is the first I've looked at the details of Polkit,
which is why I'm calling this my early reactions.
On the good side, Polkit is a single source of authorization
decisions, much like PAM. On a modern Linux system, there are a
steadily increasing number of programs that do privileged things,
even on servers (such as systemd's run0). These
could all have their own bespoke custom authorization systems, much
as how sudo has its own custom one, but instead most of them have
centralized on Polkit. In theory Polkit gives you a single thing to
look at and a single thing to learn, rather than learning systemd's
authentication system, NetworkManager's authentication system, etc.
It also means that programs have less of a temptation to hard-code
(some of) their authentication rules, because Polkit is very flexible.
(In many cases programs couldn't feasibly use PAM instead, because
they want certain actions to be automatically authorized. For
example, in its standard configuration libvirt wants everyone in
group 'libvirt' to be able to issue libvirt VM management commands
without constantly having to authenticate. PAM could probably be
extended to do this but it would start to get complicated, partly
because PAM configuration files aren't a programming language and
so implementing logic in PAM gets awkward in a hurry.)
On the bad side, Polkit is a non-declarative authorization system,
and a complex one with its rules not in any single place (instead
they're distributed through multiple files in two different formats).
Authorization decisions are normally made in (JavaScript) code,
which means that they can encode essentially arbitrary logic (although
there are standard forms of things). This means that the only way
to know who is authorized to do a particular thing is to read its
XML 'action' file and then look through all of the JavaScript code
to find and then understand things that apply to it.
(Even 'who is authorized' is imprecise by default. Polkit normally
allows anyone to authenticate as any administrative account, provided
that they know its password and possibly other authentication
information. This makes the passwords of people in group wheel or
group admin very dangerous things, since anyone who can get their
hands on one can probably execute any Polkit-protected action.)
This creates a situation where there's no way in Polkit to get a
global overview of who is authorized to do what, or what a particular
person has authorization for, since this doesn't exist in a declarative
form and instead has to be determined on the fly by evaluating code.
Instead you have to know what's customary, like the group that's
'administrative' for your Linux distribution (wheel or admin,
typically) and what special groups (like 'libvirt') do what, or you
have to read and understand all of the JavaScript and XML involved.
In other words, there's no feasible way to audit what Polkit is
allowing people to do on your system. You have to trust that programs
have made sensible decisions in their Polkit configuration (ones
that you agree with), or run the risk of system malfunctions by
turning everything off (or allowing only root to be authorized to
do things).
(Not even Polkit itself can give you visibility into why a decision
was made or fully predict it in advance, because the JavaScript
rules have no pre-filtering to narrow down what they apply to. The
only way you find out what a rule really does is invoking it. Well,
invoking the function that the addRule() or addAdminRule() added to
the rule stack.)
This complexity (and the resulting opacity of authorization) is
probably intrinsic in Polkit's goals. I even think they made the
right decision by having you write logic in JavaScript rather than
try to create their own language for it. However, I do wish Polkit
had a declarative subset that could express all of the simple cases,
reserving JavaScript rules only for complex ones. I think this would
make the overall system much easier for system administrators to
understand and analyze, so we had a much better idea (and much
better control) over who was authorized for what.
Polkit (also, also) is a multi-faceted
user level thing used to control access to privileged operations.
It's probably used by various D-Bus services on your system, which
you can more or less get a list of with pkaction, and
there's a pkexec program
that's like su and sudo. There are two reasons that you might
care about Polkit on your system. First, there might be tools you
want to use that use Polkit, such as systemd's run0 (which
is developing some interesting options). The other is
that Polkit gives people an alternate way to get access to root or
other privileges on your servers and you may have opinions about
that and what authentication should be required.
Unfortunately, Polkit configuration is arcane and as far as I know,
there aren't really any readily accessible options for it. For
instance, if you want to force people to authenticate for root-level
things using the root password instead of their password, as far
as I know you're going to have to write some JavaScript yourself
to define a suitable Administrator identity rule. The
polkit manual page seems
to document what you can put in the code reasonably well, but I'm
not sure how you test your new rules and some areas seem underdocumented
(for example, it's not clear how 'addAdminRule()' can be used to
say that the current user cannot authenticate as an administrative
user at all).
(If and when I wind up needing to test rules, I will probably try
to do it in a scratch virtual machine that I can blow up. Fortunately
Polkit is never likely to be my only way to authenticate things.)
Polkit also has some paper cuts in its current setup. For example,
as far as I can see there's no easy way to tell Polkit-using programs
that you want to immediately authenticate for administrative access
as yourself, rather than be offered a menu of people in group wheel
(yourself included) and having to pick yourself. It's also not clear
to me (and I lack a test system) if the default setup blocks people
who aren't in group wheel (or group admin, depending on your Linux
distribution flavour) from administrative authentication or if
instead they get to pick authenticating using one of your passwords.
I suspect it's the latter.
(All of this makes Polkit seem like it's not really built for
multi-user Linux systems, or at least multi-user systems where not
everyone is an administrator.)
PS: Now that I've looked at it, I have some issues with Polkit from
the perspective of a system administrator, but those are going to be
for another entry.
Sidebar: Some options for Polkit (root) authentication
If you want everyone to authenticate as root for administrative
actions, I think what you want is:
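That is, a sketch along the lines of the group version below, minus
the group check:
polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});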
If you want to restrict this to people in group wheel, I think
you want something like:
polkit.addAdminRule(function(action, subject) {
if (subject.isInGroup("wheel")) {
return ["unix-user:0"];
} else {
// might not work to say 'no'?
return [];
}
});
If you want people in group wheel to authenticate as themselves,
not root, I think you return 'unix-user:' + subject.user instead
of 'unix-user:0'. I don't know if people still get prompted by
Polkit to pick a user if there's only one possible user.
We've been moving from OpenBSD to FreeBSD for firewalls. One advantage of this is giving
us a mirrored ZFS pool for the machine's filesystems; we have a lot of experience operating ZFS
and it's a simple, reliable, and fully supported way of getting
mirrored system disks on important machines. ZFS has checksums and
you want to periodically 'scrub' your ZFS pools to verify all of
your data (in all of its copies) through these checksums (ideally
relatively frequently). All
of this is part of basic ZFS knowledge, so I was a little bit
surprised to discover that none of our FreeBSD machines had ever
scrubbed their root pools, despite some of them having been running
for months.
It turns out that while FreeBSD comes with a configuration option
to do periodic ZFS scrubs, the option isn't enabled by default (as
of FreeBSD 14.3). Instead you have to know to enable it, which
admittedly isn't too hard to find once you start looking.
FreeBSD has a general periodic(8) system for
triggering things on a daily, weekly, monthly, or other basis. As
covered in the manual page, the default configuration for this is
in /etc/defaults/periodic.conf and you can override things by
creating or modifying /etc/periodic.conf. ZFS scrubs are a 'daily'
periodic setting, and as of 14.3 the basic thing you want is an
/etc/periodic.conf with:
# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
FreeBSD will normally scrub each pool a certain number of days after
its previous scrub (either a manual scrub or an automatic scrub
through the periodic system). The default number of days is 35, which
is a bit high for my tastes, so I suggest that you shorten it, making
your periodic.conf stanza be:
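I believe the knob for this is 'daily_scrub_zfs_default_threshold'
(check your /etc/defaults/periodic.conf), so something like this,
where 14 days is just my example choice:
# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
# Days between automatic scrubs (the shipped default is 35)
daily_scrub_zfs_default_threshold="14"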
There are other options you can set that are covered in
/etc/defaults/periodic.conf.
(That the daily automatic scrubs happen some number of days after
the pool was last scrubbed means that you can adjust their timing
by doing a manual scrub. If you have a bunch of machines that you
set up at the same time, you can get them to space out their scrubs
by scrubbing one a day by hand, and so on.)
Looking at the other ZFS periodic options, I might also enable the
daily ZFS status report, because I'm not certain if there's anything
else that will alert you if or when ZFS starts reporting errors:
# Find out about ZFS errors?
daily_status_zfs_enable="YES"
You can also tell ZFS to TRIM your SSDs every day. As far as I can
see there's no option to do the TRIM less often than once a day; I
guess if you want that you have to create your own weekly or monthly
periodic script (perhaps by copying the 801.trim-zfs daily script
and modifying it appropriately). Or you can just do 'zpool trim
...' every so often by hand.
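If you do want the daily TRIM, I believe the periodic.conf knob is:
# TRIM SSDs in the daily periodic run
daily_trim_zfs_enable="YES"
(As always, check /etc/defaults/periodic.conf for the details.)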
Every so often I get to be surprised about some Unix thing. Today's
surprise is the actual behavior of '#!' in practice on at least
Linux, FreeBSD, and OpenBSD, which I learned about from a comment
by Aristotle Pagaltzis on my entry
on (not) using '#!/usr/bin/env'. I'll quote
the starting part here:
In fact the shebang line doesn't require absolute paths, you can
use relative paths too. The path is simply resolved from your current
directory, just as any other path would be - the kernel simply
doesn't do anything special for shebang line paths at all. [...]
I found this so surprising that I tested it on our Linux servers
as well as a FreeBSD and an OpenBSD machine. On the Linux servers
(and probably on the others too), the kernel really does accept the
full collection of relative paths in '#!'. You can write '#!python3',
'#!bin/python3', '#!../python3', '#!../../../usr/bin/python3', and so
on, and provided that your current directory is in the right place in
the filesystem, they all worked.
(On FreeBSD and OpenBSD I only tested the '#!python3' case.)
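If you want to see this for yourself, here's a quick sketch (the
file name and locations are arbitrary, and it assumes /usr/bin/python3
exists):
printf '#!python3\nimport sys\nprint(sys.executable)\n' >/tmp/relshebang
chmod +x /tmp/relshebang
cd /usr/bin && /tmp/relshebang    # works; 'python3' resolves relative to /usr/bin
cd /tmp && ./relshebang           # fails, unless you happen to have a /tmp/python3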
As far as I can tell, this behavior goes all the way back to 4.2
BSD (which isn't quite the origin point of '#!' support in the
Unix kernel but is about as close as we can
get). The execve() kernel implementation in sys/kern_exec.c
finds the program from your '#!' line with a namei() call that uses
the same arguments (apart from the name) as it did to find the
initial executable, and that initial executable can definitely be
a relative path.
Although this is probably the easiest way to implement '#!' inside
the kernel, I'm a little bit surprised that it survived in Linux
(in a completely independent implementation) and in OpenBSD (where
the security people might have had a double-take at some point). But
given Hyrum's Law there are probably
people out there who are depending on this behavior so we're now
stuck with it.
(In the kernel, you'd have to go at least a little bit out of your
way to check that the new path starts with a '/' or use a kernel
name lookup function that only resolves absolute paths. Using a
general name lookup function that accepts both absolute and relative
paths is the simplest approach.)
PS: I don't have access to Illumos based systems, other BSDs (NetBSD,
etc), or macOS, but I'd be surprised if they had different behavior.
People with access to less mainstream Unixes (including commercial
ones like AIX) can give it a try to see if there are any Unixes
that don't support relative paths in '#!'.
This is my face when I have quite a few binaries in /usr/sbin on my
office Fedora desktop that aren't owned by any package. Presumably
they were once owned by packages, but the packages got removed without
the files being removed with them, which isn't supposed to happen.
(My office Fedora install has been around for almost 20 years now
without being reinstalled, so things have had time to happen. But some
of these binaries date from 2021.)
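(Finding these is easy enough; something like this lists anything in
/usr/sbin that rpm doesn't think belongs to a package:
for f in /usr/sbin/*; do rpm -qf "$f" >/dev/null 2>&1 || echo "$f"; done
)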
There seem to be two sorts of these lingering, unowned /usr/sbin
programs. One sort, such as /usr/sbin/getcaps, seems to have been
left behind when its package moved things to /usr/bin, possibly due
to this RPM bug
(via). The
other sort is genuinely unowned programs dating to anywhere from
2007 (at the oldest) to 2021 (at the newest), which have nothing
else left of them sitting around. The newest programs are what I
believe are wireless management programs: iwconfig, iwevent, iwgetid,
iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was
also part of a 'wireless-tools' package). I had the wireless-tools
package installed on my office desktop until recently, but I removed
it some time during Fedora 40, probably sparked by the /sbin to
/usr/sbin migration, and it's possible that binaries didn't get
cleaned up properly due to that migration.
The most interesting orphan is /usr/sbin/sln, dating from 2018,
when apparently various people discovered it as an orphan on their
system. Unlike all the other orphan programs, the sln manual page
is still shipped as part of the standard 'man-pages' package and
so you can read sln(8) online.
Based on the manual page, it sounds like it may have been part
of glibc at one point.
(Another orphaned program from 2018 is pam_tally,
although it's coupled to pam_tally2.so, which did get removed.)
I don't know if there's any good way to get mappings from files to
RPM packages for old Fedora versions. If there is, I'd certainly
pick through it to try to find where various of these files came
from originally. Unfortunately I suspect that for sufficiently old
Fedora versions, much of this information is either offline or can't
be processed by modern versions of things like dnf.
(The basic information is used by eg 'dnf provides' and can be built
by hand from the raw RPMs, but I have no desire to download all of
the RPMs for decade-old Fedora versions even if they're still
available somewhere. I'm curious but not that curious.)
PS: At the moment I'm inclined to leave everything as it is until
at least Fedora 43, since RPM bugs are still being sorted out here.
I'll have to clean up genuinely orphaned files at some point but I
don't think there's any rush. And I'm not removing any more old
packages that use '/sbin/<whatever>', since that seems like it has
some bugs.
In the decades-long process of getting my fvwm config JUST RIGHT, my
xterm right-click menu now has a "duplicate" command, which opens
a new xterm with the same geometry, on the same node, IN THE SAME
DIRECTORY. (Directory info acquired via /proc.)
I have a long-standing shell function in my shell that attempts to
do this (imaginatively called 'spawn'), but this is only available
in environments where my shell is set up, so I was quite interested
in the whole area and did some experiments. The good news is that
xterm's 'spawn-new-terminal' works, in that it will start a new
xterm and the new xterm will be in the right directory. The bad
news for me is that that's about all that it will do, and in my
environment this has two limitations that will probably make it
not something I use a lot.
The first limitation is that this starts an xterm that doesn't copy
the command line state or settings of the parent xterm. If you've
set special options on the parent xterm (for example, you like your
root xterms to have a red foreground), this won't be carried over
to the new xterm. Similarly, if you've increased (or decreased) the
font size in your current xterm or otherwise changed its settings,
spawn-new-terminal doesn't duplicate these; you get a default xterm.
This is reasonable but disappointing.
(While spawn-new-terminal takes arguments that I believe it will
pass to the new xterm, as far as I know there's no way to retrieve
the current xterm's command line arguments to insert them here.)
The larger limitation for me is that when I'm at home, I'm often
running SSH inside of an xterm in order to log in to some other
system (I have a 'sshterm' script to automate all the aspects of
this). What I really want when I 'duplicate' such an xterm is not
a copy of the local xterm running a local shell (or even starting
another SSH to the remote system), but the remote (shell) context,
with the same (remote) current directory and so on. This is impossible
to get in general and difficult to set up even for situations where
it's theoretically possible. To use spawn-new-terminal effectively,
you basically need either all local xterms or copious use of remote
X forwarded over SSH (where the xterm is running on the remote
system, so a duplicate of it will be as well and can get the right
current directory).
Going through this experience has given me some ideas on how to
improve the situation overall. Probably I should write a 'spawn'
shell script to replace or augment my 'spawn' shell function so I
can readily have it in more places. Then when I'm ssh'd in to a
system, I can make the 'spawn' script at least print out a command
line or two for me to copy and paste to get set up again.
(Two command lines is the easiest approach, with one command that
starts the right xterm plus SSH combination and the other a 'cd'
to the right place that I'd execute in the new logged in window.
It's probably possible to combine these into an all-in-one script
but that starts to get too clever in various ways, especially as
SSH has no straightforward way to pass extra information to a login
shell.)
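(A minimal sketch of the printing version, with 'sshterm' being my
existing script and everything else made up for illustration:
#!/bin/sh
# spawn: print commands that recreate this shell's context in a new xterm
echo "sshterm $(hostname -s)"
echo "cd $(pwd)"
)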
The root issue appears to be that when I removed the
selinux-policy-targeted package, I probably should have edited
/etc/selinux/config to set SELINUXTYPE to some bogus value, not
left it set to "targeted". For entirely sensible reasons, various
packages have postinstall scripts that assume that if your SELinux
configuration says your SELinux type is 'targeted', they can do
things that implicitly or explicitly require things from the package
or from the selinux-policy package, which got removed when I removed
selinux-policy-targeted.
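(Concretely, what I should have left in /etc/selinux/config is something
like this; the exact bogus value doesn't matter as long as it isn't the
name of a real policy that packages will try to act on:
SELINUX=disabled
SELINUXTYPE=removed
)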
I'm not sure if my change to SELINUXTYPE will completely fix
things, because I suspect that there are other assumptions about
SELinux policy programs and data files being present lurking in
standard, still-installed package tools and so on. Some of these
standard SELinux related packages definitely can't be removed without
gutting Fedora of things that are important to me, so I'll either
have to live with periodic failures of postinstall scripts or put
selinux-policy-targeted and some other bits back. On the whole,
reinstalling selinux-policy-targeted is probably the safest course;
the issue that caused me to remove it only applies during Fedora
version upgrades and might be
fixed in Fedora 42 anyway.
What this illustrates to me is that regardless of package dependencies,
SELinux is not really optional on Fedora. The Fedora environment
assumes that a functioning SELinux environment is there and if it
isn't, things are likely to go wrong. I can't blame Fedora for this,
or for not fully capturing this in package dependencies (and Fedora
did protect the selinux-policy-targeted package from being removed;
I overrode that by hand, so what happens afterward is on me).
(Although I haven't checked modern versions of Fedora, I suspect
that there's no official way to install Fedora without getting a
SELinux policy package installed, and possibly selinux-policy-targeted
specifically.)
PS: I still plan to temporarily remove selinux-policy-targeted when
I upgrade my home desktop to Fedora 42. A few package postinstall
glitches are better than not being able to read DNF output due to
the package's spam.
In the recent referendum on the voluntary ending of life we can see an attempt to finally carry out the program of the French revolution. In the times before the bourgeois revolution, the churches in Europe held a monopoly over the course of life. Births, marriages, death, the main stations of life: the church governed them with its sacraments. The bourgeois state in time took over the registration of births and marriages; by nationalizing control over the population it freed people from the church's dominion. With free decision-making about bearing children it secured free management of life, from birth to death. But not of death itself! The referendum of 23 November could have freed us on this point as well, the last stronghold of church power. But it did not!
Why did citizens not want to shake off this last medieval yoke? The reasons for voting "against" were varied, the experts instruct us. The ideology of the "sanctity of life" and obedience to the papal church are obvious reasons. But more important for us is the reason of those who thought the law was bad. The law really did greatly complicate the voluntary ending of life. The more legal procedures are complicated, the greater the likelihood that the possibilities for evading the law will also grow, multiplying "legal loopholes".
In general, the bourgeois legal state can "liberate" individuals only with the help of the "rule of law", that is, of legal fetishism. But that is not the emancipation of the human being. Karl Marx published a critique of the French revolution as early as 1844 in On the Jewish Question. The French revolution, he wrote, liberates the individual only politically and thereby shatters society into atomized individuals who are connected only by law.
Put simply: the bourgeois revolution introduces the war of all against all within the limits of the laws of the bourgeois legal state. Among other things, this is also the foundation of the "free" labor contract between the proletarian and the capitalist, and that contract is the basis of capitalist exploitation. Bourgeois legal fetishism does not guarantee free coexistence in a society of solidarity.
Many who voted "against" probably also did so to protest against the policies of the current government. But one could vote against the government for various reasons: for example, because it did not establish a solid public health system, or because it did not finish privatizing health care. That is, for completely opposite reasons.
From the fact that different, even opposing reasons lead to the same decision in a referendum, we can read off the general limitation of the bourgeois political system. The law was drafted by a party, supported by the coalition parties, and adopted by a party parliament. While the law was being prepared, the public discussed the question of the voluntary ending of life somewhat, though not very intently. But how much of that discussion was taken into account in writing the law was decided by the parties.
Before the referendum the debate about the law was admittedly lively, but it could no longer influence the law. The ideas presented in the debate before the referendum could not be used creatively. They were trapped in the meager choice of being for or against the law as the parties had framed it. Party democracy cannot draw on all of society's intellectual powers. In elections, too, we can only decide for or against party lists. In the end we settle for the least bad option. Bourgeois democracy is not democratic.
So we are again facing the old question: must the bourgeois revolution first be carried through to the end (human rights, the rule of law, bourgeois parliamentarism), or is the socialist revolution on the agenda, so that we must fight for the abolition of classes and exploitation, for the socialization of the means of production and of decision-making, for coexistence in solidarity?
The promises of the bourgeois revolution cannot be fulfilled under capitalism. Without unpaid work in the household, capitalism would collapse. In every period it has needed unfree labor, from "traditional" relations in the colonies to slavery in the American south and migrant labor today.
Tinkering with capitalism would therefore only leave us stamping in place in a historical dead end. The socialist revolution is on the agenda. Especially for us, who practiced it until recently.
This guest column was written by Rastko Močnik.
GUEST CONTRIBUTION // RP is an open platform and publishes contributions from authors that touch on progressive struggles and questions.
Once upon a time, Unix filesystem mounts worked by putting one
inode on top of another, and this
was also how they worked in very early Linux. It wasn't wrong to
say that mounts were really about inodes, with the names only being
used to find the inodes. This is no longer how things work in Linux
(and perhaps other Unixes, but Linux is what I'm most familiar with
for this). Today, I believe that filesystem mounts in Linux are
best understood as namespace operations.
Each separate (unmounted) filesystem is a tree of names (a
namespace). At a broad level, filesystem mounts in Linux take some
name from that filesystem tree and project it on top of something
in an existing namespace,
generally with some properties attached to the projection. A regular
conventional mount takes the root name of the new filesystem and
puts the whole tree somewhere, but for a long time Linux's bind
mounts took some other name in the filesystem as their starting
point (what we could call the root inode of the
mount). In modern Linux, there can also be multiple mount namespaces in
existence at one time, with different contents and properties. A
filesystem mount does not necessarily appear in all of them, and
different things can be mounted at the same spot in the tree of
names in different mount namespaces.
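(To make this concrete, here are both sorts of operations as commands;
the paths are just examples, and the second one needs root or a user
namespace:
mount --bind /usr/share /mnt/share
unshare --mount sh -c 'mount -t tmpfs tmpfs /mnt; ls /mnt'
The bind mount projects an existing name somewhere else in the tree;
the unshare version makes a mount that's only visible inside its own
private mount namespace.)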
(Some mount properties are still global to the filesystem as a
whole, while other mount properties are specific to a particular
mount. See mount(2)
for a discussion of general mount properties. I don't know if there's
a mechanism to handle filesystem specific mount properties on a per
mount basis.)
This can't really be implemented with an inode-based view of mounts.
You can somewhat implement traditional Linux bind mounts with an
inode based approach, but mount namespaces have to be separate from
the underlying inodes. At a minimum a mount point must be a pair
of 'this inode in this namespace has something on top of it', instead
of just 'this inode has something on top of it'.
(A pure inode based approach has problems going up the directory
tree even in old bind mounts, because the parent directory of a
particular directory depends on how you got to the directory. If
/usr/share is part of /usr and you bind mounted /usr/share to /a/b,
the value of '..' depends on if you're looking at '/usr/share/..'
or '/a/b/..', even though /usr/share and /a/b are the same inode
in the /usr filesystem.)
If I'm reading manual pages correctly, Linux still normally requires
the initial mount of any particular filesystem be of its root name
(its true root inode). Only after that initial mount is made can
you make bind mounts to pull out some subset of its tree of names
and then unmount the original full filesystem mount. I believe that
a particular filesystem can provide ways to sidestep this with a
filesystem specific mount option, such as btrfs's subvol= mount
option that's covered in the btrfs(5) manual page (or 'btrfs
subvolume set-default').
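(For instance, with btrfs you can make the initial mount be just a
subvolume; the device and subvolume names here are made up:
mount -o subvol=home /dev/sdb2 /home
)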
And are there any circumstances where a mount can be done without a
pre-existing mount point (i.e. a mount point appears out of thin air)?
I think there is one answer for why requiring an existing mount point
is a good idea in general (and why doing otherwise would be complex),
although you can argue about it, and then a second, historical answer
based on how mount points were initially implemented.
The general problem is directory listings. We obviously want and
need mount points to appear in readdir() results, but in the kernel,
directory listings are historically the responsibility of filesystems
and are generated and returned in pieces on the fly (which is clearly
necessary if you have a giant directory; the kernel doesn't read
the entire thing into memory and then start giving your program
slices out of it as you ask). If mount points never appear in the
underlying directory, then they must be inserted at some point in
this process. If mount points can sometimes exist and sometimes
not, it's worse; you need to somehow keep track of which ones
actually exist and then add the ones that don't at the end of the
directory listing. The simplest way to make sure that mount points
always exist in directory listings is to require them to have an
existence in the underlying filesystem.
The historical answer is that in early versions of Unix, filesystems
were actually mounted on top of inodes, not directories (or filesystem
objects). When you passed a (directory) path to the mount(2) system
call, all it was used for was getting the corresponding inode, which
was then flagged as '(this) inode is mounted on' and linked (sort
of) to the new mounted filesystem on top of it. All of the things
that dealt with mount points and mounted filesystems did so by inode
and inode number, with no further use of the paths and the root
inode of the mounted filesystem being quietly substituted for the
mounted-on inode. All of the mechanics of this needed the inode and
directory entry for the name to actually exist (and V7 required the
name to be a directory).
I don't think modern kernels (Linux or otherwise) still use this
approach to handling mounts, but I believe it lingered on for quite
a while. And it's a sufficiently obvious and attractive implementation
choice that early versions of Linux also used it (see the Linux
0.96c version of iget() in fs/inode.c).
Sidebar: The details of how mounts worked in V7
When you passed a path to the mount(2) system call (called 'smount()'
in sys/sys3.c), it
used the name to get the inode and then set the IMOUNT flag from
sys/h/inode.h
on it (and put the mount details in a fixed size array of mounts,
which wasn't very big). When
iget() in sys/iget.c was
fetching inodes for you and you'd asked for an IMOUNT inode, it
gave you the root inode of the filesystem instead, which worked in
cooperation with name lookup in a directory (the name lookup in the
directory would find the underlying inode number, and then iget()
would turn it into the mounted filesystem's root inode). This gave
Research Unix a simple, low code approach to finding and checking
for mount points, at the cost of pinning a few more inodes into
memory (not necessarily a small thing when even a big V7 system
only had at most 200 inodes in memory at once, but then a big V7
system was limited to 8 mounts, see h/param.h).
I've mentioned this before in passing (cf,
also) but today I feel like saying it
explicitly: our habit with all
of our machines is to never apply a kernel update without immediately
rebooting the machine into the new kernel. On our Ubuntu machines
this is done by holding the relevant kernel packages; on my Fedora
desktops I normally run 'dnf update --exclude "kernel*"' unless I'm
willing to reboot on the spot.
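(On Ubuntu the holding is just apt-mark; the exact kernel package names
depend on what flavour you run, so these are illustrative:
apt-mark hold linux-image-generic linux-headers-generic
apt-mark unhold linux-image-generic linux-headers-generic   # when we're ready to update and reboot
)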
The obvious reason for this is that we want to switch to the new
kernel under controlled, attended conditions when we'll be able to
take immediate action if something is wrong, rather than possibly
have the new kernel activate at some random time without us present
and paying attention if there's a power failure, a kernel panic,
or whatever. This is especially acute on my desktops, where I use
ZFS by building my own OpenZFS packages
and kernel modules. If something goes wrong and the kernel modules
don't load or don't work right, an unattended reboot can leave my
desktops completely unusable and off the network until I can get
to them. I'd rather avoid that if possible (sometimes it isn't).
(In general I prefer to reboot my Fedora machines with me present
because weird things happen from time to time and sometimes I
make mistakes, also.)
The less obvious reason is that when you reboot a machine right
after applying a kernel update, it's clear in your mind that the
machine has switched to a new kernel. If there are system problems
in the days immediately after the update, you're relatively
likely to remember this and at least consider the possibility that
the new kernel is involved. If you apply a kernel update, walk away
without rebooting, and the machine reboots a week and a half later
for some unrelated reason, you may not remember that one of the
things the reboot did was switch to a new kernel.
(Kernels aren't the only thing that this can happen with, since not
all system updates and changes take effect immediately when made or
applied. Perhaps one should reboot after making them, too.)
I'm assuming here that your Linux distribution's package management
system is sensible, so there's no risk of losing old kernels
(especially the one you're currently running) merely because you
installed some new ones but didn't reboot into them. This is how
Debian and Ubuntu behave (if you don't 'apt autoremove' kernels),
but not quite how Fedora's dnf does it (as far as I know). Fedora
dnf keeps the N most recent kernels around and probably doesn't let
you remove the currently running kernel even if it's more than N
kernels old, but I don't believe it tracks whether or not you've
rebooted into those N kernels and stretches the N out if you haven't
(or removes more recent installed kernels that you've never rebooted
into, instead of older kernels that you did use at one point).
PS: Of course if kernel updates were perfect this wouldn't matter.
However this isn't something you can assume for the Linux kernel
(especially as patched by your distribution), as we've sometimes
seen. Although big issues like that are
relatively uncommon.
A few years ago I wrote about the divide in chown() about who got
to give away files, where BSD and V7 were
on one side, restricting it to root, while System III and System V
were on the other, allowing the owner to give them away too. At the
time I quoted the V7 chown(2)
explanation of this:
[...] Only the super-user may execute this call, because if users
were able to give files away, they could defeat the (nonexistent)
file-space accounting procedures.
Recently, for reasons,
chown(2) and its history was on my mind and so I wondered if the early
Research Unixes had always had this, or if a restriction was added at
some point.
(Since I looked it up, the restriction on chown()'ing setuid files
was lifted in V4. In V4 and later, a setuid file has its setuid bit
removed on chown; in V3 you still can't give away such a file,
according to the V3 chown(2) manual page.)
At this point you might wonder where the System III
and System V unrestricted chown came from. The surprising to me
answer seems to be that System III partly descends from PWB/UNIX, and PWB/UNIX 1.0, although
it was theoretically based on V6, has pre-V6 chown(2) behavior
(kernel source,
manual page). I
suspect that there's a story both to why V6 made chown() more
restricted and also why PWB/UNIX specifically didn't take that
change from V6, but I don't know if it's been documented anywhere
(a casual Internet search didn't turn up anything).
(The System III chown(2) manual page
says more or less the same thing as the PWB/UNIX manual page, just
more formally, and the kernel code is very similar.)
What's common in both cases is that NFS servers and OverlayFS both
must create an 'identity' for a file (a NFS filehandle and an inode number, respectively). In the
case of NFS servers, this identity has some strict requirements;
OverlayFS has a somewhat easier life, but in general it still has
to create and track some amount of information. Based on reading
the OverlayFS article,
I believe that OverlayFS considers this expensive enough to only
want to do it when it has to.
OverlayFS definitely needs to go to this effort when people call
stat(), because various programs will directly use the inode number
(the POSIX 'file serial number') to tell files on the same filesystem
apart. POSIX technically requires OverlayFS to do this for readdir(),
but in practice almost everyone that uses readdir() isn't going to
look at the inode number; they look at the file name and perhaps
the d_type field to spot directories without needing to stat()
everything.
If there was a special 'not a valid inode number' signal value,
OverlayFS might use that, but there isn't one (in either POSIX or
Linux, which is actually a problem). Since OverlayFS needs to provide
some sort of arguably valid inode number, and since it's reading
directories from the underlying filesystems, passing through their
inode numbers from their d_ino fields is the simple answer.
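(You can see this from user level with something like the following,
where '/merged/dir' stands in for a directory on your overlayfs mount;
I believe plain 'ls -i' normally reports d_ino straight from readdir()
when it doesn't need anything else, while stat() takes the expensive
path, so on overlayfs the two can disagree:
ls -1i /merged/dir
stat -c '%i %n' /merged/dir/*
)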
Sidebar: Why there should be a 'not a valid inode number' signal value
Because both standards and common Unix usage include a d_ino
field in the structure readdir() returns, they embed the idea that
the stat()-visible inode number can easily be recovered or generated
by filesystems purely by reading directories, without needing to
perform additional IO. This is true in traditional Unix filesystems,
but it's not obvious that you would do that all of the time in all
filesystems. The on disk format of directories might only have some
sort of object identifier for each name that's not easily mapped
to a relatively small 'inode number' (which is required to be some
C integer type), and instead the 'inode number' is an attribute you
get by reading file metadata based on that object identifier (which
you'll do for stat() but would like to avoid for reading directories).
But in practice if you want to design a Unix filesystem that performs
decently well and doesn't just make up inode numbers in readdir(),
you must store a potentially duplicate copy of your 'inode numbers'
in directory entries.
Suppose, not hypothetically, that your system is running some systemd
based service or daemon that resets or erases your carefully cultivated
state when it restarts. One example is systemd-networkd, although you can turn that off (or
parts of it off, at least), but there are likely others. To clean
up after this happens, you'd like to automatically restart or redo
something after a systemd unit is restarted. Systemd supports this,
but I found it slightly unclear how you want to do this and today
I poked at it, so it's time for notes.
First, you need to put whatever you want to do into a script and a
.service unit that will run the script. The traditional way to run
a script through a .service unit is:
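# (a sketch; the description and script path are whatever yours are)
[Unit]
Description=Redo my network settings after systemd-networkd restarts

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/redo-network-state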
To get this unit to run after another unit is started or restarted,
what you need is PartOf=,
which causes your unit to be stopped and started when the other
unit is, along with 'After=' so that your unit starts after the
other unit instead of racing it (which could be counterproductive
when what you want to do is fix up something from the other unit).
So you add:
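# (using systemd-networkd as the example other unit)
[Unit]
PartOf=systemd-networkd.service
After=systemd-networkd.service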
(This is what works for me in light testing. This assumes that
the unit you want to re-run after is normally always running,
as systemd-networkd is.)
In testing, you don't need to have your unit specifically enabled
by itself, although you may want it to be for clarity and other
reasons. Even if your unit isn't specifically enabled, systemd will
start it after the other unit because of the PartOf=. If the other
unit is started all of the time (as is usually the case for
systemd-networkd), this effectively makes your unit enabled, although
not in an obvious way (which is why I think you should specifically
'systemctl enable' it, to make it obvious). I think you can have
your .service unit enabled and active without having the other unit
enabled, or even present.
You can declare yourself PartOf a .target unit, and some stock
package systemd units do for various services. And a .target unit
can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf
sshd.service in a surprisingly clever little arrangement to generate
only the necessary keys through a templated 'sshd-keygen@.service'
unit.
I admit that the whole collection of Wants=, Requires=, Requisite=,
BindsTo=,
PartOf=, Upholds=, and so on are somewhat confusing to me. In the
past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely
right in this entry.
Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped,
yours will be too. In a simple 'run a script after the other unit
starts' situation, stopping your unit does nothing and can be
ignored.
(If this seems complicated, well, I think it is, and I think one
part of the complication is that we're trying to use systemd as
an event-based system when it isn't one.)
A while ago I wrote an entry about things that resolved wasn't
for as of systemd 251. One of those things
was arbitrary mappings of (DNS) names to DNS servers, for example
if you always wanted *.internal.example.org to query a special
DNS server. Systemd-resolved
didn't have a direct feature for this and attempting to attach your
DNS names to DNS server mappings to a network interface could go
wrong in various ways. Well, time marches on and as of systemd
v258 this
is no longer the state of affairs.
Systemd v258 introduces systemd.dns-delegate
files, which allow you to map DNS names to DNS servers independently
from network interfaces. The release notes describe
this as:
A new DNS "delegate zone" concept has been introduced, which are
additional lookup scopes (on top of the existing per-interface
and the one global scope so far supported in resolved), which
carry one or more DNS server addresses and a DNS search/routing
domain. It allows routing requests to specific domains to specific
servers. Delegate zones can be configured via drop-ins below
/etc/systemd/dns-delegate.d/*.dns-delegate.
Since systemd v258 is very new I don't have any machines where I
can actually try this out, but based on the systemd.dns-delegate
documentation, you can use this both for domains that you merely
want diverted to some DNS server and also domains that you also
want on your search path. Per resolved.conf's Domains=
documentation, the latter is 'Domains=example.org' (example.org
will be one of the domains that resolved tries to find single-label
hostnames in, a search domain), and the former is 'Domains=~example.org'
(where we merely send queries for everything under 'example.org'
off to whatever DNS=
you set, a route-only domain).
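(Based on my reading of the documentation, and untested since I don't
have a v258 machine, a delegate drop-in would look something like this;
the file name, the server IP, and the domain are all made up:
# /etc/systemd/dns-delegate.d/internal.dns-delegate
[Delegate]
DNS=192.0.2.53
Domains=~internal.example.org
)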
(While resolved.conf's Domains=
officially promises to check your search domains in the order you
listed them, I believe this is strictly for a single 'Domains='
setting for a single interface. If you have multiple 'Domains='
settings, for example in a global resolved.conf, a network interface,
and now in a delegation, I think systemd-resolved makes no promises.)
Right now, these DNS server delegations can only be set through
static files, not manipulated through resolvectl.
I believe fiddling with them through resolvectl is on the roadmap, but
for now I guess we get to restart resolved if we need to change things.
In fact resolvectl doesn't expose anything to do with them, although I
believe read-only information is available via D-Bus and maybe varlink.
Given the timing of systemd v258's release relative to Fedora
releases, I probably won't be able to use this feature until Fedora
44 in the spring (Fedora 42 is current and Fedora 43 is imminent,
which won't have systemd v258 given that v258 was released only a
couple of weeks ago). My current systemd-resolved setup is okay
(if it wasn't I'd be doing something else), but I can probably find
uses for these delegations to improve it.
Bash's fallback getcwd() assumes that the inode [number] from stat()
matches one returned by readdir(). OverlayFS breaks that assumption.
I wouldn't call this an 'assumption' so much as 'sane POSIX semantics',
although I'm not sure that POSIX absolutely requires this.
As we've seen before, POSIX talks about
'file serial number(s)' instead of inode numbers. The best definition
of these is covered in sys/stat.h,
where we see that a 'file identity' is uniquely determined by the
combination of the inode number and the device ID (st_dev), and
POSIX says that 'at any given time in a system, distinct files shall
have distinct file identities' while hardlinks have the same identity.
The POSIX description of readdir()
and dirent.h
don't caveat the d_ino file serial numbers from readdir(),
so they're implicitly covered by the general rules for file serial
numbers.
In theory you can claim that the POSIX guarantees don't apply here
since readdir() is only supplying d_ino, the file serial number,
not the device ID as well. I maintain that this fails due to a POSIX
requirement:
[...] The value of the structure's d_ino member shall be set to
the file serial number of the file named by the d_name member.
[...]
If readdir() gives one file serial number and a fstatat()
of the same name gives another, a plain reading of POSIX is that
one of them is lying. Files don't have two file serial numbers,
they have one. Readdir() can return duplicate d_ino numbers for
files that aren't hardlinks to each other (and I think legitimately
may do so in some unusual circumstances), but it can't return
something different than what fstatat() does for the same name.
The perverse argument here turns on POSIX's 'at any given time'.
You can argue that the readdir() is at one time and the stat() is
at another time and the system is allowed to entirely change file
serial numbers between the two times. This is certainly not the
intent of POSIX's language but I'm not sure there's anything in the
standard that rules it out, even though it makes file serial numbers
fairly useless, since there's then no POSIX way to get a bunch of them
at 'a given time' and so have them be coherent with each other.
So to summarize, OverlayFS has chosen what are effectively non-POSIX
semantics for its readdir() inode numbers (under some circumstances,
in the interests of performance) and Bash used readdir()'s d_ino
in a traditional Unix way that caused it to notice. Unix filesystems
can depart from POSIX semantics if they want, but I'd prefer if
they were a bit more shamefaced about it. People (ie, programs)
count on those semantics.
(The truly traditional getcwd() way
wouldn't have been a problem, because it predates readdir() having
d_ino and so doesn't use it (it stat()s everything to get inode
numbers). I reflexively follow this pre-d_ino algorithm when
I'm talking about doing getcwd() by hand (cf), but these days you want to use the
dirent d_ino and if possible d_type, because they're much
more efficient than stat()'ing everything.)
Historically, Unix mail programs (what we call 'mail clients' or
'mail user agents' today) have had two different approaches to
handling your email, what I'll call the shared approach and the
exclusive approach, with the shared approach being the dominant
one. To explain the shared approach, I have to back up to talk about
what Unix mail transfer agents (MTAs) traditionally did. When a
Unix MTA delivered email to you, at first it delivered email into
a single file in a specific location (such as '/usr/spool/mail/<login>')
in a specific format, initially mbox; even then, this could be
called your 'inbox'. Later, when the maildir mailbox format became
popular, some MTAs gained the ability to deliver to maildir format
inboxes.
(There have been a number of Unix mail spool formats over the
years, which I'm not going to try to get into here.)
A 'shared' style mail program worked directly with your inbox in
whatever format it was in and whatever location it was in. This is
how the V7 'mail' program worked,
for example. Naturally these programs didn't have to work on your
inbox; you could generally point them at another mailbox in the
same format. I call this style 'shared' because you could use any
number of different mail programs (mail clients) on your mailboxes,
providing that they all understood the format and also provided that
all of them agreed on how to lock your mailbox against modifications,
including against your system's MTA delivering new email right at the
point where your mail program was, for example, trying to delete some.
(Locking issues are one of the things that maildir was designed
to help with.)
An 'exclusive' style mail program (or system) was designed to own
your email itself, rather than try to share your system mailbox.
Of course it had to access your system mailbox a bit to get at your
email, but broadly the only thing an exclusive mail program did
with your inbox was pull all your new email out of it, write it
into the program's own storage format and system, and then usually
empty out your system inbox. I call this style 'exclusive' because
you generally couldn't hop back and forth between mail programs
(mail clients) and would be mostly stuck with your pick, since your
main mail program was probably the only one that could really work
with its particular storage format.
(Pragmatically, only locking your system mailbox for a short period
of time and only doing simple things with it tended to make things
relatively reliable. Shared style mail programs had much more room
for mistakes and explosions, since they had to do more complex
operations, at least on mbox format mailboxes. Being easy to modify
is another advantage of the maildir format, since it outsources
a lot of the work to your Unix filesystem.)
This shared versus exclusive design choice turned out to have some
effects when mail moved to being on separate servers and accessed
via POP and then later IMAP. My impression is that 'exclusive'
systems coped fairly well with POP, because the natural operation
with POP is to pull all of your new email out of the server and
store it locally. By contrast, shared systems coped much better
with IMAP than exclusive ones did, because IMAP is inherently a
shared mail environment where your mail stays on the IMAP server
and you manipulate it there.
(Since IMAP is the dominant way that mail clients/user agents get
at email today, my impression is that the 'exclusive' approach is
basically dead at this point as a general way of doing mail clients.
Almost no one wants to use an IMAP client that immediately moves
all of their email into a purely local data storage of some sort;
they want their email to stay on the IMAP server and be accessible
from and by multiple clients and even devices.)
Most classical Unix mail clients are 'shared' style programs, things
like Alpine, Mutt, and the basic Mail program. One major 'exclusive'
style program, really a system, is (N)MH (also). MH is somewhat notable because in
its time it was popular enough that a number of other mail programs
and mail systems supported its basic storage format to some degree
(for example, procmail can deliver messages to MH-format directories,
although it doesn't update all of the things that MH would do in
the process).
Another major source of 'exclusive' style mail handling systems is
GNU Emacs. I believe that both rmail and
GNUS
normally pull your email from your system inbox into their own
storage formats, partly so that they can take exclusive ownership
and don't have to worry about locking issues with other mail clients.
GNU Emacs has a number of mail reading environments (cf,
also) and I'm not sure what the
others do (apart from MH-E, which is a frontend on (N)MH).
(There have probably been other 'exclusive' style systems. Also,
it's a pity that as far as I know, MH never grew any support for
keeping its messages in maildir format directories, which are
relatively close to MH's native format.)
One of the traditional rites of passage for Linux system administrators
is having a daemon not work in the normal system configuration (eg,
when you boot the system) but work when you manually run it as root.
The classical cause of this on Unix was that $PATH wasn't fully set
in the environment the daemon was running in but was in your root
shell. On Linux, another traditional cause of this sort of thing
has been SELinux and a more modern source (on
Ubuntu) has sometimes been AppArmor. All of these create hard to
see differences between your root shell (where the daemon works
when run by hand) and the normal system environment (where the
daemon doesn't work). These days, we can add another cause, an
increasingly common one, and that is systemd service unit
restrictions, many of which are
covered in systemd.exec.
(One pernicious aspect of systemd as a cause of these restrictions is
that they can appear in new releases of the same distribution. If a
daemon has been running happily in an older release and now has surprise
issues in a new Ubuntu LTS, I don't always remember to look at its
.service file.)
Some of systemd's protective directives simply cause failures to
do things, like access user home directories if ProtectHome=
is set to something appropriate. Hopefully your daemon complains
loudly here, reporting mysterious 'permission denied' or 'file not
found' errors. Some systemd settings can have additional, confusing
effects, like PrivateTmp=.
A standard thing I do when troubleshooting a chain of programs
executing programs executing programs is to shim in diagnostics
that dump information to /tmp, but with PrivateTmp= on, my debugging
dump files are mysteriously not there in the system-wide /tmp.
(On the other hand, a daemon may not complain about missing files
if it's expected that the files aren't always there. A mailer
usually can't really tell the difference between 'no one has .forward
files' and 'I'm mysteriously not able to see people's home
directories to find .forward files in them'.)
Sometimes you don't get explicit errors, just mysterious failures
to do some things. For example, you might set IP address access
restrictions with the intention of blocking inbound connections but
wind up also blocking DNS queries (and
this will also depend on whether or not you use systemd-resolved).
The good news is that you're mostly not going to find standard
systemd .service files for normal daemons shipped by your Linux
distribution with IP address restrictions. The bad news is that at
some point .service files may start showing up that impose IP address
restrictions with the assumption that DNS resolution is being done
via systemd-resolved as opposed to direct DNS queries.
(I expect some Linux distributions to resist this, for example
Debian, but others may declare that using systemd-resolved is now
mandatory in order to simplify things and let them harden service
configurations.)
Right now, you can usually test if this is the problem by creating
a version of the daemon's .service file with any systemd restrictions
stripped out of it and then seeing if using that version makes life
happy. In the future it's possible that some daemons will assume
and require some systemd restrictions (for instance, assuming that
they have a /tmp all of their own), making things harder to test.
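(Two quick things to do here, with 'mydaemon' standing in for whatever
daemon you're fighting with: ask systemd how locked down the unit is,
and make a temporary copy of the unit with the restrictions stripped
out:
systemd-analyze security mydaemon.service
systemctl edit --runtime --full mydaemon.service   # delete the Protect*/Private*/IPAddress* lines, then restart
)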
On at least x86 PCs, Linux text consoles
('TTY' consoles or 'virtual consoles') support some surprising
things. One of them is doing some useful stuff with your mouse, if
you run an additional daemon such as gpm or the
more modern consolation. This is
supported on both framebuffer consoles
and old 'VGA' text consoles. The experience is fairly straightforward;
you install and activate one of the daemons, and afterward you can
wave your mouse around, select and paste text, and so on. How it
works and what you get is not as clear, and since I recently went
diving into this area for reasons, I'm going to
write down what I now know before I forget it (with a focus on how
consolation works).
The quick summary is that the console TTY's mouse support is broadly
like a terminal emulator. With a mouse daemon active, the TTY will
do "copy and paste" selection stuff on its own. A mouse aware text
mode program can put the console into a mode where mouse button
presses are passed through to the program, just as happens in xterm
or other terminal emulators.
The simplest TTY mode is when a non-mouse-aware program
or shell is active, which is to say a program that wouldn't try to
intercept mouse actions itself if it was run in a regular terminal
window and would leave mouse stuff up to the terminal emulator. In
this mode, your mouse daemon reads mouse input events and then uses
sub-options of the TIOCLINUX ioctl
to inject activities into the TTY, for example telling it to 'select'
some text and then asking it to paste that selection to some file
descriptor (normally the console itself, which delivers it to
whatever foreground program is taking terminal input at the time).
(In theory you can use the mouse to scroll text back and forth, but
in practice that was removed in 2020, both for the framebuffer
console
and for the VGA console.
If I'm reading the code correctly, a VGA console might still have
a little bit of scrollback support depending on how much spare VGA
RAM you have for your VGA console size. But you're probably not
using a VGA console any more.)
The other mode the console TTY can be in is one where some program
has used standard xterm-derived escape sequences
to ask for xterm-compatible "mouse tracking", which is the same
thing it might ask for in a terminal emulator if it wanted to handle
the mouse itself. What this does in the kernel TTY console driver
is set a flag that your mouse daemon can query with
TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't
directly handle or look at mouse events. Instead, consolation (or
gpm) reads the flag and, when the flag is set, uses the
TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL
sub-option to report the mouse position and button presses to the
kernel (instead of handling mouse activity itself). The kernel then
turns around and sends mouse reporting escape codes to the TTY, as
the program asked for.
A mouse daemon like consolation doesn't have to pay attention to
the kernel's TTY 'mouse reporting' flag. As far as I can tell from
the current Linux kernel code, if the mouse daemon ignores the flag
it can keep on doing all of its regular copy and paste selection
and mouse button handling. However, sending mouse reports is only
possible when a program has specifically asked for it; the kernel
will report an error if you ask it to send a mouse report at the
wrong time.
(As far as I can see there's no notification from the kernel to
your mouse daemon that someone changed the 'mouse reporting' flag.
Instead you have to poll it; it appears consolation does this every
time through its event loop before it handles any mouse events.)
PS: Some documentation on console mouse reporting was written as
a 2020 kernel documentation patch
(alternate version)
but it doesn't seem to have made it into the tree. According
to various sources, eg,
the mouse daemon side of things can only be used by actual mouse
daemons, not by programs, although programs do sometimes use other
bits of TIOCLINUX's mouse stuff.
PPS: It's useful to install a mouse daemon on your desktop or laptop
even if you don't intend to ever use the text TTY. If you ever wind
up in the text TTY for some reason, perhaps because your regular
display environment has exploded, having mouse cut and paste is a
lot nicer than not having it.
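(On Debian and Ubuntu this amounts to installing the package; I believe
the packaged service gets enabled for you, but it doesn't hurt to check:
apt-get install consolation
systemctl status consolation.service
)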
One of the things that Fedora is trying to do in Fedora 42 is
unifying /usr/bin and /usr/sbin. In an
ideal (Fedora) world, your Fedora machines will have /usr/sbin be
a symbolic link to /usr/bin after they're upgraded to Fedora 42.
However, if your Fedora machines have been around for a while, or
perhaps have some third party packages installed, what you'll
actually wind up with is a /usr/sbin that is mostly symbolic links
to /usr/bin but still has some actual programs left.
One source of these remaining /usr/sbin programs is old packages
from past versions of Fedora that are no longer packaged in Fedora
41 and Fedora 42. Old packages are usually harmless, so it's easy
for them to linger around if you're not disciplined; my home and
office desktops (which have been around for a while) still have packages from as far back as
Fedora 28.
(An added complication of tracking down file ownership is that some
RPMs haven't been updated for the /sbin to /usr/sbin merge and so
still believe that their files are /sbin/<whatever> instead of
/usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find
these.)
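(So before declaring a program an orphan, it's worth asking rpm about
both names; 'foo' is a stand-in here:
rpm -qf /usr/sbin/foo
rpm -qf /sbin/foo
)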
Obviously, you shouldn't remove old packages without being sure of
whether or not they're important to you. I'm also not completely
sure that all packages in the Fedora 41 (or 42) repositories are
marked as '.fc41' or '.fc42' in their RPM versions, or if there are
some RPMs that have been carried over from previous Fedora versions.
Possibly this means I should wait until a few more Fedora versions
have come to pass so that other people find and fix the exceptions.
(On what is probably my cleanest Fedora 42 test virtual machine,
there are a number of packages that 'dnf list --extras' doesn't
list that have '.fc41' in their RPM version. Some of them may
have been retained un-rebuilt for binary compatibility reasons.
There's also the 'shim' UEFI bootloaders, which date from 2024
and don't have Fedora releases in their RPM versions, but those
I expect to basically never change once created. But some others
are a bit mysterious, such as 'libblkio', and I suspect that they
may have simply been missed by the Fedora 42 mass rebuild.)
PS: In theory anyone with access to the full Fedora 42 RPM repository
could sweep the entire thing to find packages that still install
/usr/sbin files or even /sbin files, which would turn up any relevant
not yet rebuilt packages. I don't know if there's any easy way to
do this through dnf commands, although I think dnf does have access
to a full file list for all packages (which is used for certain dnf
queries).
One of the changes in Fedora Linux 42 is unifying /usr/bin and
/usr/sbin,
by moving everything in /usr/sbin to /usr/bin. To some people, this
is probably anathema, and to be honest, my first reaction
was to bristle at the idea. However, the more I thought about it,
the more I had to concede that the idea of /usr/sbin has failed in
practice.
We can tell /usr/sbin has failed in practice by asking how many
people routinely operate without /usr/sbin in their $PATH. In a lot
of environments, the answer is that very few people do, because
sooner or later you run into a program that you want to run (as
yourself) to obtain useful information or do useful things. Let's
take FreeBSD 14.3 as an illustrative example (to make this not a
Linux biased entry); looking at /usr/sbin, I recognize iostat,
manctl (you might use it on your own manpages), ntpdate (which can
be run by ordinary people to query the offsets of remote servers),
pstat, swapinfo, and traceroute. There are probably others that I'm
missing, especially if you use FreeBSD as a workstation and so care
about things like sound volumes and keyboard control.
(And if you write scripts and want them to send email, you'll care
about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin.
There's also DTrace, but I don't know if you can DTrace your own
binaries as a non-root user on FreeBSD.)
For a long time, there has been no strong organizing principle to
/usr/sbin that would draw a hard line and create a situation where
people could safely leave it out of their $PATH. We could have had
a principle of, for example, "programs that don't work unless run
by root", but no such principle was ever followed for very long (if
at all). Instead programs were more or less shoved in /usr/sbin if
developers thought they were relatively unlikely to be used by
normal people. But 'relatively unlikely' is not 'never', and shortly
after people got told to 'run traceroute' and got 'command not
found' when they tried, /usr/sbin (probably) started appearing in
$PATH.
(And then when you asked 'how does my script send me email about
something', people told you about /usr/sbin/sendmail and another
crack appeared in the wall.)
If /usr/sbin is more of a suggestion than a rule and it appears in
everyone's $PATH because no one can predict which programs you want
to use will be in /usr/sbin instead of /usr/bin, I believe this
means /usr/sbin has failed in practice. What remains is an unpredictable
and somewhat arbitrary division between two directories, where which
directory something appears in operates mostly as a hint (a hint
that's invisible to people who don't specifically look where a
program is).
(This division isn't entirely pointless and one could try to reform
the situation in a way short of Fedora 42's "burn the entire thing
down" approach. If nothing else the split keeps the size of both
directories somewhat down.)
PS: The /usr/sbin like idea that I think is still successful in
practice is /usr/libexec. Possibly a bunch of things in /usr/sbin
should be relocated to there (or appropriate subdirectories of it).
I upgrade Fedora on my office and home workstations through an
online upgrade with dnf,
and as part of this I read (or at least scan) DNF's output to look
for problems. Usually this goes okay, but DNF5 has a general
problem with script output and when
I did a test upgrade from Fedora 41 to Fedora 42 on a virtual
machine, it generated a huge amount of repeated output from a script
run by selinux-policy-targeted, repeatedly reporting "Old compiled
fcontext format, skipping" for various .bin files in
/etc/selinux/targeted/contexts/files. The volume of output made the
rest of DNF's output essentially unreadable. I would like to avoid
this when I actually upgrade my office and home workstations to
Fedora 42 (which I still haven't done, partly because of this issue).
The 'targeted' policy is one of several SELinux policies
that are supported or at least packaged by Fedora (although I suspect
I might see similar issues with the other policies too). My main
machines don't use SELinux and I have it completely disabled, so
in theory I should be able to remove the selinux-policy-targeted
package to stop it from repeatedly complaining during the Fedora
42 upgrade process. In practice, selinux-policy-targeted is a
'protected' package that DNF will normally refuse to remove. Such
packages are listed in /etc/dnf/protected.d/ in various .conf files;
selinux-policy-targeted installs (well, includes) a .conf file to
protect itself from removal once installed.
(Interestingly, sudo protects itself but there's nothing specifically
protecting su and the rest of util-linux. I suspect util-linux is
so pervasively a dependency that other protected things hold it
down, or alternately no one has ever worried about people removing
it and shooting themselves in the foot.)
I can obviously remove this .conf file and then DNF will let me
remove selinux-policy-targeted, which will force the removal of
some other SELinux policy packages (both selinux-policy packages
themselves and some '*-selinux' sub-packages of other packages).
I tried this on another Fedora 41 test virtual machine and nothing
obvious broke, but that doesn't mean that nothing broke at all. It
seems very likely that almost no one tests Fedora without the
selinux-policy collective installed and I suspect it's not a supported
configuration.
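(Mechanically the removal is simple enough, something like:
grep -l selinux-policy /etc/dnf/protected.d/*.conf
rm /etc/dnf/protected.d/<whichever file that is>
dnf remove selinux-policy-targeted
)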
I could reduce my risks by removing the packages only just before
I do the upgrade to Fedora 42 and put them back later (well, unless
I run into a dnf issue as a result,
although that issue is from 2024). Also, now that I've investigated
this, I could in theory delete the
.bin files in /etc/selinux/targeted/contexts/files before the
upgrade, hopefully making it so that selinux-policy-targeted has
less or nothing to complain about. Since I'm not using SELinux,
hopefully the lack of these files won't cause any problems, but of
course this is less certain a fix than removing selinux-policy-targeted
(for example, perhaps the .bin files would get automatically rebuilt
early on in the upgrade process as packages are shuffled around,
and bring the problem back with them).
Really, though, I wish DNF5 didn't have its problem with script
output. All of this is hackery to deal with that underlying issue.
The default behavior of a stock Ubuntu LTS server install is that
it enables 'unattended upgrades', by installing the package
unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades,
which controls this). Historically, we
haven't believed in unattended automatic package upgrades and
eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential
advantages; in practice it mostly results in package upgrades being
applied after some delay that depends on when they come out relative
to working days.
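(For reference, 20auto-upgrades normally contains just the two apt
settings that turn this on, something like:
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
Setting the second one to "0", or removing the package, is enough to
turn unattended upgrades back off.)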
I have a few machines that actually are stock Ubuntu servers, for
reasons outside the scope of this entry. These machines naturally
have automated upgrades turned on and one of them (in a cloud, using
the cloud provider's standard Ubuntu LTS image) even appears to
automatically reboot itself if kernel updates need that. These
machines are all in undemanding roles (although one of them is my
work IPv6 gateway), so they aren't
necessarily indicative of what we'd see on more complex machines,
but none of them have had any visible problems from these unattended
upgrades.
(I also can't remember the last time that we ran into a problem
with updates when we applied them. Ubuntu updates still sometimes
have regressions and other problems, forcing them to be reverted
or reissued, but so far we haven't seen problems ourselves; we find
out about these problems only through the notices in the Ubuntu
security lists.)
If we were starting from scratch today in a greenfield environment,
I'm not sure we'd bother building our automation for manual package
updates. Since we have the automation and it offers various extra
features (even if they're rarely used), we're probably not going
to switch over to automated upgrades (including in our local build
of Ubuntu 26.04 LTS when that comes out next year).
(The advantage of switching over to standard unattended upgrades
is that we'd get rid of a local tool that, like all local tools,
is all our responsibility. The less local weird things we have,
the better, especially since we have so many as it is.)
The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon
desktop appeared to support it, except they weren't working for me.
Now I actually understand what's going on, more or less, and how
to solve my problem of a program complaining that
it needed AppIndicator.
Cinnamon directly implements the AppIndicator notification protocol
in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as
part of your (Cinnamon) session. However, it has a little feature,
namely that it will exit if no one is asking it to do anything:
XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s
In a normally functioning Cinnamon environment, something will soon
show up to be an active monitor and stop xapp-sn-watcher from
exiting:
Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown
This something is a standard Cinnamon desktop applet. In System
Settings → Applets, it's way down at the bottom and is called "XApp
Status Applet". If you've accidentally wound up with it not turned
on, xapp-sn-watcher will (probably) not have a monitor active after
30 seconds, and then it will exit (and in the process of exiting,
it will log alarming messages about failed GLib assertions). Not
having this xapp-status applet turned on was my problem, and turning
it on fixed things.
(I don't know how it got turned off. It's possible I went through
the standard applets at some point and turned some of them off in
an excess of ignorant enthusiasm.)
As I found out from leigh scott in my Fedora bug report, the way to
get this debugging output from xapp-sn-watcher is to run 'gsettings
set org.x.apps.statusicon sn-watcher-debug true'. This will cause
xapp-sn-watcher to log various helpful and verbose things to your
~/.xsession-errors (although apparently not the fact that it's
actually exiting; you have to deduce that from the timestamps
stopping 30 seconds later and that being the timestamps on the GLib
assertion failures).
(I don't know why there's both a program and an applet involved
in this and I've decided not to speculate.)
Suppose, not hypothetically, that you start up some program on your
Fedora 42 Cinnamon
desktop and it helpfully tells you "<X> requires AppIndicator to
run. Please install the AppIndicator plugin for your desktop". You
are likely confused, so here are some notes.
'AppIndicator' itself is the name of an application notification
protocol, apparently originally from KDE, and some desktop environments
may need a (third party) extension to support it, such as the
Ubuntu one for GNOME Shell.
Unfortunately for me,
Cinnamon is not one of those desktops. It theoretically has native
support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher,
part of Cinnamon's xapps package.
The actual 'AppIndicator' protocol is done over D-Bus, because that's
the modern way. Since this started as a KDE thing, the D-Bus name is
'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is
found in /usr/share/dbus-1/services, but not all names are mentioned
there and 'org.kde.StatusNotifierWatcher' is one of the missing ones.
In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the
D-Bus name in its 'Comment=', but that's probably not something you
can count on to find what your desktop is (theoretically) using to
provide a given D-Bus name. I found xapp-sn-watcher somewhat through
luck.
There are probably a number of ways to see what D-Bus names are
currently registered and active. The one that I used when looking
at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus
/org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as
I know, there's no easy way to go from an error message about
'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher';
in my case I read the source of the thing complaining which was helpfully
in Python.
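(If you already have a guess at the name, you can narrow down the
dbus-send output with a plain grep, for example:
dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i statusnotifier
This is only a convenience; it still won't tell you what program is
supposed to be providing the name.)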
I have no idea how to actually fix the problem, or if there is a
program that implements org.kde.StatusNotifierWatcher as a generic,
more or less desktop independent program the way that stalonetray does for system tray
stuff (or one generation of system tray stuff, I think there have
been several iterations of it, cf).
(Yes, I filed a Fedora bug, but I believe
Cinnamon isn't particularly supported by Fedora so I don't expect
much. I also built the latest upstream xapps tree and it also appears
to fail in the same way. Possibly this means something in the rest of
the system isn't working right.)
So, suppose that you have a brand new nflog version of OpenBSD's
pflog, so you can use tcpdump to watch
dropped packets (or in general, logged packets). And further suppose
that you specifically want to see DNS requests to your port 53. So of course you do:
# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented
Perhaps we can get clever by reading from the interface in one
tcpdump and sending it to another to be interpreted, forcing the
pcap filter to be handled entirely in user space instead of the
kernel:
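# tcpdump -n -i nflog:30 -w - | tcpdump -n -r - 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented
(That's roughly the command I mean, and it fails the same way, because
it's the NFLOG link-layer type itself that libpcap can't build filters
for, regardless of whether the filter would run in the kernel or in
user space.)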
As far as I can determine, what's going on here is that the netfilter
log system, 'NFLOG', uses a 'packet' format that isn't the same as
any of the regular formats (Ethernet, PPP, etc) and adds some
additional (meta)data about the packet to every packet you capture.
I believe the various attributes this metadata can contain are
listed in the kernel's nfnetlink_log.h.
(I believe it's not technically correct to say that this additional
stuff is 'before' the packet; instead I believe the packet is
contained in a NFULA_PAYLOAD attribute.)
Unfortunately for us, tcpdump (or more exactly libpcap) doesn't
know how to create packet capture filters for this
format, not even ones that are interpreted entirely in user space
(as happens when tcpdump reads from a file).
I believe that you have two options. First, you can use tshark with a display
filter, not a capture filter:
# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]
(Tshark capture filters are subject to the same libpcap inability
to work on NFLOG formatted packets as tcpdump has.)
Alternately and probably more conveniently, you can tell tcpdump
to use the 'IPV4' datalink type instead of the default, as mentioned
in (opaque) passing in the tcpdump manual page:
# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
NFLOG (Linux netfilter log messages)
IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]
Of course this is only applicable if you're only doing IPv4. If you
have some IPv6 traffic that you want to care about, I think you
have to use tshark display filters (which means learning how to
write Wireshark display filters, something I've avoided so far).
I think there is some potentially useful information in the extra
NFLOG data, but to get it or to filter on it I think you'll need
to use tshark (or Wireshark) and consult the NFLOG display filter
reference,
although that doesn't seem to give you access to all of the NFLOG
stuff that 'tshark -i nflog:30 -V' will print about packets.
(Or maybe the trick is that you need to match 'nflog.tlv_type ==
<whatever> and nflog.tlv_value == <whatever>'. I believe that
some NFLOG attributes are available conveniently, such as 'nflog.prefix',
which corresponds to NFULA_PREFIX. See packet-nflog.c.)
OpenBSD's and FreeBSD's PF system has a very convenient 'pflog'
feature, where you put in a 'log' bit in a PF rule and this
dumps a copy of any matching packets into a pflog pseudo-interface, where you can
both see them with 'tcpdump -i pflog0' and have them automatically
logged to disk by pflogd in
pcap format. Typically we use this to log blocked packets, which
gives us both immediate and after the fact visibility of what's
getting blocked (and by what rule,
also). It's possible to mostly
duplicate this in Linux nftables,
although with more work and there's less documentation on it.
The first thing you need is nftables rules with one or two log
statements
of the form 'log group <some number>'. If you want to be able to
both log packets for later inspection and watch them live, you need
two 'log group' statements with different numbers; otherwise you
only need one. You can use different (group) numbers on different
nftables rules if you want to be able to, say, look only at accepted
but logged traffic or only dropped traffic. In the end this might
wind up looking something like:
tcp dport ssh counter log group 30 log group 31 drop;
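(For context, that fragment lives inside a chain in some table; a
minimal sketch of a full ruleset using it might look like:
table inet filter {
	chain input {
		type filter hook input priority 0; policy accept;
		tcp dport ssh counter log group 30 log group 31 drop;
	}
}
with the table and chain being whatever you already use locally.)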
As the nft manual page will tell you, this uses the kernel
'nfnetlink_log' to forward the 'logs' (packets) to a netlink
socket, where exactly one process (at most) can subscribe to a
particular group to receive those logs (ie, those packets). If we
want to both log the packets and be able to tcpdump them, we need
two groups so we can have ulogd getting one and
tcpdump getting the other.
To see packets from any particular log group, we use the special
'nflog:<N>' pseudo-interface that's hopefully supported by your
Linux version of tcpdump. This is used as 'tcpdump -i nflog:30
...' and works more or less like you'd want it to. However, as
far as I know there's no way to see meta-information about the
nftables filtering, such as what rule was involved or what the
decision was; you just get the packet.
To log the packets to disk for later use, the default program is
ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic
as OpenBSD's and FreeBSD's pf logging; instead you have to configure
it in /etc/ulogd.conf, and on Ubuntu make sure you have the
'ulogd2-pcap' package installed (along with ulogd2 itself). Based
merely on getting it to work, what you want in /etc/ulogd.conf is
the following three bits:
# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP
# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6
# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0
(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically
rotated by logrotate, although I think by default it's only weekly,
so you might want to make it daily.)
The ulogd documentation suggests that you will need to capture IPv4
and IPv6 traffic separately, but I've only used this on IPv4 traffic
so I don't know. This may imply that you need separate nftables
rules to log (and drop) IPv6 traffic so that you can give it a
separate group number for ulogd (I'm not sure if it needs a separate
one for tcpdump or if tcpdump can sort it out).
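(If you do want IPv6 too, I believe the ulogd side is just a second
stack with its own group number and 'addressfamily=10', something like:
stack=log32:NFLOG,base1:BASE,pcap32:PCAP
[log32]
group=32
addressfamily=10
[pcap32]
file="/var/log/ulog/ulogd-v6.pcap"
sync=0
but I haven't tested this, and the group number and file name here are
made up.)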
Ulogd can also log to many different things than PCAP format,
including JSON and databases. It's possible that there are ways to
enrich the ulogd pcap logs, or maybe just the JSON logs, with
additional useful information such as the network interface involved
and other things. I find the ulogd documentation somewhat opaque
on this (and also it's incomplete), and I haven't experimented.
(According to this,
the JSON logs can be enriched or maybe default to that.)
Given the assorted limitations and other issues with ulogd, I'm
tempted to not bother with it and only have our nftables setups
support live tcpdump of dropped traffic with a single 'log group
<N>'. This would save us from the assorted annoyances of ulogd2.
PS: One reason to log to pcap format files is that then you can
use all of the tcpdump filters that you're already familiar with
in order to narrow in on (blocked) traffic of interest, rather
than having to put together a JSON search or something.
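(For example, 'tcpdump -n -r /var/log/ulog/ulogd.pcap port 53' should
work the way you'd expect, since as far as I can tell the packets that
ulogd saves are plain IP packets rather than NFLOG-wrapped ones.)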
These days, nftables is the Linux
network firewall system that you want to use, and especially it's
the system that Ubuntu will use by default even if you use the
'iptables' command. The nft command is
the official interface to nftables, and it has a 'nft list ruleset'
sub-command that will list your NFT rules. Since iptables rules are
implemented with nftables, you might innocently expect that 'nft list
ruleset' will show you the proper NFT syntax to achieve your current
iptables rules.
Well, about that:
# iptables -vL INPUT
[...] target prot opt in out source destination
[...] ACCEPT tcp -- any any anywhere anywhere match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]
This represents an xt statement from xtables compat interface. It is a
fallback if translation is not available or not complete. Seeing this
means the ruleset (or parts of it) were created by iptables-nft and
one should use that to manage it.
Nftables has a native set type (and
also maps),
but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets
by the iptables compatibility layer. Instead the compatibility layer
uses this 'xt match' magic that the nft command can only imperfectly
tell you about. To nft's credit, it prints a warning comment (which
I've left out) that the rules are being managed by iptables-nft and
you shouldn't touch them. Here, all of the 'xt match "set"' bits
in the nft output are basically saying "opaque stuff happens here".
This still makes me a little bit sad because it makes it that bit
harder to bootstrap my nftables knowledge from what iptables rules
convert into. If I wanted to switch to nftables rules and nftables
sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative
scratch instead of getting to clean up what the various translation
tools would produce or report.
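(For what it's worth, my understanding is that the hand-written native
version would use named nftables sets, something like:
nft add set inet filter nfsclients '{ type ipv4_addr; }'
nft add element inet filter nfsclients '{ 192.0.2.10, 192.0.2.11 }'
nft add rule inet filter input tcp dport @nfsports ip saddr @nfsclients accept
with a similar 'nfsports' set of type inet_service, and with the table
and chain assumed to already exist. The addresses are illustrative and
this is a sketch, not rules I've actually converted to.)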
(As a side effect it makes it less likely that I'll convert various
iptables things to being natively nft/nftables based, because I
can't do a fully mechanical conversion. If they still work with
iptables-nft, I'm better off leaving them as is. Probably this also
means that iptables-nft support is likely to have a long, long
life.)
I use an eccentric X 'desktop' that
is not really a desktop as such in the usual sense but instead a
window manager and various programs that I run
(as a sysadmin, there's a lot of terminal windows). One of the ways
that my desktop is unusual is in how I exit from my X session.
First, I don't use xdm or any other graphical login
manager; instead I run my session through xinit. When
you use an xinit based session, you give xinit a program or a script
to run, and when the program exits, xinit terminates the X server
and your session.
(If you give xinit a shell script, whatever foreground program the
script ends with is your keystone program.)
Traditionally, this keystone program for your X session was your
window manager. At one level this makes a lot of sense; your window
manager is basically the core of your X session anyway, so you might
as well make quitting from it end the session. However, for a very
long time I've used a do-nothing iconified xterm running a shell
as my keystone program.
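(As an illustration of the structure, not my actual setup, the tail
end of such an xinit script looks something like:
xrdb -merge $HOME/.Xresources
fvwm &
# ... start various other programs ...
exec xterm -iconic -name console
The 'exec xterm' at the end is the keystone; everything else either
runs to completion or is started in the background first.)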
The minor advantage to having an otherwise unused xterm as my session
keystone program is that I can start my window manager basically
at the start of my (rather complex) session startup, so that I can
immediately have it manage all of the other things I start (technically
I run a number of commands to set up X settings before I start fvwm,
but it's the first program I start that will actually show anything
on the screen). The big advantage is that using something else as
my keystone program means that I can kill and restart my window
manager if something goes badly wrong, and more generally that I
don't have to worry about restarting it. This doesn't happen very
often, but when it does happen I'm very glad that I can recover my
session instead of having to abruptly terminate everything. And
should I have to terminate fvwm, this 'console' xterm is a convenient
idle xterm in which to restart it (or in general, any other program
of my session that needs restarting).
(The 'console' xterm is deliberately placed up at the top of the
screen, in an area that I don't normally put non-fvwm windows in,
so that if fvwm exits and everything de-iconifies, it's highly
likely that this xterm will be visible so I can type into it. If
I put it in an ordinary place, it might wind up covered up by a
browser window or another xterm or whatever.)
I don't particularly have to use an (iconified) xterm with a shell
in it; I could easily have written a little Tk program that displayed
a button saying 'click me to exit'. However, the problem with such
a program (and the advantage of my 'console' xterm) is that it would
be all too easy to accidentally click the button (and force-end my
session). With the iconified xterm, I need to do a bunch of steps
to exit; I have to deiconify that xterm, focus the window, and
Ctrl-D the shell to make it exit (causing the xterm to exit). This
is enough out of the way that I don't think I've ever done it by
accident.
PS: I believe modern desktop environments like GNOME, KDE, and
Cinnamon have moved away from making their window manager be the
keystone program and now use a dedicated session manager program
that things talk to. One reason for this may be that modern desktop
shells seem to be rather more prone to crashing for various reasons,
which would be very inconvenient if that ended your session. This
isn't all bad, at least if there's a standard D-Bus protocol for
ending a session so that you can write an 'exit the session' thing
that will work across environments.
In my entry on getting decent error reports in Bash for 'set -e', I said that even if you were
on a system where /bin/sh was Bash and so my entry worked if you
started your script with '#!/bin/sh', you should use '#!/bin/bash'
instead for various reasons. A commentator took issue with this
direct invocation of Bash and suggested '#!/usr/bin/env bash'
instead. It's my view that using env this way, especially for Bash,
is rarely useful and thus is almost always unnecessary and pointless
(and sometimes dangerous).
The only reason to start your script with '#!/usr/bin/env <whatever>'
is if you expect your script to run on a system where Bash or
whatever else isn't where you expect (or when it has to run on
systems that have '<whatever>' in different places, which is probably
most common for third party packages). Broadly speaking this only
happens if your script is portable and will run on many different
sorts of systems. If your script is specific to your systems (and
your systems are uniform), this is pointless; you know where Bash
is and your systems aren't going to change it, not if they're sane.
The same is true if you're targeting a specific Linux distribution,
such as 'this is intrinsically an Ubuntu script'.
(In my case, the script I was doing this to is intrinsically
specific to Ubuntu and our environment. It will never run on
anything else.)
It's also worth noting that '#!/usr/bin/env <whatever>' only works
if (the right version of) <whatever> can be found on your $PATH,
and in fact the $PATH of every context where you will run the script
(including, for example, from cron). If the system's default $PATH
doesn't include the necessary directories, this will likely fail
some of the time. This makes using 'env' especially dangerous in
an environment where people may install their own version of
interpreters like Python, because your script's use of 'env' may
find their Python on their $PATH instead of the version that you
expect.
(These days, one of the dangers with Python specifically is that
people will have a $PATH that (currently) points to a virtual
environment with some random selection of Python packages installed
and not installed, instead of the system set of packages.)
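(To put the difference concretely:
#!/usr/bin/env python3    (runs whatever 'python3' is first on $PATH, a venv included)
#!/usr/bin/python3        (always the system Python, or a clean failure)
The same applies to Bash, although people are less likely to have
private Bash builds sitting on their $PATH.)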
As a practical matter, pretty much every mainstream Linux distribution
has a /bin/bash (assuming that you install Bash, and I'm sorry, Nix
and so on aren't mainstream). If you're targeting Linux in general,
assuming /bin/bash exists is entirely reasonable. If a Linux
distribution relocates Bash, in my view the resulting problems are
on them. A lot of the time, similar things apply for other
interpreters, such as Python, Perl, Ruby, and so on. '#!/usr/bin/python3'
on Linux is much more likely to get you a predictable Python
environment than '#!/usr/bin/env python3', and if it fails it
will be a clean and obvious failure that's easy to diagnose.
Another issue is that even if your script is fixed to use 'env' to
run Bash, it may or may not work in such an alternate environment
because other things you expect to find in $PATH may not be there.
Unless you're actually testing on alternate environments (such as
Nix or FreeBSD), using 'env' may suggest more portability than
you're actually able to deliver.
My personal view is that for most people, '#!/usr/bin/env' is a
reflexive carry-over that they inherited from a past era of
multi-architecture Unix environments,
when much less was shipped with the system and so fewer things were in predictable
locations. In that past Unix era, using '#!/usr/bin/env python' was
a reasonably sensible thing; you could hope that the person who
wanted to run your script had Python, but you couldn't predict
where. For most people, those days are over, especially for scripts
and programs that are purely for your internal use and that you
won't be distributing to the world (much less inviting people to
run your 'written on X' script on a Y, such as a FreeBSD script
being run on Linux).
A commentator on my 2024 entry on the uncertain possible futures
of Unix graphical desktops brought up the
XLibre project. XLibre is ostensibly a fork of the X server that
will be developed by a new collection of people, which on the surface
sounds unobjectionable and maybe a good thing for people (like me)
who want X to keep being viable; as a result it has gotten a certain
amount of publicity from credulous sources who don't look behind
the curtain. Unfortunately for everyone, XLibre is an explicitly
political project,
and I don't mean that in the sense of disagreements about technical
directions (the sense that you could say that 'forking is a political
action', because it's the manifestation of a social disagreement).
Instead I mean it in the regular sense of 'political', which is that
the people involved in XLibre (especially its leader) have certain
social values and policies that they espouse, and the XLibre project
is explicitly manifesting some of them.
I am not going to summarize here; instead, you should read the
Register article and its links,
and also the relevant sections of Ariadne Conill's announcement
of Wayback
and their links. However, even if you "don't care" about politics,
you should see this correction to earlier XLibre changes where the person
making the earlier changes
didn't understand what '2^16' did in C (I would say that the
people who reviewed the changes also missed it, but there didn't
seem to be anyone doing so, which ought to raise your eyebrows when
it comes to the X server).
Using XLibre, shipping it as part of a distribution, or advocating for
it is not a neutral choice. To do so is to align yourself,
knowingly or unknowingly, with the politics of XLibre and with the
politics of its leadership and the people its leadership will attract
to the project. This is always true to some degree with any project,
but it's especially true when the project is explicitly manifesting
some of its leadership's values, out in the open. You can't detach
XLibre from its leader.
My personal view is that I don't want to have anything to do with
XLibre and I will think less of any Unix or Linux distribution that
includes it, especially ones that intend to make it their primary
X server. At a minimum, I feel those distributions haven't done
their due diligence.
In general, my personal guess is that a new (forked) standalone X
server is also the wrong approach to maintaining a working X server
environment over the long term. Wayback combined with XWayland seems
like a much more stable base because each of them has more support
in various ways (eg, there are a lot of people who are going to
want old X programs to keep working for years or decades to come
and so lots of demand for most of XWayland's features).
(This elaborates on my comment on XLibre in this entry. I also think that a viable X based environment
is far more likely to stop working due to important programs becoming
Wayland-only than because you can no longer get a working X server.)
Since Union finance minister Nirmala Sitharaman's announcement last week that India's Goods and Services Tax (GST) rates will be rationalised anew from September 22, I've been seeing a flood of pieces all in praise – and why not?
The GST regime has been somewhat controversial since its launch because, despite simplifying compliance for businesses and industry, it increased the costs for consumers. The Indian government exacerbated that pain point by undermining the fiscal federalism of the Union, increasing its revenues at the expense of states' as well as cutting allocations.
While there is (informed) speculation that the next Finance Commission will further undercut the devolution of funds to the states, GST 2.0 offers some relief to consumers in the form of making various products more affordable. Populism is popular, after all.
However, increasing affordability isn't always a good thing even if your sole goal is to increase consumption. This is particularly borne out in the food and nutrition domain.
For example, under the new tax regime, from September 22, the GST on pizza bread will slip from 5% to zero. This means both sourdough pizza bread and maida (refined flour) pizza bread will go from 5% to zero. However, because there is more awareness of maida as an ingredient in the populace and less so of sourdough, and because maida as a result enjoys a higher economy of scale and is thus less expensive (before tax), the demand for maida bread is likely to increase more than the demand for sourdough bread.
This is unfortunate: ideally, sourdough bread should be more affordable – or, alternatively, the two breads should be equally affordable as well as have threshold-based front-of-pack labelling. That is to say, liberating consumers to be able to buy new food products or more of the old ones without simultaneously empowering consumers to make more informed choices could tilt demand in favour of unhealthier foods.
Ultimately, the burden of non-communicable diseases in the population will increase, as will consumers' expenses on healthcare, dietary interventions, and so on. I explained this issue in The Hindu on September 9, 2025, and set out solutions that the Indian government must implement in its food regulation apparatus posthaste.
Without these measures, GST 2.0 will likely be bad news for India's dietary and nutritional ambitions.
In science, paradoxes often appear when familiar rules are pushed into unfamiliar territory. One of them is Parrondo's paradox, a curious mathematical result showing that when two losing strategies are combined, they can produce a winning outcome. This might sound like trickery but the paradox has deep connections to how randomness and asymmetry interact in the physical world. In fact its roots can be traced back to a famous thought experiment explored by the US physicist Richard Feynman, who analysed whether one could extract useful work from random thermal motion. The link between Feynman's thought experiment and Parrondo's paradox demonstrates how chance can be turned into order when the conditions are right.
Imagine two games. Each game, when played on its own, is stacked against you. In one, the odds are slightly less than fair, e.g. you win 49% of the time and lose 51%. In another, the rules are even more complex, with the chances of winning and losing depending on your current position or capital. If you keep playing either game alone, the statistics say you will eventually go broke.
But then there's a twist. If you alternate the games – sometimes playing one, sometimes the other – your fortune can actually grow. This is Parrondo's paradox, proposed in 1996 by the Spanish physicist Juan Parrondo.
The answer to how combining losing games can result in a winning streak lies in how randomness interacts with structure. In Parrondo's games, the rules are not simply fair or unfair in isolation; they have hidden patterns. When the games are alternated, these patterns line up in such a way that random losses become rectified into net gains.
Say there's a perfectly flat surface in front of you. You place a small bead on it and then you constantly jiggle the surface. The bead jitters back and forth. Because the noise you're applying to the bead's position is unbiased, the bead simply wanders around in different directions on the surface. Now, say you introduce a switch that alternates the surface between two states. When the switch is ON, an ice-tray shape appears on the surface. When the switch is OFF, it becomes flat again. This ice-tray shape is special: the cups are slightly lopsided because there's a gentle downward slope from left to right in each cup. At the right end, there's a steep wall. If you're jiggling the surface when the switch is OFF, the bead diffuses a little towards the left, a little towards the right, and so on. When you throw the switch to ON, the bead falls into the nearest cup. Because each cup is slightly tilted towards the right, the bead eventually settles near the steep wall there. Then you move the switch to OFF again.
As you repeat these steps with more and more beads over time, you'll see they end up a little to the right of where they started. This is Parrondo's paradox. The jittering motion you applied to the surface caused each bead to move randomly. The switch you used to alter the shape of the surface allowed you to expend some energy in order to rectify the beads' randomness.
The reason why Parrondo's paradox isn't just a mathematical trick lies in physics. At the microscopic scale, particles of matter are in constant, jittery motion because of heat. This restless behaviour is known as Brownian motion, named after the botanist Robert Brown, who observed pollen grains dancing erratically in water under a microscope in 1827. At this scale, randomness is unavoidable: molecules collide, rebound, and scatter endlessly.
Scientists have long wondered whether such random motion could be tapped to extract useful work, perhaps to drive a microscopic machine. This was Feynman's thought experiment as well, involving a device called the Brownian ratchet, a.k.a. the Feynman-Smoluchowski ratchet. The Polish physicist Marian Smoluchowski dreamt up the idea in 1912, and Feynman popularised it in a lecture 50 years later, in 1962.
Picture a set of paddles immersed in a fluid, constantly jolted by Brownian motion. A ratchet and pawl mechanism is attached to the paddles. The ratchet allows the paddles to rotate in one direction but not the other. It seems plausible that the random kicks from molecules would turn the paddles, which the ratchet would then lock into forward motion. Over time, this could spin a wheel or lift a weight.
In one of his famous physics lectures in 1962, Feynman analysed the ratchet. He showed that the pawl itself would also be subject to Brownian motion. It would jiggle, slip, and release under the same thermal agitation as the paddles. When everything is at the same temperature, the forward and backward slips would cancel out and no net motion would occur.
This insight was crucial: it preserved the rule that free energy can't be extracted from randomness at equilibrium. If motion is to be biased in only one direction, there needs to be a temperature difference between different parts of the ratchet. In other words, random noise alone isn't enough: you also need an asymmetry, or what physicists call nonequilibrium conditions, to turn randomness into work.
Let's return to Parrondo's paradox now. The paradoxical games are essentially a discrete-time abstraction of Feynman's ratchet. The losing games are like unbiased random motion: fluctuations that on their own can't produce net gain because the gains become cancelled out. But when they're alternated cleverly, they mimic the effect of adding asymmetry. The combination rectifies the randomness, just as a physical ratchet can rectify the molecular jostling when a gradient is present.
This is why Parrondo explicitly acknowledged his inspiration from Feynman's analysis of the Brownian ratchet. Where Feynman used a wheel and pawl to show how equilibrium noise can't be exploited without a bias, Parrondo created games whose hidden rules provided the bias when they were combined. Both cases highlight a universal theme: randomness can be guided to produce order.
The implications of these ideas extend well beyond thought experiments. Inside living cells, molecular motors like kinesin and myosin actually function like Brownian ratchets. These proteins move along cellular tracks by drawing energy from random thermal kicks with the aid of a chemical energy gradient. They demonstrate that life itself has evolved ways to turn thermal noise into directed motion by operating out of equilibrium.
Parrondo's paradox also has applications in economics, evolutionary biology, and computer algorithms. For example, alternating between two investment strategies, each of which is poor on its own, may yield better long-term outcomes if the fluctuations in markets interact in the right way. Similarly, in genetics, when harmful mutations alternate in certain conditions, they can produce beneficial effects for populations. The paradox provides a framework to describe how losing at one level can add up to winning at another.
Feynman's role in this story is historical as well as philosophical. By dissecting the Brownian ratchet, he demonstrated how deeply the laws of thermodynamics constrain what's possible. His analysis reminded physicists that intuition about randomness can be misleading and that only careful reasoning could reveal the real rules.
In 2021, a group of scientists from Australia, Canada, France, and Germany wrote in Cancers that the mathematics of Parrondo's paradox could also illuminate the biology of cancerous tumours. Their starting point was the observation that cancer cells behave in ways that often seem self-defeating: they accumulate genetic and epigenetic instability, devolve into abnormal states, sometimes stop dividing altogether, and often migrate away from their original location and perish. Each of these traits looks like a "losing strategy" – yet cancers that use these "strategies" together are often persistent.
The group suggested that the paradox arises because cancers grow in unstable, hostile environments. Tumour cells deal with low oxygen, intermittent blood supply, attacks by the immune system, and toxic drugs. In these circumstances, no single survival strategy is reliable. A population of only stable tumour cells would be wiped out when the conditions change. Likewise a population of only unstable cells would collapse under its own chaos. But by maintaining a mix, the group contended, cancers achieve resilience. Stable, specialised cells can exploit resources efficiently while unstable cells with high plasticity constantly generate new variations, some of which could respond better to future challenges. Together, the team continued, the cancer can alternate between the two sets of cells so that it can win.
The scientists also interpreted dormancy and metastasis of cancers through this lens. Dormant cells are inactive and can lie hidden for years, escaping chemotherapy drugs that are aimed at cells that divide. Once the drugs have faded, they restart growth. While a migrating cancer cell has a high chance of dying off, even one success can seed a tumor in a new tissue.
On the flip side, the scientists argued that cancer therapy can also be improved by embracing Parrondo's paradox. In conventional chemotherapy, doctors repeatedly administer strong drugs, creating a strategy that often backfires: the therapy kills off the weak, leaving the strong behind – but in this case the strong are the very cells you least want to survive. By contrast, adaptive approaches that alternate periods of treatment with rest or that mix real drugs with harmless lookalikes could harness evolutionary trade-offs inside the tumor and keep it in check. Just as cancer may use Parrondo's paradox to outwit the body, doctors may one day use the same paradox to outwit cancer.
On August 6, physicists from Lanzhou University in China published a paper in Physical Review E discussing just such a possibility. They focused on chemotherapy, which is usually delivered in one of two main ways. The first, called the maximum tolerated dose (MTD), uses strong doses given at intervals. The second, called low-dose metronomic (LDM), uses weaker doses applied continuously over time. Each method has been widely tested in clinics and each one has drawbacks.
MTD often succeeds at first by rapidly killing off drug-sensitive cancer cells. In the process, however, it also paves the way for the most resistant cancer cells to expand, leading to relapse. LDM on the other hand keeps steady pressure on a tumor but can end up either failing to control sensitive cells if the dose is too low or clearing them so thoroughly that resistant cells again dominate if the dose is too strong. In other words, both strategies can be losing games in the long run.
The question the study's authors asked was whether combining these two flawed strategies in a specific sequence could achieve better results than deploying either strategy on its own. This is the sort of situation Parrondo's paradox describes, even if not exactly. While the paradox is concerned with combining outright losing strategies, the study has discussed combining two ineffective strategies.
To investigate, the researchers used mathematical models that treated tumors as ecosystems containing three interacting populations: healthy cells, drug-sensitive cancer cells, and drug-resistant cancer cells. They applied equations from evolutionary game theory that tracked how the fractions of these groups shifted in different conditions.
The models showed that in a purely MTD strategy, the resistant cells soon took over, and in a purely LDM strategy, the outcomes depended strongly on drug strength but still ended badly. But when the two schedules were alternated, the tumor behaved differently. The more sensitive cells were suppressed but not eliminated while their persistence prevented the resistant cells from proliferating quickly. The team also found that the healthy cells survived longer.
Of course, tumours are not well-mixed soups of cells; in reality they have spatial structure. To account for this, the team put together computer simulations where individual cells occupied positions on a grid; grew, divided or died according to fixed rules; and interacted with their neighbours. This agent-based approach allowed the team to examine how pockets of sensitive and resistant cells might compete in more realistic tissue settings.
Their simulations only confirmed the previous set of results. A therapeutic strategy that alternated between MTD and LDM schedules extended the amount of time before the resistant cells took over and while the healthy cells dominated. When the model started with the LDM phase in particular, the sensitive cancer cells were found to compete with the resistant cancer cells and the arrival of the MTD phase next applied even more pressure on the latter.
This is an interesting finding because it suggests that the goal of therapy may not always be to eliminate every sensitive cancer cell as quickly as possible but, paradoxically, that sometimes it may be wiser to preserve some sensitive cells so that they can compete directly with resistant cells and prevent them from monopolising the tumor. In clinical terms, alternating between high- and low-dose regimens may delay resistance and keep tumours tractable for longer periods.
Then again, this is cancer – the "emperor of all maladies" – and in silico evidence from a physics-based model is only the start. Researchers will have to test it in real, live tissue in animal models (or organoids) and subsequently in human trials. They will also have to assess whether certain cancers, followed by a specific combination of drugs for those cancers, will benefit more (or less) from taking the Parrondo's paradox way.
[University of London mathematical oncologist Robert] Noble ... says that the method outlined in the new study may not be ripe for a real-world clinical setting. "The alternating strategy fails much faster, and the tumor bounces back, if you slightly change the initial conditions," adds Noble. Liu and colleagues, however, plan to conduct in vitro experiments to test their mathematical model and to select regimen parameters that would make their strategy more robust in a realistic setting.
Union finance minister Nirmala Sitharaman announced sweeping changes to the GST rates on September 3. However, I think the rate for software services (HSN 99831) will remain unchanged at 18%. This is a bummer because every time I renew my WordPress.com site or purchase software over the internet in rupees, the total cost increases by almost a fifth.
The disappointment is compounded by the fact that WordPress.com and many other software service providers offer adjusted rates for users in India in order to offset the country's lower purchasing power per capita. For example, the lowest WordPress and Ghost plans by WordPress.com and MagicPages.co, respectively, cost $4 and $12 a month. But for users in India, the WordPress.com plan costs Rs 200 a month while MagicPages.co offers a Rs 450 per month plan, both with the same feature set – a big difference. The 18% GST however wipes out some, not all, of these gains.
Paying for software services over the internet when they're billed in dollars rather than rupees isn't much different. While GST doesn't apply, the rupee-to-dollar rate has become abysmal. [Checks] Rs 88.14 to the dollar at 11 am. Ugh.
I also hoped for a GST rate cut on software services because if content management software in particular becomes more affordable, more people would be able to publish on the internet.
Up through V6, the exec system call and family of system calls took
two arguments, the path and the argument list; we can see this in
both the V6 exec(2) manual page and
the implementation of the system call in the kernel. As
bonus trivia, it appears that the V6 exec() limited you to 510
characters of arguments (and probably V1 through V5 had a similarly
low limit, but I haven't looked at their kernel code).
In V7, the exec(2) manual page now
documents a possible third argument, and the kernel implementation
is much more complex, plus
there's an environ(5) manual page about it.
Based on h/param.h, V7
also had a much higher size limit on the combined size of arguments
and environment variables, which isn't all that surprising given
the addition of the environment. Commands like login.c were
updated to put some things into the new environment; login sets a
default $PATH and a $HOME, for example, and environ(5)
documents various other uses (which I haven't checked in the source
code).
This implies that the V7 shell is where $PATH first appeared in
Unix, where the manual page
describes it as 'the search path for commands'. This might make you
wonder how the V6 shell handled
locating commands, and where it looked for them. The details are
helpfully documented in the V6 shell manual page,
and I'll just quote what it has to say:
If the first argument is the name of an executable file, it is
invoked; otherwise the string `/bin/' is prepended to the argument.
(In this way most standard commands, which reside in `/bin', are
found.) If no such command is found, the string `/usr' is further
prepended (to give `/usr/bin/command') and another attempt is made to
execute the resulting file. (Certain lesser-used commands live in
`/usr/bin'.)
('Invoked' here is carrying some extra freight, since this may not
involve a direct kernel exec of the file. An executable file that
the kernel didn't like would be directly run by the shell.)
I suspect that '$PATH' was given such a short name (instead of a
longer, more explicit one) simply as a matter of Unix style at the
time. Pretty much everything in V7 was terse and short in this style
for various reasons, and verbose environment variable names would
have reduced that limited exec argument space.
There's a certain sort of person who feels that the platonic ideal
of Unix is somewhere around Research Unix V7 and it's almost all
been downhill since then (perhaps with the exception of further
Research Unixes and then Plan 9, although very few people got their
hands on any of them). For all that I like Unix and started using
it long ago when it was simpler (although not as far back as V7),
I reject this view and think it's completely mistaken.
(Some of the needs driving Unix's growth were for features and some were for
performance. The original V7 filesystem was quite simple but also
suffered from performance issues, ones that often got worse over
time.)
I'll agree that the path that the growth of Unix has taken since
V7 is not necessarily ideal; we can all point to various things
about modern Unixes that we don't like. Any particular flaws came
about partly because people don't necessarily make ideal decisions
and partly because we haven't necessarily had perfect understandings
of the problems when people had to do something, and then once
they'd done something they were constrained by backward compatibility.
(In some ways Plan 9 represents 'Unix without the constraint of
backward compatibility', and while I think there are a variety of
reasons that it failed to catch on in the world, that lack of
compatibility is one of them. Even if you had access to Plan 9, you
had to be fairly dedicated to do your work in a Plan 9 environment
(and that was before the web made it worse).)
PS: It's my view that the people who are pushing various Unixes
forward aren't incompetent, stupid, or foolish. They're rational
and talented people who are doing their best in the circumstances
that they find themselves. If you want to throw stones, don't throw
them at the people, throw them at the overall environment that
constrains and shapes how everything in this world is pushed to
evolve. Unix is far from the only thing shaped in potentially
undesirable ways by these forces; consider, for example, C++.
(It's also clear that a lot of people involved in the historical
evolution of BSD and other Unixes were really quite smart, even if
you don't like, for example, the BSD sockets API.)
Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS
client that's now not responding, good luck writing to the file from
any other machine. Your writes will hang.
NFS v4 delegations are
a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular
file to a client that is using that file. There are various
sorts of delegations, but even a basic read delegation will force
the NFS server to recall the delegation if anything else wants
to write to the file or to remove it. Recalling a delegation requires
notifying the NFS v4 client that it has lost the delegation and
then having the client accept and respond to that. NFS v4 clients
have to respond to the loss of a delegation because they may be
holding local state that needs to be flushed back to the NFS server
before the delegation can be released.
(After all, the NFS v4 server promised the client 'this file is yours
to fiddle around with, I will consult you before touching it'.)
Under some circumstances, when the NFS v4 server is unable to contact
the NFS v4 client, it will simply sit there waiting and as part of
that will not allow you to do things that require the delegation
to be released. I don't know if there's a delegation recall timeout,
although I suspect that there is, and I don't know how to find out
what the timeout is, but whatever the value is, it's substantial
(it may be the 90 second 'default lease time' from
nfsd4_init_leases_net(),
or perhaps the 'grace', also probably 90 seconds, or perhaps the
two added together).
(90 seconds is not what I consider a tolerable amount of time for
my editor to completely freeze when I tell it to write out a new
version of the file. When NFS is involved, I will typically assume
that something has gone badly wrong well before then.)
As mentioned, the NFS v4 RFC also explicitly notes that NFS v4
clients may have to flush file state in order to release their
delegation, and this itself may take some time. So even without an
unavailable client machine, recalling a delegation may stall for
some possibly arbitrary amount of time (depending on how the NFS
v4 server behaves; the RFC encourages NFS v4 servers to not be hasty
if the client seems to be making a good faith effort to clear its
state). Both the slow client recall and the hung client recall can
happen even in the absence of any actual file locks; in my case,
the now-unavailable client merely having read from the file was
enough to block things.
This blocking recall is effectively a mandatory lock, and it affects
both remote operations over NFS and local operations on the fileserver
itself. Short of waiting out whatever timeout applies, you have two
realistic choices to deal with this (the non-realistic choice is
to reboot the fileserver). First, you can bring the NFS client back
to life, or at least something that's at its IP address and responds
to the server with NFS v4 errors. Second, I believe you can force
everything from the client to expire through /proc/fs/nfsd/clients/<ID>,
by writing 'expire' to the client's 'ctl' file. You can find the
right client ID by grep'ing for something in all of the clients/*/info
files.
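(In concrete terms this is something like:
# grep -l 192.0.2.50 /proc/fs/nfsd/clients/*/info
/proc/fs/nfsd/clients/42/info
# echo expire > /proc/fs/nfsd/clients/42/ctl
where the IP address and the client ID are made up for illustration.)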
Discovering this makes me somewhat more inclined than before to
consider entirely disabling 'leases', the underlying kernel feature
that is used to implement these NFS v4 delegations (I discovered
how to do this when investigating NFS v4 client locks on the
server). This will also affect local
processes on the fileserver, but that now feels like a feature since
hung NFS v4 delegation recalls will stall or stop even local
operations.
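(If I'm remembering the knob correctly, disabling leases is a sysctl,
'sysctl -w fs.leases-enable=0', which you'd want to make persistent in
/etc/sysctl.d/ if you decide to commit to it.)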
Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted
from a Linux NFS fileserver, and at some point
you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then
be confused and angered:
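$ ls -ld /h/somedir
ls: /h/somedir: Permission denied
drwx--x--x 11 someone somegroup 25 Sep 10 12:34 /h/somedir
(That's a reconstruction rather than pasted output, but it has the
right shape: ls complains about permissions on something that a plain
'ls -ld' should be perfectly able to show you, and then shows it
anyway.)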
(There are situations where this doesn't happen or doesn't repeat,
which I don't understand but which I'm assuming are NFS caching in
action.)
If you apply strace to the problem, you'll find that the failing
system call is listxattr(2), which
is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes
from Coreutils, and
Coreutils apparently started using listxattr() in version 9.4.
The Linux NFS v4 code supports extended attributes (xattrs), which
are from RFC 8276; they're
supported in both the client and the server since mid-2020 if I'm
reading git logs correctly. Both the normal Ubuntu 22.04 LTS and
24.04 LTS server kernels are recent enough to include this support
on both the server and clients, and I don't believe there's any way
to turn them off by themselves in the kernel server (although if you disable
NFS v4.2 they may disappear too).
However, the NFS v4 server doesn't treat listxattr() operations the
way the kernel normally does. Normally, the kernel will let you do
listxattr() on an object (a directory, a file, etc) that you don't
have read permissions on, just as it will let you do stat() on it.
However, the NFS v4 server code specifically requires that you have
read access to the object. If you don't, you get EACCES (no second
S).
(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c,
specifically in the fh_verify() call that uses NFSD_MAY_READ
instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)
Normally we'd have found this last year, but we've been slow to
roll out Ubuntu 24.04 LTS machines and apparently until now no one
ever did a 'ls -l' of unreadable things on one of them (well, on a
NFS mounted filesystem).
(This elaborates on a Fediverse post. Our patch is
somewhat different than the official one.)
I've used OpenZFS on my office and home desktops (on Linux) for
what is a long time now, and over that time I've consistently used
the development version of OpenZFS, updating to the latest git tip
on a regular basis (cf). There have been
occasional issues but I've said, and continue to say, that the code
that goes into the development version is generally well tested and
I usually don't worry too much about it. But I do worry somewhat,
and I do things like read every commit message for the development
version and I sometimes hold off on
updating my version if a particular significant change has recently
landed.
But, well, sometimes things go wrong in a development version. As
covered in Rob Norris's An (almost) catastrophic OpenZFS bug and
the humans that made it (and Rust is here too)
(via), there
was a recently discovered bug in the development version of OpenZFS
that could or would have corrupted RAIDZ vdevs. When I saw the
fix commit
go by in the development version, I felt extremely lucky that I use
mirror vdevs, not raidz, and so avoided being affected by this.
(While I might have detected this at the first scrub after some
data was corrupted, the data would have been gone and at a minimum
I'd have had to restore it from backups. Which I don't currently
have on my home desktop.)
In general this is a pointed reminder that the development version
of OpenZFS isn't perfect, no matter how long I and other people
have been lucky with it. You might want to think twice before
running the development version in order to, for example, get support
for the very latest kernels that are used by distributions like
Fedora. Perhaps you're better off delaying your kernel upgrades a
bit longer and sticking to released branches.
I don't know if this is going to change my practices around running
the development version of OpenZFS on my desktops. It may make me
more reluctant to update to the very latest version on my home
desktop; it would be straightforward to have that run only time-delayed
versions of what I've already run through at least one scrub cycle
on my office desktop (where I have backups). And I probably won't
switch to the next release version when it comes out, partly because
of kernel support issues.
I recently read systemd has been a complete, utter, unmitigated
success
(via
among other places), where I found a mention of an interesting
systemd piece that I'd previously been unaware of, systemd-socket-proxyd.
As covered in the article, the major purpose of systemd-socket-proxyd
is to bridge between systemd dynamic socket activation and a
conventional program that listens on some socket, so that you can
dynamically activate the program when a connection comes in.
Unfortunately the systemd-socket-proxyd manual page is a little bit
opaque about how it works for this purpose (and what the limitations
are). Even though I'm familiar with systemd stuff, I had to think
about it for a bit before things clicked.
A systemd socket unit
activates the corresponding service unit when a connection comes
in on the socket. For simple services that are activated separately
for each connection (with 'Accept=yes'),
this is actually a templated unit,
but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be
a single .service unit. When systemd activates this unit, it will
pass the socket to it either through systemd's native mechanism
or an inetd-compatible mechanism using standard input.
If your listening program supports either mechanism, you don't need
systemd-socket-proxyd and your life is simple. But plenty of
interesting programs don't; they expect to start up and bind to
their listening socket themselves. To work with these programs,
systemd-socket-proxyd accepts a socket (or several) from systemd
and then proxies connections on that socket to the socket your
program is actually listening to (which will not be the official
socket, such as port 80 or 443).
All of this is perfectly fine and straightforward, but the question
is, how do we get our real program to be automatically started when
a connection comes in and triggers systemd's socket activation?
The answer, which isn't explicitly described in the manual page but
which appears in the examples, is that we make the socket's .service
unit (which will run systemd-socket-proxyd) also depend on the
.service unit for our real service with a 'Requires=' and an 'After='.
When a connection comes in on the main socket that systemd is doing
socket activation for, call it 'fred.socket', systemd will try to
activate the corresponding .service unit, 'fred.service'. As it does
this, it sees that fred.service depends on 'realthing.service' and
must be started after it, so it will start 'realthing.service' first.
Your real program will then start, bind to its local socket, and then
have systemd-socket-proxyd proxy the first connection to it.
To automatically stop everything when things are idle, you set
systemd-socket-proxyd's --exit-idle-time
option and also set StopWhenUnneeded=true
on your program's real service unit ('realthing.service' here).
Then when systemd-socket-proxyd is idle for long enough, it will
exit, systemd will notice that the 'fred.service' unit is no longer
active, see that there's nothing that needs your real service unit
any more, and shut that unit down too, causing your real program
to exit.
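To make this concrete, here is a minimal sketch of the three units
involved, using the hypothetical 'fred' and 'realthing' names from
above; the port, paths, and the realthing command line are made up,
and the real nginx-based example is in the systemd-socket-proxyd
manual page.

# fred.socket -- systemd listens on the public port for us
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# fred.service -- started on the first connection, runs the proxy
[Unit]
Requires=realthing.service
After=realthing.service

[Service]
# --exit-idle-time needs a reasonably recent systemd
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=5min 127.0.0.1:8080

# realthing.service -- your actual program, on a private local port
[Unit]
StopWhenUnneeded=true

[Service]
ExecStart=/usr/local/sbin/realthing --listen 127.0.0.1:8080

With all of this in place you 'systemctl enable --now fred.socket'
and leave fred.service and realthing.service to be started on demand.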
The obvious limitation of using systemd-socket-proxyd is that your
real program no longer knows the actual source of the connection.
If you use systemd-socket-proxyd to relay HTTP connections on port
80 to an nginx instance that's activated on demand (as shown in the
examples in the systemd-socket-proxyd manual page), that nginx
sees and will log all of the connections as local ones. There are
usage patterns where this information will be added by something
else (for example, a frontend server that is a reverse proxy to
a bunch of activated on demand backend servers), but otherwise you're out of
luck as far as I know.
Another potential issue is that systemd's idea of when the .service
unit for your real program has 'started', and thus when it can start
running systemd-socket-proxyd, may not match when your real program
actually gets around to setting up its socket. I don't know if
systemd-socket-proxyd will wait and try a bit to cope with the
situation where it gets started a bit faster than your real program
can get its socket ready.
(Systemd has ways that your real program can signal readiness, but
if your program can use these ways it may well also support being
passed sockets from systemd as a direct socket activated thing.)
Linux's NFS export handling system has a very convenient option
where you don't have to put all of your exports into one file,
/etc/exports, but can instead write them into a bunch of separate
files in /etc/exports.d. This is very convenient for allowing you
to manage filesystem exports separately from each other and to add,
remove, or modify only a single filesystem's exports. Also, one of
the things that exportfs(8) can do
is 'reexport' all current exports, synchronizing the system state
to what is in /etc/exports and /etc/exports.d; this is 'exportfs
-r', and is a handy thing to do after you've done various manipulations
of files in /etc/exports.d.
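(As an illustration, a hypothetical /etc/exports.d/data.exports
might hold one filesystem's exports, along the lines of:

/data/fs1   nfsclient.example.org(rw,mountpoint,no_subtree_check)

with 'exportfs -r' run afterward to re-synchronize the kernel's
export list whenever such a file is added, changed, or removed.
The file name, path, and client here are made up.)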
Although it's not documented and not explicit in 'exportfs -v -r'
(which will claim to be 'exporting ...' for various things), I have
an important safety tip which I discovered today: exportfs does
nothing on a re-export if you have any problems in your exports.
In particular, if any single file in /etc/exports.d has a problem,
no files from /etc/exports.d get processed and no exports are
updated.
One potential problem with such files is syntax errors, which is
fair enough as a 'problem'. But another problem is that they refer
to directories that don't exist, for example because you have
lingering exports for a ZFS pool that you've temporarily exported
(which deletes the directories that the pool's filesystems may have
previously been mounted on). A missing directory is an error even
if the exportfs options include 'mountpoint', which only does the
export if the directory is a mount point.
When I stubbed my toe on this I was surprised. What I'd vaguely
expected was that the error would cause only the particular file
in /etc/exports.d to not be processed, and that it wouldn't be a
fatal error for the entire process. Exportfs itself prints no notices
about this being a fatal problem, and it will happily continue to
process other files in /etc/exports.d (as you can see with 'exportfs
-v -r' with the right ordering of where the problem file is) and
claim to be exporting them.
A variety of things in typical graphical desktop sessions communicate
through the use of environment variables; for example, X's $DISPLAY
environment variable. Somewhat famously, modern desktops run a
lot of things as systemd user units, and
it might be nice to do that yourself (cf). When you put these two
facts together, you wind up with a question, namely how the environment
works in systemd user units and what problems you're going to run
into.
The simplest case is using systemd-run to
run a user scope unit ('systemd-run --user --scope --'), for example
to run a CPU heavy thing with low priority. In this situation, the new scope
will inherit your entire current environment and nothing else. As
far as I know, there's no way to do this with other sorts of things
that systemd-run will start.
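(For example, something like this runs a CPU heavy command at low
priority in its own scope; the actual command is just a stand-in:

systemd-run --user --scope -- nice -n 19 make -j8

Since the scope inherits your full environment, this behaves the
same as running the command directly, and you could also attach
resource control properties to the scope with '-p' if you wanted.)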
Non-scope user units by default inherit their environment from your
user "systemd manager". I believe that there is always only a single
user manager for all sessions of a particular user, regardless of
how you've logged in. When starting things via 'systemd-run', you
can selectively pass environment variables from your current
environment with 'systemd-run --user -E <var> -E <var> -E ...'. If
the variable is unset in your environment but set in the user systemd
manager, this will unset it for the new systemd-run started unit.
As you can tell, this will get very tedious if you want to pass a
lot of variables from your current environment into the new unit.
You can manipulate your user "systemd manager environment block",
as systemctl
describes it in Environment Commands.
In particular, you can export current environment settings to it
with 'systemctl --user import-environment VAR VAR2 ...'. If you
look at this with 'systemctl --user show-environment', you'll see
that your desktop environment has pushed a lot of environment
variables into the systemd manager environment block, including
things like $DISPLAY (if you're on X). All of these environment
variables for X, Wayland, DBus, and so on are probably part of how
the assorted user units that are part of your desktop session talk
to the display and so on.
You may now see a little problem. What happens if you're logged in
with a desktop X session, and then you go elsewhere and SSH in to
your machine (maybe with X forwarding) and try to start a graphical
program as a systemd user unit? Since you only have a single systemd
manager regardless of how many sessions you have, the systemd user
unit you started from your SSH session will inherit all of the
environment variables that your desktop session set and it will
think it has graphics and open up a window on your desktop (which
is hopefully locked, and in any case it's not useful to you over
SSH). If you import the SSH session's $DISPLAY (or whatever) into
the systemd manager's environment, you'll damage your desktop
session.
For specific environment variables, you can override or remove them
with 'systemd-run --user -E ...' (for example, to override or remove
$DISPLAY). But hunting down all of the session environment variables
that may trigger undesired effects is up to you, making systemd-run's
user scope units by far the easiest way to deal with this.
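(For instance, to run a graphical program from an SSH session using
that session's display settings instead of whatever the manager has,
something like the following sketch, where the program is a stand-in:

systemd-run --user -E DISPLAY -E XAUTHORITY -- some-x-program

Here each '-E' passes that variable's value from your current
environment, overriding the manager's version for this unit.)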
(I don't know if there's something extra-special about scope units
that enables them and only them to be passed your entire environment,
or if this is simply a limitation in systemd-run that it doesn't
try to implement this for anything else.)
The reason I find all of this regrettable is that it makes putting
applications and other session processes into their own units much
harder than it should be. Systemd-run's scope units inherit your
session environment but can't be detached, so at a minimum you
have extra systemd-run processes sticking around (and putting
everything into scopes when some of them might be services is
unaesthetic). Other units can be detached but don't inherit your
environment, requiring assorted contortions to make things work.
PS: Possibly I'm missing something obvious about how to do this
correctly, or perhaps there's an existing helper that can be used
generically for this purpose.
If you read manual pages, such as Linux's errno(3), you'll
soon discover an important and peculiar seeming limitation of
looking at errno. To quote the Linux version:
The value in errno is significant only when the return value of the
call indicated an error (i.e., -1 from most system calls; -1 or NULL
from most library functions); a function that succeeds is allowed to
change errno. The value of errno is never set to zero by any system
call or library function.
This is also more or less what POSIX says in errno,
although in standards language that's less clear. All of this is a sign
of what has traditionally been going on behind the scenes in Unix.
The classical Unix approach to kernel system calls doesn't return
multiple values, for example the regular return value and errno.
Instead, Unix kernels have traditionally returned either a success
value or the errno value along with an indication of failure, telling
them apart in various ways (such as the PDP-11 return method). At the C library level, the simple approach taken
in early Unix was that system call wrappers only bothered to set
the C level errno if the kernel signaled an error. See, for
example, the V7 libc/crt/cerror.s
combined with libc/sys/dup.s,
where the dup() wrapper only jumps to cerror and sets errno if
the kernel signals an error. The system call wrappers could all
have explicitly set errno to 0 on success, but they didn't.
The next issue is that various C library calls may make a number
of system calls themselves, some of which may fail without the
library call itself failing. The classical case is stdio checking
to see whether stdout is connected to a terminal and so should be
line buffered, which was traditionally implemented by trying to do
a terminal-only ioctl() to the file descriptors, which would fail
with ENOTTY on non-terminal file descriptors. Even if stdio did a
successful write() rather than only buffering your output, the
write() system call wrapper wouldn't change the existing ENOTTY
errno value from the failed ioctl(). So you can have a fwrite()
(or printf() or puts() or other stdio call) that succeeds while
'setting' errno to some value such as ENOTTY.
When ANSI C and POSIX came along, they inherited this existing
situation and there wasn't much they could do about it (POSIX
was mostly documenting existing practice). I believe
that they also wanted to allow a situation where POSIX functions
were implemented on top of whatever oddball system calls you wanted
to have your library code do, even if they set errno. So the only
thing POSIX could really require was the traditional Unix behavior
that if something failed and it was documented to set errno on
failure, you could then look at errno and have it be meaningful.
(This was what existing Unixes were already mostly doing and
specifying it put minimal constraints on any new POSIX environments,
including POSIX environments on top of other operating systems.)
Broadly, there have been three approaches to command history in
Unix shells. In the beginning there was none, which was certainly
simple but which led people to be unhappy. Then csh gave us in-memory
command history, which could be recalled and edited with shell
builtins like '!!' but which lasted only as long as that shell
process did. Finally, people started putting 'readline style'
interactive command editing into shells, which included some history
of past commands that you could get back with cursor-up, and picked
up the GNU Readline feature of a $HISTORY file. Broadly speaking,
the shell would save the in-memory (readline) history to $HISTORY
when it exited and load the in-memory (readline) history from
$HISTORY when it started.
I use a reimplementation of rc, the shell created by Tom Duff, and my version of the shell started
out with a rather different and more minimal mechanism for history.
In the initial release of this rc, all the shell itself did was
write every command executed to $history (if that variable was
set). Inspecting and reusing commands from a $history file was
left up to you, although rc provided a helper program that could
be used in a variety of ways. For
example, in a terminal window I commonly used '-p' to print the
last command and then either copied and pasted it with the mouse
or used an rc function I wrote to repeat it directly.
(You didn't have to set $history to the same file in every
instance of rc. I arranged to have a per-shell history file that
was removed when the shell exited, because I was only interested
in short term 'repeat a previous command' usage of history.)
Later, the version of rc that I use got support for GNU Readline
and other line editing environments (and I started using it). GNU Readline maintains its own in-memory command
history, which is used for things like cursor-up to the previous
line. In rc, this in-memory command history is distinct from the
$history file history, and things can get confusing if you mix
the two (for example, cursor-up to an invocation of your 'repeat
the last command' function won't necessarily repeat the command you
expect).
It turns out that at least for GNU Readline, the current implementation
in rc does the obvious thing; if $history is set when rc
starts, the commands from it are read into GNU Readline's in-memory
history.
This is one half of the traditional $HISTORY behavior. Rc's current
GNU Readline code doesn't attempt to save its in-memory history
back to $history on exit, because if $history is set the regular
rc code has already been recording all of your commands there. Rc
otherwise has no shell builtins to manipulate GNU Readline's command
history, because GNU Readline and other line editing alternatives are
just optional extra features that have relatively minimal hooks into
the core of rc.
(In theory this allows thenshell to
inject a synthetic command history into rc on startup, but it
requires thenshell to know exactly how I handle my per-shell
history file.)
Sidebar: How I create per-shell history in this version of rc
The version of rc that I use
doesn't have an 'initialization' shell function that runs when the
shell is started, but it does support a 'prompt' function that's run
just before the prompt is printed. So my prompt function keeps track
of the 'expected shell PID' in a variable and compares it to the
actual PID. If there's a mismatch (including the variable being
unset), the prompt function goes through a per-shell initialization,
including setting up my per-shell $history value.
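As a rough sketch, this looks something like the following, assuming
the reimplementation's 'fn prompt' hook and its $pid variable; the
variable name and history file location are made up, and cleaning up
the history file when the shell exits is left out:

fn prompt {
	if (! ~ $myshellpid $pid) {
		myshellpid=$pid
		history=/tmp/.rchist.$pid
		# ... other per-shell initialization ...
	}
}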
Suppose, not hypothetically, that you have
a central CUPS print server, and that people also have Linux desktops
or laptops that they point at your print server to print to your
printers. As of at least Ubuntu 24.04, if you're doing this you
probably want to get people to turn off and disable cups-browsed on their machines.
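(On a typical systemd based client that's something like:

systemctl disable --now cups-browsed.service

assuming the unit is called cups-browsed.service, as it is on current
Ubuntu.)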
If you don't, your central print server may see a constant flood
of connections from client machines running cups-browsed. You're
probably running it, as I believe that cups-browsed is installed
and activated by default these days in most desktop Linux environments.
(We didn't really notice this in prior Ubuntu versions, although
it's possible cups-browsed was always doing something like this
and what's changed in the Ubuntu 24.04 version is that it's doing
it more and faster.)
I'm not entirely sure why this happens, and I'm also not sure what
the CUPS requests typically involve, but one pattern that we see
is that such clients will make a lot of requests to the CUPS server's
/admin/ URL. I'm not sure what's in these requests, because CUPS
immediately rejects them as unauthenticated. Another thing we've
seen is frequent attempts to get printer attributes for printers
that don't exist and that have name patterns that look like local
printers. One of the reasons that the clients are hitting the /admin/
endpoint may be to somehow add these printers to our CUPS server,
which is definitely not going to work.
(We've also seen signs that some Ubuntu 24.04 applications can
repeatedly spam the CUPS server, probably with status requests for
printers or print jobs. This may be something enabled or encouraged
by cups-browsed.)
My impression is that modern Linux desktop software, things like
cups-browsed included, is not really spending much time thinking
about larger scale, managed Unix environments where there are a
bunch of printers (or at least print queues), the 'print server'
is not on your local machine and not run by you, anything random
you pick up through broadcast on the local network is suspect, and
so on. I broadly sympathize with this, because such environments
are a small minority now, but it would be nice if client side CUPS
software didn't cause problems in them.
(I suspect that cups-browsed and its friends are okay in an environment
where either the 'print server' is local or it's operated by you
and doesn't require authentication, there's only a few printers,
everyone on the local network is friendly and if you see a printer
it's definitely okay to use it, and so on. This describes a lot of
Linux desktop environments, including my home desktop.)
I recently wrote about how the X Window System didn't immediately
have (thin client) X terminals. X terminals
are now a relatively obscure part of history and it may not be
obvious to people today why they were a relatively significant deal
at the time. So today I'm going to add some additional notes about
X terminals in their heyday, from their introduction around 1989
through the mid 1990s.
One of the reactions to my entry that I've seen is to wonder if
there was much point to X terminals, since it seems like they should
have cost close to what much more functional normal computers did,
with all you'd save being perhaps storage. Practically this wasn't
the case in 1989 when they were introduced; NCD's initial models
cost substantially less than, say, a Sparcstation 1 (also introduced
in 1989), apparently less than half the cost of even a diskless
Sparcstation 1.
I believe that one reason for this is that memory was comparatively
more expensive in those days and X terminals could get away with
much, much less of it, since they didn't need to run a Unix kernel
and enough of a Unix user space to boot up the X server (and I
believe that some or all of the software was run directly from ROM
instead of being loaded into precious RAM).
(The NCD16
apparently started at 1 MByte of RAM and the NCD19 at 2 MBytes,
for example. You could apparently get a Sparcstation 1 with that
little memory but you probably didn't want to use it for much.)
In one sense, early PCs were competition for X terminals in that
they put computation on people's desks, but in another sense they
weren't, because you couldn't use them as an inexpensive way to get
Unix on people's desks. There eventually was at least one piece of
software for this, DESQview/X, but it appeared
later and you'd have needed to also buy the PC to run it on, as
well as a 'high resolution' black and white display card and monitor.
Of course, eventually the march of PCs made all of that cheap, which was part of the diminishing interest
in X terminals in the later part of the 1990s and onward.
(I suspect that one reason that X terminals had lower hardware costs
was that they probably had what today we would call a 'unified
memory system', where the framebuffer's RAM was regular RAM instead
of having to be separate because it came on a separate physical
card.)
You might wonder how well X terminals worked over the 10 MBit
Ethernet that was all you had at the time. With the right programs
it could work pretty well, because the original approach of X was
that you sent drawing commands to the X server, not rendered
bitmaps. If you were using things
that could send simple, compact rendering commands to your X terminal,
such as xterm, 10M Ethernet could be perfectly okay. Anything that
required shipping bitmapped graphics could be not as impressive,
or even not something you'd want to touch, but for what you typically
used monochrome X for between 1989 and 1995 or so, this was generally
okay.
(Today many things on X want to ship bitmaps around, even for things
like displaying text. But back in the day text was shipped as, well,
text, and it was the X server that rendered the fonts.)
When looking at the servers you'd need for a given number of diskless
Unix workstations or X terminals, the X terminals required less
server side disk space but potentially more server side memory and
CPU capacity, and were easier to administer. As noted by some
commentators here,
you might also save on commercial software licensing costs if you
could license it only for your few servers instead of your lots of
Unix workstations. I don't know how the system administration load
actually compared to a similar number of PCs or Macs, but in my
Unix circles we thought we scaled much better and could much more
easily support many seats (and many potential users if you had, for
example, many more students than lab desktops).
My perception is that what killed off X terminals as particularly
attractive, even for Unix places, was that on the one hand the extra
hardware capabilities PCs needed over X terminals kept getting
cheaper and cheaper and on the other hand people started demanding
more features and performance, like decent colour displays. That
brought the X terminal 'advantage' more or less down to easier
administration, and in the end that wasn't enough (although some X
terminals and X 'thin client' setups clung on quite late, eg the
SunRay, which we had some of in the 2000s).
(We ran what were effectively
X terminals quite late, but the last few generations were basic PCs
running LTSP not dedicated hardware. All our
Sun Rays got retired well before the LTSP machines.)
(I think that the 'personal computer' model has or at least had
some significant pragmatic advantages over the 'terminal' model,
but that's something for another entry.)
Back in the early days of GPU computation, the hardware, drivers,
and software were so relatively untrustworthy that our early GPU
machines had to be specifically reserved by people and that reservation
gave them the ability to remotely power cycle the machine to recover
it (this was in the days before our SLURM cluster). Things have gotten much better since
then, with things like hardware and driver changes so that programs
with bugs couldn't hard-lock the GPU hardware. But every so often
we run into odd failures where something funny is going on that we
don't understand.
We have one particular SLURM GPU node that has been flaky for a
while, with the specific issue being that every so often the NVIDIA
GPU would throw up its hands and drop off the PCIe bus until we
rebooted the system. This didn't happen every time it was used, or
with any consistent pattern, although some people's jobs seemed to
regularly trigger this behavior. Recently I dug up a simple to use
GPU stress test program, and
when this machine's GPU did its disappearing act this Saturday, I
grabbed the machine, rebooted it, ran the stress test program, and
promptly had the GPU disappear again. Success, I thought, and since
it was Saturday, I stopped there, planning to repeat this process
today (Monday) at work, while doing various monitoring things.
Since I'm writing a Wandering Thoughts entry about it,
you can probably guess the punchline. Nothing has changed on this
machine since Saturday, but all today the GPU stress test program
could not make the GPU disappear. Not with the same basic usage I'd
used Saturday, and not with a different usage that took the GPU to
full power draw and a reported temperature of 80C (which was a
higher temperature and power draw than the GPU had been at when it
disappeared, based on our Prometheus metrics). If I'd been unable
to reproduce the failure at all with the GPU stress program, that
would have been one thing, but reproducing it once and then not
again is just irritating.
(The machine is an assembled from parts one, with an RTX 4090 and
a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is
probably not even vaguely running the latest BIOS, seeing as the
base hardware was built many years ago, although the GPU has been
swapped around since then. Everything is in a pretty roomy 4U case,
but if the failure was consistent we'd have assumed cooling issues.)
I don't really have any theories for what could be going on, but I
suppose I should try to find a GPU stress test program that exercises
every last corner of the GPU's capabilities at full power rather
than using only one or two parts at a time. On CPUs, different
loads light up different functional units, and
I assume the same is true on GPUs, so perhaps the problem is in one
specific functional unit or a combination of them.
(Although this doesn't explain why the GPU stress test program was
able to cause the problem on Saturday but not today, unless a full
reboot didn't completely clear out the GPU's state. Possibly we
should physically power this machine off entirely for long enough
to dissipate any lingering things.)
For a while, X terminals
were a reasonably popular way to give people comparatively inexpensive
X desktops. These X terminals relied on X's network transparency
so that only the X server had to run on the X terminal itself, with
all of your terminal windows and other programs running on a server
somewhere and just displaying on the X terminal. For a long time,
using a big server and a lab full of X terminals was significantly
cheaper than setting up a lab full of actual workstations (until
inexpensive and capable PCs showed up).
Given that X started with network transparency and X terminals are
so obvious, you might be surprised to find out that X didn't start
with them.
In the early days, X ran on workstations. Some of them were diskless
workstations, and on some of them (especially the diskless ones),
you would log in to a server somewhere to do a lot of your more
heavy duty work. But they were full workstations, with a full local
Unix environment and you expected to run your window manager and
other programs locally even if you did your real work on servers.
Although probably some people who had underpowered workstations
sitting around experimented with only running the X server locally,
with everything else done remotely (except perhaps the window
manager).
The first X terminals arrived only once X was reasonably well
established as the successful cross-vendor Unix windowing system.
NCD,
who I suspect were among the first people to make an X terminal,
was founded only in 1987 and of course didn't immediately ship a
product (it may have shipped its first product in 1989). One
indication of the delay in X terminals is that XDM was only
released with X11R3, in October of 1988. You technically didn't
need XDM to have an X terminal, but it made life much easier, so
its late arrival is a sign that X terminals didn't arrive much
before then.
(It's quite possible that the possibility for an 'X terminal' was
on people's minds even in the early days of X. The Bell Labs Blit was a
'graphical terminal' that had papers written and published about
it sometime in 1983 or 1984, and the Blit was definitely known in
various universities and so on. Bell Labs even gave people a few
of them, which is part of how I wound up using one for a while.
Sadly I'm not sure what happened to it in the end, although by now
it would probably be a historical artifact.)
If you didn't have XDM available or didn't want to have to rely on
it, you could give your X terminal the ability to open up a local
terminal window that ran a telnet client. To start up an X environment,
people would telnet into their local server, set $DISPLAY (or have
it automatically set by the site's login scripts), and start at
least their window manager by hand. This required your X terminal
to not use any access control (at least when you were doing the
telnet thing), but strong access control wasn't exactly an X terminal
feature in the first place.
I wrote about a performance mystery with WireGuard on 10G Ethernet, and since then I've done additional
measurements with results that both give some clarity and leave me
scratching my head a bit more. So here is what I know about the
general performance characteristics of Linux kernel WireGuard on a
mixture of Ubuntu 22.04 and 24.04 servers with stock settings, and
using TCP streams inside the WireGuard tunnels (because the high
bandwidth thing we care about runs over
TCP).
CPU performance is important even when WireGuard isn't saturating
the CPU.
CPU performance seems to be more important on the receiving side
than on the sending side. If you have two machines, one faster
than the other, you get more bandwidth sending a TCP stream from
the slower machine to the faster one. I don't know if this is an
artifact of the Linux kernel implementation or if the WireGuard
protocol requires the receiver to do more work than the sender.
There seems to be a single-peer bandwidth limit (related to CPU
speeds). You can increase the total WireGuard bandwidth of a given
server by talking to more than one peer.
When talking to a single peer, there's both a unidirectional
bandwidth limit and a bidirectional bandwidth limit. If you
send and receive to a single peer at once, you don't get the
sum of the unidirectional send and unidirectional receive; you
get less.
There's probably also a total WireGuard bandwidth limit that, in our
environment, falls short of 10G bandwidth (ie, a server talking
WireGuard to multiple peers can't saturate its 10G connection,
although maybe it could if I had enough peers in my test setup).
The best performance between a pair of WireGuard peers I've gotten
is from two servers with Xeon E-2226G CPUs; these can push their
10G Ethernet to about 850 MBytes/sec of WireGuard bandwidth in one
direction and about 630 MBytes/sec in each direction if they're
both sending and receiving. These servers (and other servers with
slower CPUs) can basically saturate their 10G-T network links with
plain (non-WireGuard) TCP.
If I was to build a high performance 'WireGuard gateway' today, I'd
build it with a fast CPU and dual 10G networks, with WireGuard
traffic coming in (and going out) one 10G interface and the resulting
gatewayed traffic using the other. WireGuard on fast CPUs can run
fast enough that a single 10G interface could limit total bandwidth
under the right (or wrong) circumstances; segmenting WireGuard and
clear traffic onto different interfaces avoids that.
(A WireGuard gateway that only served clients at 1G or less would
likely be perfectly fine with a single 10G interface and reasonably
fast CPUs. But I'd want to test how many 1G clients it took to reach
the total WireGuard bandwidth limit on a 10G WireGuard server before
I was completely confident about that.)
As a followup on discovering that WireGuard can saturate a 1G
Ethernet (on Linux), I set up WireGuard on
some slower servers here that have 10G networking. This isn't an
ideal test but it's more representative of what we would see with
our actual fileservers, since I used spare
fileserver hardware. What I got out of it was a performance and CPU
usage mystery.
What I expected to see was that WireGuard performance would top out
at some level above 1G as the slower CPUs on both the sending and
the receiving host ran into their limits, and I definitely wouldn't
see them drive the network as fast as they could without WireGuard.
What I actually saw was that WireGuard did hit a speed limit but
the CPU usage didn't seem to saturate, either for kernel WireGuard
processing or for the iperf3 process. These machines can manage to
come relatively close to 10G bandwidth with bare TCP, while with
WireGuard they were running around 400 MBytes/sec of on the wire
bandwidth (which translates to somewhat less inside the WireGuard
connection, due to overheads).
One possible explanation for this is increased packet handling
latency, where the introduction of WireGuard adds delays that keep
things from running at full speed. Another possible explanation is
that I'm running into CPU limits that aren't obvious from simple
tools like top and htop. One interesting thing is that if I do a
test in both directions at once (either an iperf3 bidirectional
test or two iperf3 sessions, one in each direction), the bandwidth
in each direction is slightly over half the unidirectional bandwidth
(while a bidirectional test without WireGuard runs at full speed
in both directions at once). This certainly makes it look like
there's a total WireGuard bandwidth limit in these servers somewhere;
unidirectional traffic gets basically all of it, while bidirectional
traffic splits it fairly between each direction.
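(For reference, these tests are essentially plain iperf3 runs aimed
at the other end's WireGuard-internal address; the address below is
a stand-in for whatever you've assigned:

iperf3 -s                          # on the receiving peer
iperf3 -c 192.168.200.2            # unidirectional test
iperf3 -c 192.168.200.2 --bidir    # both directions at once

The '--bidir' option needs a reasonably recent iperf3, but the
versions in Ubuntu 22.04 and 24.04 both have it.)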
I looked at 'perf top' on the receiving 10G machine and kernel spin
lock stuff seems to come in surprisingly high. I tried having a 1G
test machine also send WireGuard traffic to the receiving 10G test
machine at the same time and the incoming bandwidth does go up by
about 100 Mbytes/sec, so perhaps on these servers I'm running into
a single-peer bandwidth limitation. I can probably arrange to test
this tomorrow.
(I can't usefully try both of my 1G WireGuard test machines at once
because they're both connected to the same 1G switch, with a 1G
uplink into our 10G switch fabric.)
PS: The two 10G servers are running Ubuntu 24.04 and Ubuntu 22.04
respectively with standard kernels; the faster server with more
CPUs was the 'receiving' server here, and is running 24.04. The two
1G test servers are running Ubuntu 24.04.
I'm used to thinking of encryption as a slow thing that can't deliver
anywhere near to network saturation, even on basic gigabit Ethernet
connections. This is broadly the experience we see with our current VPN servers,
which struggle to turn in more than relatively anemic bandwidth
with OpenVPN and L2TP, and so for a long time I assumed it would
also be our experience with WireGuard
if we tried to put anything serious behind it. I'd seen the 2023
Tailscale blog post about this
but discounted it as something we were unlikely to see; their
kernel throughput on powerful sounding AWS nodes was anemic by 10G
standards, so I assumed our likely less powerful servers wouldn't
even get 1G rates.
Today, for reasons beyond the scope of this entry, I wound up
wondering how fast we could make WireGuard go. So I grabbed a
couple of spare servers we had with reasonably modern CPUs (by our
limited standards), put our standard Ubuntu 24.04 on them, and took
a quick look to see how fast I could make them go over 1G networking.
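For reference, a basic two-peer setup of this sort looks something
like the following wg-quick configuration (the addresses, hostname,
and key placeholders here are made up, not our actual values):

# /etc/wireguard/wg0.conf on the first server
[Interface]
Address = 192.168.200.1/24
ListenPort = 51820
PrivateKey = <this server's private key>

[Peer]
PublicKey = <the other server's public key>
AllowedIPs = 192.168.200.2/32
Endpoint = server2.example.org:51820

You bring this up with 'wg-quick up wg0' on both servers (with the
addresses, keys, and endpoint swapped around on the second one), run
'iperf3 -s' on one end, and point 'iperf3 -c' at the other end's
192.168.200.* address.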
To my surprise, the answer is that WireGuard can saturate that 1G
network with no particularly special tuning, and the system CPU
usage is relatively low (4.5% on the client iperf3 side, 8% on the
server iperf3 side; each server has a single Xeon E-2226G). The low
usage suggests that we could push well over 1G of WireGuard bandwidth
through a 10G link, which means that I'm going to set one up for
testing at some point.
While the Xeon E-2226G is not a particularly impressive CPU, it's
better than the CPUs our NFS fileservers
have (the current hardware has Xeon Silver 4410Ys). But I suspect
that we could sustain over 1G of WireGuard bandwidth even on them,
if we wanted to terminate WireGuard on the fileservers instead of
on a 'gateway' machine with a fast CPU (and a 10G link).
More broadly, I probably need to reset my assumptions about the
relative speed of encryption as compared to network speeds. These
days I suspect a lot of encryption methods can saturate a 1G network
link, at least in theory, since I don't think WireGuard is exceptionally
good in this respect (as I understand it, encryption speed wasn't
particularly a design goal; it was designed to be secure first).
Actual implementations may vary for various reasons so perhaps our
VPN servers need some tuneups.
(The actual bandwidth achieved inside WireGuard is less than the
1G data rate because simply being encrypted adds some overhead.
This is also something I'm going to have to remember when doing
future testing; if I want to see how fast WireGuard is driving the
underlying networking, I should look at the underlying networking
data rate, not necessarily WireGuard's rate.)