
The broad state of ZFS on Illumos, Linux, and FreeBSD (as I understand it)

By: cks

Once upon a time, Sun developed ZFS and put it in Solaris, which was good for us. Then Sun open-sourced Solaris as 'OpenSolaris', including ZFS, although not under the GPL (a licensing choice that made people sad and that Scott McNealy is on record as regretting). ZFS development continued in Solaris and thus in OpenSolaris until Oracle bought Sun and soon afterward closed the Solaris source again (in 2010); while Oracle continued ZFS development in Oracle Solaris, we can ignore that. OpenSolaris was transmogrified into Illumos, and various Illumos distributions formed, such as OmniOS (which we used for our second generation of ZFS fileservers).

Well before Oracle closed Solaris, separate groups of people ported ZFS to FreeBSD and to Linux, where the effort was known as "ZFS on Linux" (ZoL). Since the Linux kernel community felt that ZFS's license wasn't compatible with the kernel's license, ZoL was an entirely out of (kernel) tree effort, while FreeBSD was able to accept ZFS into their kernel tree (I believe all the way back in 2008). Both ZFS on Linux and FreeBSD took changes from OpenSolaris into their versions up until Oracle closed Solaris in 2010. After that, open source ZFS development split into three mostly separate strands.

(In theory OpenZFS was created in 2013. In practice I think OpenZFS at the time was not doing much beyond coordination of the three strands.)

Over time, a lot more people wanted to build machines using ZFS on top of FreeBSD or Linux (including us) than wanted to keep using Illumos distributions. Not only was Illumos a different environment, but Illumos and its distributions didn't see the level of developer activity that FreeBSD and Linux did, which resulted in driver support issues and other problems (cf). For ZFS, the consequence of this was that many more improvements to ZFS itself started happening in ZFS on Linux and in FreeBSD (I believe to a lesser extent) than were happening in Illumos or OpenZFS, the nominal upstream. Over time the split of effort between Linux and FreeBSD became an obvious problem and eventually people from both sides got together. This resulted in ZFS on Linux v2.0.0 becoming 'OpenZFS 2.0.0' in 2020 (see also the Wikipedia history) and also becoming portable to FreeBSD, where it became the FreeBSD kernel ZFS implementation in FreeBSD 13.0 (cf).

The current state of OpenZFS is that it's co-developed for both Linux and FreeBSD. The OpenZFS ZFS repository routinely has FreeBSD specific commits, and as far as I know OpenZFS's test suite is routinely run on a variety of FreeBSD machines as well as a variety of Linux ones. I'm not sure how OpenZFS work propagates into FreeBSD itself, but it does (some spelunking of the FreeBSD source repository suggests that there are periodic imports of the latest changes). On Linux, OpenZFS releases and development versions propagate to Linux distributions in various ways (some of them rather baroque), including people simply building their own packages from the OpenZFS repository.

Illumos continues to use and maintain its own version of ZFS, which it considers separate from OpenZFS. There is an incomplete Illumos project discussion on 'consuming' OpenZFS changes (via, also), but my impression is that very few changes move from OpenZFS to Illumos. My further impression is that there is basically no one on the OpenZFS side who is trying to push changes into Illumos; instead, OpenZFS people consider it up to Illumos to pull changes, and Illumos people aren't doing much of that for various reasons. At this point, if there's an attractive ZFS change in OpenZFS, the odds of it appearing in Illumos on a timely basis appear low (to put it one way).

(Some features have made it into Illumos, such as sequential scrubs and resilvers, which landed in issue 10405. This feature originated in what was then ZoL and was ported into Illumos.)

Even if Illumos increases the pace of importing features from OpenZFS, I don't ever expect it to be on the leading edge and I think that's fine. There have definitely been various OpenZFS features that needed some time before they became fully ready for stable production use (even after they appeared in releases). I think there's an ecological niche for a conservative ZFS that only takes solidly stable features, and that fits Illumos's general focus on stability.

PS: I'm out of touch with the Illumos world these days, so I may have mis-characterized the state of affairs there. If so, I welcome corrections and updates in the comments.

ZFS snapshots aren't as immutable as I thought, due to snapshot metadata

By: cks

If you know about ZFS snapshots, you know that one of their famous properties is that they're immutable; once a snapshot is made, its state is frozen. Or so you might casually describe it, but that description is misleading. What is frozen in a ZFS snapshot is the state of the filesystem (or zvol) that it captures, and only that. In particular, the metadata associated with the snapshot can and will change over time.

(When I say it this way it sounds obvious, but for a long time my intuition about how ZFS operated was misled by me thinking that all aspects of a snapshot had to be immutable once made and trying to figure out how ZFS worked around that.)

One visible place where ZFS updates the metadata of a snapshot is to maintain information about how much unique space the snapshot is using. Another is that when a ZFS snapshot is deleted, other ZFS snapshots may require updates to adjust the list of snapshots (every snapshot points to the previous one) and the ZFS deadlist of blocks that are waiting to be freed.

Mechanically, I believe that various things in a dsl_dataset_phys_t are mutable, with the exception of things like the creation time, the creation txg, and the block pointer, which points to the actual filesystem data of the snapshot. Things like the previous snapshot information have to be mutable (you might delete the previous snapshot), and things like the deadlist and the unique bytes are mutated as part of operations like snapshot deletion. I'm not sure about the other fields.

(See also my old entry on a broad overview of how ZFS is structured on disk. A snapshot is a 'DSL dataset' and it points to the object set for that snapshot. The root directory of a filesystem DSL dataset, snapshot or otherwise, is at a fixed number in the object set; it's always object 1. A snapshot freezes the object set as of that point in time.)

PS: Another mutable thing about snapshots is their name, since 'zfs rename' can change that. The manual page even gives an example of using (recursive) snapshot renaming to keep a rolling series of daily snapshots.
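
As an illustration of the rolling rename trick, here's a minimal sketch along the lines of the manual page's example (the 'tank/home' filesystem and the snapshot names are made up for this sketch; the real example keeps more days and is more careful):

  # drop the oldest daily snapshot, shift the others down by one,
  # and then take today's snapshot under the newly freed name
  zfs destroy -r tank/home@daily-3
  zfs rename -r tank/home@daily-2 @daily-3
  zfs rename -r tank/home@daily-1 @daily-2
  zfs snapshot -r tank/home@daily-1

Every existing snapshot keeps its frozen filesystem contents through this; only the snapshot metadata (here, the name) changes.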

How I think OpenZFS's 'written' and 'written@<snap>' dataset properties work

By: cks

Yesterday I wrote some notes about ZFS's 'written' dataset property, where the short summary is that 'written' reports the amount of space written in a snapshot (ie, that wasn't in the previous snapshot), and 'written@<snapshot>' reports the amount of space written since the specified snapshot (up to either another snapshot or the current state of the dataset). In that entry, I left un-researched the question of how ZFS actually gives us those numbers; for example, whether there was a mechanism in place similar to the complicated one for 'used' space. I've now looked into this and as far as I can see the answer is that ZFS determines this information on the fly.

The guts of the determination are in dsl_dataset_space_written_impl(), which has a big comment that I'm going to quote wholesale:

Return [...] the amount of space referenced by "new" that was not referenced at the time the bookmark corresponds to. "New" may be a snapshot or a head. The bookmark must be before new, [...]

The written space is calculated by considering two components: First, we ignore any freed space, and calculate the written as new's used space minus old's used space. Next, we add in the amount of space that was freed between the two time points, thus reducing new's used space relative to old's. Specifically, this is the space that was born before zbm_creation_txg, and freed before new (ie. on new's deadlist or a previous deadlist).

(A 'bookmark' here is an internal ZFS thing.)

When this talks about 'used' space, this is not the "used" snapshot property; this is the amount of space the snapshot or dataset refers to, including space shared with other snapshots. If I'm understanding the code and the comment right, the reason we add back in freed space is that otherwise you could wind up with a negative number. Suppose you wrote a 2 GB file, made one snapshot, deleted the file, and then made a second snapshot. The difference in space referenced between the two snapshots is roughly negative 2 GB, but we can't report that as 'written', so we go through the old stuff that got deleted and add its size back in to make the number positive again.

To determine the amount of space that's been freed between the bookmark and "new", the ZFS code walks backward through all snapshots from "new" to the bookmark, calling another ZFS function to determine how much relevant space got deleted. This uses the ZFS deadlists that ZFS is already keeping track of to know when it can free an object.

This code is used both for 'written@<snap>' and 'written'; the only difference between them is that when you ask for 'written', the ZFS kernel code automatically finds the previous snapshot for you.

Some notes on OpenZFS's 'written' dataset property

By: cks

ZFS snapshots and filesystems have a 'written' property, and a related 'written@snapshot' one. These are documented as:

written
The amount of space referenced by this dataset, that was written since the previous snapshot (i.e. that is not referenced by the previous snapshot).

written@snapshot
The amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot. [...]

(Apparently I never noticed the 'written' property before recently, despite it being there from very long ago.)

The 'written' property is related to the 'used' property, and it's both more confusing and less confusing as it relates to snapshots. Famously (but not famously enough), for snapshots the used property ('USED' in the output of 'zfs list') only counts space that is exclusive to that snapshot. Space that's only used by snapshots but that is shared by more than one snapshot is in 'usedbysnapshots'.

To understand 'written' better, let's do an experiment: we'll make a snapshot, write a 2 GByte file, make a second snapshot, write another 2 GByte file, make a third snapshot, and then delete the first 2 GB file. Since I've done this, I can tell you the results.
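
If you want to reproduce this, a minimal sketch of the experiment as shell commands looks something like the following (the 'tank/demo' filesystem, its default mountpoint, and the snapshot names are all made up; /dev/urandom is used so that compression can't shrink the data):

  # snapshot, write 2 GBytes, snapshot, write 2 GBytes more, snapshot,
  # and then delete the first data file
  zfs snapshot tank/demo@first
  dd if=/dev/urandom of=/tank/demo/file1 bs=1M count=2048
  zfs snapshot tank/demo@second
  dd if=/dev/urandom of=/tank/demo/file2 bs=1M count=2048
  zfs snapshot tank/demo@third
  rm /tank/demo/file1

  # then look at 'written' for the filesystem and all of its snapshots
  zfs get -r -t all written tank/demo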

If there are no other snapshots of the filesystem, the first snapshot's 'written' value is the full size of the filesystem at the time it was made, because everything was written before it was made. The second snapshot's 'written' is 2 GBytes, the data file we wrote between the first and the second snapshot. The third snapshot's 'written' is another 2 GB, for the second file we wrote. However, at the end, after we delete one of the data files, the filesystem's 'written' is small (certainly not 2 GB), and so would be the 'written' of a fourth snapshot if we made one.

The reason the filesystem's 'written' is so small is that ZFS is counting concrete on-disk (new) space. Deleting a 2 GB file frees up a bunch of space but it doesn't require writing very much to the filesystem, so the 'written' value is low.

If we look at the 'used' values for all three snapshots, they're all going to be really low. This is because neither data file is unique to a single snapshot: the first 2 GByte file is shared between the second and third snapshots (now that it's been deleted from the live filesystem), while the second 2 GByte file is shared between the third snapshot and the live filesystem. The first file's space therefore shows up in 'usedbysnapshots' rather than in any snapshot's 'used', and the second file still counts as space used by the filesystem itself.

(ZFS has a somewhat complicated mechanism to maintain all of this information.)

There is one interesting 'written' usage that appears to show you deleted space, but it is a bit tricky. The manual page implies that the normal usage of 'written@<snapshot>' is to ask for it for the filesystem itself; however, in experimentation you can ask for it for a snapshot too. So take the three snapshots above, and the filesystem after deleting the first data file. If you ask for 'written@first' for the filesystem, you will get 2 GB, but if you ask for 'written@first' for the third snapshot, you will get 4 GB. What the filesystem appears to be reporting is how much still-live data has been written between the first snapshot and now, which is only 2 GB because we deleted the other 2 GB. Meanwhile, all four GB are still alive in the third snapshot.
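
Using the hypothetical names from the sketch above, the two queries are (the first asks the live filesystem, the second asks the third snapshot):

  zfs get written@first tank/demo
  zfs get written@first tank/demo@third

Per the numbers above, the first should report about 2 GB and the second about 4 GB.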

My conclusion from looking into this is that I can use 'written' as an indication of how much new data a snapshot has captured, but I can't use it as an indication of how much changed in a snapshot. As I've seen, deleting data is a potentially big change but a small 'written' value. If I'm understanding 'written' correctly, one useful thing about it is that it shows roughly how much data an incremental 'zfs send' of just that snapshot would send. Under some circumstances it will also give you an idea of how much data your backup system may need to back up; however, this works best if people are creating new files (and deleting old ones), instead of updating or appending to existing files (where ZFS only updates some blocks but a backup system probably needs to re-save the whole thing).

Revisiting ZFS's ZIL, separate log devices, and writes

By: cks

Many years ago I wrote a couple of entries about ZFS's ZIL optimizations for writes and then an update for separate log devices. In completely unsurprising news, OpenZFS's behavior has changed since then and gotten simpler. The basic background for this entry is the flow of activity in the ZIL (ZFS Intent Log).

When you write data to a ZFS filesystem, your write will be classified as 'indirect', 'copied', or 'needcopy'. A 'copied' write is immediately put into the in-memory ZIL, even before the ZIL is flushed to disk. A 'needcopy' write will be put into the in-memory ZIL if a (filesystem) sync() or fsync() happens, and then written to disk as part of the ZIL flush. An 'indirect' write will always be written to its final place in the filesystem even if the ZIL is flushed to disk, with the ZIL just containing a pointer to the regular location (although at that point the ZIL flush depends on those regular writes). ZFS keeps metrics on how much you have of all of these, and they're potentially relevant in various situations.

As of the current development version of OpenZFS (and I believe for some time in released versions), how writes are classified is like this, in order:

  1. If you have 'logbias=throughput' set or the write is an O_DIRECT write, it is an indirect write.

  2. If you don't have a separate log device and the write is equal to or larger than zfs_immediate_write_sz (32 KBytes by default), it is an indirect write.

  3. If this is a synchronous write, it is a 'copied' write, including if your filesystem has 'sync=always' set.

  4. Otherwise it's a 'needcopy' write.

If your system is doing normal IO (well, normal writes) and you don't have a separate log device, large writes are indirect writes and small writes are 'needcopy' writes. This keeps both of them out of the in-memory ZIL. However, on our systems I see a certain volume of 'copied' writes, suggesting that some programs or ZFS operations force synchronous writes. This seems to be especially common on our ZFS based NFS fileservers, but it happens to some degree even on the ZFS fileserver that mostly does local IO.
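
(On Linux, I believe the place to see these classification metrics is the global ZIL kstats; the exact field names may vary a bit between OpenZFS versions:

  # counts and bytes for each write classification
  grep -E 'zil_itx_(indirect|copied|needcopy)_' /proc/spl/kstat/zfs/zil

The '_count' fields are how many writes fell into each class and the '_bytes' fields are how much data was involved.)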

The corollary to this is that if you do have a separate log device and you don't do O_DIRECT writes (and don't set logbias=throughput), all of your writes will go to your log device during ZIL flushes, because they'll fall through the first two cases and into case three or four. If you have a sufficiently high write volume combined with ZIL flushes, this may increase the size of a separate log device that you want and also make you want one that has a high write bandwidth (and can commit things to durable storage rapidly).

(We don't use any separate log devices for various reasons and I don't have well informed views of when you should use them and what sort of device you should use.)

Once upon a time (when I wrote my old entry), there was a zil_slog_limit tunable that pushed some writes back to being indirect writes even if you had a separate log device, under somewhat complex circumstances. That was apparently removed in 2017 and was partly not working even before then (also).

We've chosen to 'modernize' all of our ZFS filesystems

By: cks

We are almost all of the way to the end of a multi-month process of upgrading our ZFS fileservers from Ubuntu 22.04 to 24.04 by also moving to more recent hardware. This involved migrating all of our pools and filesystems, which meant moving terabytes of data. Our traditional way of doing this sort of migration (which we used, for example, when going from our OmniOS fileservers to our Linux fileservers) was the good old reliable 'zfs send | zfs receive' approach of sending snapshots over. This sort of migration is fast, reliable, and straightforward. However, it has one drawback, which is that it preserves all of the old filesystem's history, including things like the possibility of panics and other lurking problems.

We've been running ZFS for long enough that we had some ZFS filesystems that were still at ZFS filesystem version 4. In late 2023, we upgraded them all to ZFS filesystem version 5, and after that we got some infrequent kernel panics. We could never reproduce the kernel panics and they were very infrequent, but 'infrequent' is not the same as 'never' (the previous state of affairs), and it seemed likely that they were in some way related to upgrading our filesystem versions, which in turn was related to us having some number of very old filesystems. So in this migration, we deliberately decided to 'migrate' filesystems the hard way. Which is to say, rather than migrating the filesystems, we migrated the data with user level tools, moving it into pools and filesystems that were created from scratch on our new Ubuntu 24.04 fileservers (which led us to discover that default property values sometimes change in ways that we care about).

(The filesystems reused the same names as their old versions, because that keeps things easier for our people and for us.)
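
(As a very rough sketch of what 'migrating the data with user level tools' looks like, with entirely made up pool, filesystem, and host names, and ignoring all of the care you actually need around permissions, sparse files, and verification:

  # on the new fileserver: create the filesystem fresh, so it picks up
  # modern defaults and metadata, then copy the data in at the user level
  zfs create newpool/h/example
  rsync -aHAXS oldserver:/oldpool/h/example/ /newpool/h/example/

This is, of course, much slower than 'zfs send | zfs receive'.)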

It's possible that this user level rewriting of all data has wound up laying things out in a better way (although all of this is on SSDs), and it's certainly ensured that everything has modern metadata associated with it and so on. The 'fragmentation' value of the new pools on the new fileservers is certainly rather lower than the value for most old pools, although what that means is a bit complicated.

There's a bit of me that misses the deep history of our old filesystems, some of which dated back to our first generation Solaris ZFS fileservers. However, on the whole I'm happy that we're now using filesystems that don't have ancient historical relics and peculiarities that may not be well supported by OpenZFS's code any more (and which were only likely to get less tested and more obscure over time).

(Our pools were all (re)created from scratch as part of our migration from OmniOS to Linux, and anyway would have been remade from scratch again in this migration even if we moved the filesystems with 'zfs send'.)

ZFS's delayed compression of written data (when compression is enabled)

By: cks

In a comment on my entry about how Unix files have at least two sizes, Leah Neukirchen said that 'ZFS compresses asynchronously' and noted that this could cause the reported block size of a just-written file to change over time. This way of describing ZFS's behavior made me twitch and it took me a bit of thinking to realize why. What ZFS does is delayed compression (which is asynchronous with your user level write() calls), but not true 'asynchronous compression' that happens later at an unpredictable time.

Like basically all filesystems, ZFS doesn't immediately start writing data to disk when you do a write() system call. Instead it buffers this data in memory for a while and only writes it later. As part of this, ZFS doesn't immediately decide where on disk the data will be written (this is often called 'delayed allocation' and is common in many filesystems) and otherwise prepare it to be written out. As part of this delayed allocation and preparation, ZFS doesn't immediately compress your written data, and as a result ZFS doesn't know how many disk blocks your data will take up. Instead your data is only compressed and has disk blocks allocated for it as part of ZFS's pipeline of actually performing IO, when the data is flushed to disk, and only then is its physical block size known.
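
You can actually see this from user level. A sketch, assuming a filesystem with compression on, a made up path, and GNU tools (my understanding is that the first number will usually be smaller, because the space accounting only reflects blocks that have actually been allocated, compressed, and written out):

  # write very compressible data, then look at the file's block count
  # right away and again after a transaction group commit has happened
  yes | head -c 100M >/tank/demo/compressible
  stat -c '%b blocks just after writing' /tank/demo/compressible
  sleep 15    # txg commits normally start every five seconds
  stat -c '%b blocks after the data is on disk' /tank/demo/compressible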

However, once written to disk, the data's compression or lack of it is never changed (nor is anything else about it; ZFS never modifies data once it's written). For example, data isn't initially written in uncompressed form and then asynchronously compressed later. Nor is there anything that goes around asynchronously compressing or decompressing data if you turn on or off compression on a ZFS filesystem (or change the compression algorithm). This periodically irks people who wish they could turn compression on on an existing filesystem, or change the compression algorithm, and have this take effect 'in place' to shrink the amount of space the filesystem is using.

Delaying compressing data until you're writing it out is a sensible decision for a variety of reasons. One of them is that ZFS compresses your data in potentially large chunks, and you may not write() all of that chunk at once. If you wrote half a chunk now and then the other half later, before it got flushed to disk, it would be a waste of effort to compress your half a chunk now and then throw away that work when you compressed the whole chunk.

(I also suspect that it was simpler to add compression to ZFS as part of its IO pipeline than to do it separately. ZFS already had a multi-stage IO pipeline, so adding compression and decompression as another step was probably relatively straightforward.)

How ZFS knows and tracks the space usage of datasets

By: cks

Anyone who's ever had to spend much time with 'zfs list -t all -o space' knows the basics of ZFS space usage accounting, with space used by the datasets, data unique to a particular snapshot (the 'USED' value for a snapshot), data used by snapshots in total, and so on. But today I discovered that I didn't really know how it all worked under the hood, so I went digging in the source code. The answer is that ZFS tracks all of these types of space usage directly as numbers, and updates them as blocks are logically freed.

(Although all of these are accessed from user space as ZFS properties, they're not conventional dataset properties; instead, ZFS materializes the property version any time you ask, from fields in its internal data structures. Some of these fields are different and accessed differently for snapshots and regular datasets, for example what 'zfs list' presents as 'USED'.)
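
(For reference, the command in question and what it reports; the 'space' set of columns is a standard shorthand:

  # per-dataset space breakdown: NAME, AVAIL, USED, USEDSNAP, USEDDS,
  # USEDREFRESERV, and USEDCHILD
  zfs list -t all -o space

The USEDSNAP and USEDDS columns are the 'usedbysnapshots' and 'usedbydataset' properties respectively.)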

All changes to a ZFS dataset happen in ZFS transaction groups, which are assigned ever-increasing numbers, the 'transaction group number' (txg). This includes allocating blocks, which remember their 'birth txg', and making snapshots, which carry the txg they were made in and necessarily don't contain any blocks that were born after that txg. When ZFS wants to free a block in the live filesystem (either because you deleted the object or because you're writing new data and ZFS is doing its copy on write thing), it looks at the block's birth txg and the txg of the most recent snapshot; if the block is old enough that it has to be in that snapshot, then the block is not actually freed and the space for the block is transferred from 'USED' (by the filesystem) to 'USEDSNAP' (used only in snapshots). ZFS will then further check the block's txg against the txgs of snapshots to see if the block is unique to a particular snapshot, in which case its space will be added to that snapshot's 'USED'.

ZFS goes through a similar process when you delete a snapshot. As it runs around trying to free up the snapshot's space, it may discover that a block it's trying to free is now used only by one other snapshot, based on the relevant txgs. If so, the block's space is added to that snapshot's 'USED'. If the block is freed entirely, ZFS will decrease the 'USEDSNAP' number for the entire dataset. If the block is still used by several snapshots, no usage numbers need to be adjusted.

(Determining if a block is unique in the previous snapshot is fairly easy, since you can look at the birth txgs of the two previous snapshots. Determining if a block is now unique in the next snapshot (or for that matter is still in use in the dataset) is more complex and I don't understand the code involved; presumably it involves somehow looking at what blocks were freed and when. Interested parties can look into the OpenZFS code themselves, where there are some surprises.)

PS: One consequence of this is that there's no way after the fact to find out when space shifted from being used by the filesystem to used by snapshots (for example, when something large gets deleted in the filesystem and is now present only in snapshots). All you can do is capture the various numbers over time and then look at your historical data to see when they changed. The removal of snapshots is captured by ZFS pool history, but as far as I know this doesn't capture how the deletion affected the various space usage numbers.
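
A minimal sketch of capturing those numbers yourself (the filesystem name and the log file location are made up; you'd probably run something like this from cron):

  # append a timestamped, machine-readable record of the space numbers
  zfs get -Hp -o name,property,value used,usedbysnapshots,usedbydataset tank/demo | \
    sed "s/^/$(date +%s) /" >>/var/tmp/zfs-space-history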

Using a small ZFS recordsize doesn't save you space (well, almost never)

By: cks

ZFS filesystems have a famously confusing 'recordsize' property, which in the past I've summarized as the maximum logical block size of a filesystem object. Sometimes I've seen people suggest that if you want to save disk space, you should reduce your 'recordsize' from the default 128 KBytes. This is almost invariably wrong; in fact, setting a low 'recordsize' is more likely to cost you space.

How a low recordsize costs you space is straightforward. In ZFS, every logical block requires its own block pointer (with its DVAs and checksum) to point to it. The more logical blocks you have, the more block pointers you require and the more space they take up. As you decrease the 'recordsize' of a filesystem, files (well, filesystem objects in general) that are larger than your recordsize will use more and more logical blocks for their data and have more and more block pointers, taking up more and more space.

In addition, ZFS compression operates on logical blocks and must save at least one disk block's worth of space to be considered worthwhile. If you have compression turned on (and if you care about space usage, you should), the closer your 'recordsize' gets to the vdev's disk block size, the harder it is for compression to save space. The limit case is when you make 'recordsize' be the same size as the disk block size, at which point ZFS compression can't do anything.

(This is the 'physical disk block size', or more exactly the vdev's 'ashift', which these days should basically always be 4 KBytes or greater, not the disk's 'logical block size', which is usually still 512 bytes.)

The one case where a large recordsize can theoretically cost you disk space is if you have large files that are mostly holes and you don't have any sort of compression turned on (which these days means specifically turning it off). If you have a (Unix) file that has 1 KByte of data every 128 KBytes and is otherwise not written to, without compression and with the default 128 KByte 'recordsize', you'll get a bunch of 128 KByte blocks that have 1 KByte of actual data and 127 KBytes of zeroes. If you reduced your 'recordsize', you would still waste some space but more of it would be actual holes, with no space allocated. However, even the most minimal compression (a setting of 'compression=zle') will entirely eliminate this waste.

(The classical case of reducing 'recordsize' is helping databases out. More generally, you reduce 'recordsize' when you're rewriting data in place in small sizes (such as 4 KBytes or 16 KBytes) or appending data to a file in small sizes, because ZFS can only read and write entire logical blocks.)
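
A sketch of the classical database case, with a made up pool and filesystem name; the specific record size depends on the database's page size and the compression choice is just illustrative:

  # match the logical block size to the database's page size; compression
  # still operates per logical block, so it can stay on
  zfs create -o recordsize=16K -o compression=lz4 tank/db
  zfs get recordsize,compression tank/db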

PS: If you need a small 'recordsize' for performance, you shouldn't worry about the extra space usage, partly because you should also have a reasonable amount of free disk space to improve the performance of ZFS's space allocation.

ZFS properties sometimes change their default values over time

By: cks

For an assortment of reasons, we don't want ZFS to do compression on most of the filesystems on our fileservers. Some of these reasons are practical technical ones and some of them have to do with our particular local non-technical ('political') decisions around disk space allocation. Traditionally we've done this by the simple mechanism of not specifically enabling compression, because the default was off. Recently I discovered, more or less by coincidence, that OpenZFS had changed the default for ZFS compression from off to on between the version in Ubuntu 22.04 ('v2.1.5' plus Ubuntu changes) and the version in Ubuntu 24.04 ('v2.2.2' plus Ubuntu changes).

(This change was made in early March of 2022 and first appeared in v2.2.0. The change itself is discussed in pull request #13078.)

Another property that changed its default value in OpenZFS v2.2.0 is 'relatime'. This was apparently a change to match general Linux behavior, based on pull request #13614. Since we already specifically turn atime off, we might want to also disable relatime now that it defaults to on, or perhaps it won't have too much of an impact (and in general, atime and relatime may not work over NFS anyway).

These aren't big changes (and they're perfectly sensible ones), but to me they point out what should really have already been obvious, which is that OpenZFS can change the default values of properties over time. When you move to the new version of ZFS, you'll probably inherit these new default values, unless you're explicitly setting the properties to something. If you care about various properties having specific values, it's probably worth explicitly setting those values even if they're the current default.
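
A sketch of what that looks like in practice, with a made up filesystem name:

  # set the values we care about explicitly at creation time
  zfs create -o compression=off -o atime=off tank/example
  # or pin them down on an existing filesystem
  zfs set compression=off tank/example
  zfs set atime=off tank/example
  zfs set relatime=off tank/example

A future change in OpenZFS's defaults then won't quietly change the behavior of these filesystems.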

(To be explicit, I think that OpenZFS should make this sort of change to defaults when they have good reasons, which I feel they definitely did here. Our issues with compression are unusual and specific to our environment, and dealing with them is our problem.)

Some things on how ZFS System Attributes are stored

By: cks

To summarize, ZFS's System Attributes (SAs) are a way for ZFS to pack a somewhat arbitrary collection of additional information, such as the parent directory of things and symbolic link targets, into ZFS dnodes in a general and flexible way that doesn't hard code the specific combinations of attributes that can be used together. ZFS system attributes are normally stored in extra space in dnodes that's called the bonus buffer, but the system attributes can overflow to a spill block if necessary. I've written more about the high level side of this in my entry on ZFS SAs, but today I'm going to write up some concrete details of what you'd see when you look at a ZFS filesystem with tools like zdb.

When ZFS stores the SAs for a particular dnode, it simply packs all of their values together in a blob of data. It knows which part of the blob is which through an attribute layout, which tells it which attributes are in the layout and in what order. Attribute layouts are created and registered as they are needed, which is to say when some dnode wants to use that particular combination of attributes. Generally there are only a few combinations of system attributes that get used, so a typical ZFS filesystem will not have many SA layouts. System attributes are numbered, but the specific numbering may differ from filesystem to filesystem. In practice it probably mostly won't, since most attributes usually get registered pretty early in the life of a ZFS filesystem and in a predictable order.

(For example, the creation of a ZFS filesystem necessarily means creating a directory dnode for its top level, so all of the system attributes used for directories will immediately get registered, along with an attribute layout.)

The attribute layout for a given dnode is not fixed when the file is created; instead, it varies depending on what system attributes that dnode needs at the moment. The high level ZFS code simply sets or clears specific system attributes on the dnode, and the low(er) level system attribute code takes care of either finding or creating an attribute layout that matches the current set of attributes the dnode has. Many system attributes are constant over the life of the dnode, but I think others can come and go, such as the system attributes used for xattrs.

Every ZFS filesystem with system attributes has three special dnodes involved in this process, which zdb will report as the "SA master node", the "SA attr registration" dnode, and the "SA attr layouts" dnode. As far as I know, the SA master node's current purpose is to point to the other two dnodes. The SA attribute registry dnode is where the potentially filesystem specific numbers for attributes are registered, and the SA attribute layouts dnode is where the various layouts in use on the filesystem are tracked. The SA master (d)node itself is pointed to by the "ZFS master node", which is always object 1.

So let's use zdb to take a look at a typical case:

# zdb -dddd fs19-scratch-01/w/430 1
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        1    1   128K    512     8K     512    512  100.00  ZFS master node
[...]
               SA_ATTRS = 32 
[...]
# zdb -dddd fs19-scratch-01/w/430 32
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       32    1   128K    512      0     512    512  100.00  SA master node
[...]
               LAYOUTS = 36 
               REGISTRY = 35 

It's common for the registry and the layout to be consecutive, since they're generally allocated at the same time. On most filesystems they will have very low object numbers, since they were created when the filesystem was.

The registry is generally going to be pretty boring looking:

# zdb -dddd fs19-scratch-01/w/430 35
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       35    1   128K  1.50K     8K     512  1.50K  100.00  SA attr registration
[...]
       ZPL_SCANSTAMP =  20030012 : [32:3:18]
       ZPL_RDEV =  800000a : [8:0:10]
       ZPL_FLAGS =  800000b : [8:0:11]
       ZPL_GEN =  8000004 : [8:0:4]
       ZPL_MTIME =  10000001 : [16:0:1]
       ZPL_CTIME =  10000002 : [16:0:2]
       ZPL_XATTR =  8000009 : [8:0:9]
       ZPL_UID =  800000c : [8:0:12]
       ZPL_ZNODE_ACL =  5803000f : [88:3:15]
       ZPL_PROJID =  8000015 : [8:0:21]
       ZPL_ATIME =  10000000 : [16:0:0]
       ZPL_SIZE =  8000006 : [8:0:6]
       ZPL_LINKS =  8000008 : [8:0:8]
       ZPL_PARENT =  8000007 : [8:0:7]
       ZPL_MODE =  8000005 : [8:0:5]
       ZPL_PAD =  2000000e : [32:0:14]
       ZPL_DACL_ACES =  40013 : [0:4:19]
       ZPL_GID =  800000d : [8:0:13]
       ZPL_CRTIME =  10000003 : [16:0:3]
       ZPL_DXATTR =  30014 : [0:3:20]
       ZPL_DACL_COUNT =  8000010 : [8:0:16]
       ZPL_SYMLINK =  30011 : [0:3:17]

The names of these attributes come from the enum of known system attributes in zfs_sa.h. The important bit of their values is the '[16:0:1]' portion, which is a decoded version of the raw number. The format of the raw number is covered in sa_impl.h, but the short version is that the first number is the total length of the attribute's value in bytes, the third is its attribute number within the filesystem, and the middle number is an index of how to byteswap it if necessary (and sa.c has a nice comment about the whole scheme at the top).

(The attributes with a listed size of 0 store their data in extra special ways that are beyond the scope of this entry.)

The more interesting thing is the SA attribute layouts:

# zdb -dddd fs19-scratch-01/w/430 36
[...]
   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       36    1   128K    16K    16K     512    32K  100.00  SA attr layouts
[...]
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]
    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]

This particular filesystem has three attribute layouts that have been used by dnodes, and as you can see they are mostly the same. Layout 3 is the common subset, with all of the basic inode attributes you'd expect in a Unix filesystem; layout 2 adds attribute 21 (ZPL_PROJID), and layout 4 adds attribute 17 (ZPL_SYMLINK).

It's possible to have a lot more layouts than this. Here is the collection of layouts for my home desktop's home directory filesystem (which uses the same registered attribute numbers as the filesystem above, so you can look up there for them):

    4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  9 ]
    3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
    7 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19  9 ]
    2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
    5 = [ 5  6  4  12  13  7  11  0  1  2  3  8  10  16  19 ]
    6 = [ 5  6  4  12  13  7  11  0  1  2  3  8  21  16  19 ]

Incidentally, notice how these layout numbers aren't the same as the layout numbers on the first filesystem; layout 3 on the first filesystem is layout 2 on my home directory filesystem, layout 4 (symlinks) is layout 3, and layout 2 (project ID) is layout 6. The additional layouts in my home directory filesystem add xattrs (id 9) or 'rdev' (id 10) to some combination of the other attributes.

One of the interesting aspects of this is that you can use the SA attribute layouts to tell if a ZFS filesystem definitely doesn't have some sort of files in it. For example, we know that there are no device special files or files with xattrs in /w/430, because there are no SA attribute layouts that include those attributes. And neither of these two filesystems has ever had ACLs set on any of their files, because neither of them has a layout with any of the SA ACL attributes.

(Attribute layouts are never removed once created, so a filesystem with a layout with the 'rdev' attribute in it may still not have any device special files in it right now; they could all have been removed.)

Unfortunately, I can't see any obvious way to get zdb to tell you what the current attribute layout is for a specific dnode. At best you have to try to deduce it from what 'zdb -dddd' will print for the dnode's attributes.
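
What you can do is dump the dnode of a specific file and look at which attributes get printed for it. As far as I know, a ZFS file's inode number is its object number, so a sketch of this (with a made up file name on the filesystem from above) is:

  # find the file's object number, then dump its dnode
  obj=$(stat -c %i /w/430/somefile)
  zdb -dddd fs19-scratch-01/w/430 "$obj"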

(I've recently acquired a reason to dig into the details of ZFS system attributes.)

Sidebar: A brief digression on xattrs in ZFS

As covered in zfsprops(7)'s section on 'xattr=', there are two storage schemes for xattrs in ZFS (well, in OpenZFS on Linux and FreeBSD). At the attribute level, 'ZPL_XATTR' is the older, more general 'store it in directories and files' approach, while 'ZPL_DXATTR' is the 'store it as part of system attributes' one ('xattr=sa'). When dumping a dnode in zdb, zdb will directly print SA xattrs, but for directory xattrs it simply reports 'xattr = <object id>', where the object ID is for the xattr directory. To see the names of the xattrs set on such a file, you need to also dump the xattr directory object with zdb.

(Internally the SA xattrs are stored as a nvlist, because ZFS loves nvlists and nvpairs, more or less because Solaris did at the time.)

ZFS's transactional guarantees from a user perspective

By: cks

I said recently on the Fediverse that ZFS's transactional guarantees were rather complicated both with and without fsync(). I've written about these before in terms of transaction groups and the ZFS Intent Log (ZIL), but that obscured the user visible behavior under the technical details. So here's an attempt at describing just the visible behavior, hopefully in a way that people can follow despite how it gets complicated.

ZFS has two levels of transactional behavior. The basic layer is what happens when you don't use fsync() (or the filesystem is ignoring it). At this level, all changes to a ZFS filesystem are strongly ordered by the time they happened. ZFS may lose some activity at the end, but if you did operation A before operation B and there is a crash, the possible states afterward are nothing, A, or A and B; you can never have B without A. This strictly time ordered view of filesystem changes is periodically flushed to disk by ZFS; in modern ZFS, such a flush is typically started every five seconds (although completing a flush can take some time). This is generally called a transaction group (txg) commit.

The second layer of transactional behavior comes in if you fsync() something. When you fsync() something (and fsync is enabled on the filesystem, which is the default), all uncommitted metadata changes are immediately flushed to disk along with whatever uncommitted file data changes you requested a fsync() for (if you fsync'd a file instead of a directory). If several processes request fsync()s at once, all of their requests will be merged together, so a single immediate flush may include data for multiple files. Uncommitted file changes that no one requested a fsync() for will not be immediately flushed and will instead wait for the next regular non-fsync() flush (the next txg commit).

(This is relatively normal behavior for fsync(), except that on most filesystems a fsync() doesn't immediately flush all metadata changes. Metadata changes include things like creating, renaming, or removing files.)

A fsync() can break the strict time order of ZFS changes that exists in the basic layer. If you write data to A, write data to B, fsync() B but not A, and ZFS crashes immediately, the data for B will still be there but the change to A may have been lost. In some situations this can result in zero length files even though they were intended to have data. However, if enough time goes by everything from before the fsync() will have been flushed out as part of the non-fsync() flush process.
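
(A user level illustration of this ordering, with made up file names; GNU dd's 'conv=fsync' makes it fsync() the output file before exiting:

  # A is written with no fsync(); B is written and then fsync()'d
  dd if=/dev/urandom of=/tank/demo/A bs=1M count=16
  dd if=/dev/urandom of=/tank/demo/B bs=1M count=16 conv=fsync

If the system crashes right after the second dd finishes, B's data is durable but A may come back truncated or empty, even though A was written first.)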

As a technical detail, ZFS makes it so that all of the changes that are part of a particular periodic flush are tied to each other (if there have been no fsyncs to meddle with the ordering); either all of them will appear after a crash or none of them will. This can be used to create atomic groups of changes that will always appear together (or be lost together), by making sure that all changes are part of the same periodic flush (in ZFS jargon, they are part of the same transaction group (txg)). However, ZFS doesn't give programs any explicit way to do this, and this atomic grouping can be messed up if someone fsync()s at an inconvenient time.

The flow of activity in the ZFS Intent Log (as I understand it)

By: cks

The ZFS Intent Log (ZIL) is a confusing thing once you get into the details, and for reasons beyond the scope of this entry I recently needed to sort out the details of some aspects of how it works. So here is what I know about how things flow into the ZIL, both in memory and then on to disk.

(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each dataset (a filesystem or a zvol) has its own logically separate ZIL. We talk about 'the ZIL' as a convenience.)

When you perform activities that modify a ZFS dataset, each activity creates its own ZIL log record (a transaction in ZIL jargon, sometimes called an 'itx', probably short for 'intent transaction') that is put into that dataset's in-memory ZIL log. This includes both straightforward data writes and metadata activity like creating or renaming files. You can see a big list of all of the possible transaction types in zil.h as all of the TX_* definitions (which have brief useful comments). In-memory ZIL transactions aren't necessarily immediately flushed to disk, especially for things like simply doing a write() to a file. The reason that plain write()s to a file are (still) given ZIL transactions is that you may call fsync() on the file later. If you don't call fsync() and the regular ZFS transaction group commits with your write()s, those ZIL transactions will be quietly cleaned out of the in-memory ZIL log (along with all of the other now unneeded ZIL transactions).

(All of this assumes that your dataset doesn't have 'sync=disabled' set, which turns off the in-memory ZIL as one of its effects.)

When you perform an action such as fsync() or sync() that requests that in-memory ZFS state be made durable on disk, ZFS gathers up some or all of those in-memory ZIL transactions and writes them to disk in one go, as a sequence of log (write) blocks ('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL transaction records. This is called a ZIL commit. Depending on various factors, the flushed out data you write() may or may not be included in the log (write) blocks committed to the (dataset's) ZIL. Sometimes your file data will be written directly into its future permanent location in the pool's free space (which is safe) and the ZIL commit will have only a pointer to this location (its DVA).

(For a discussion of this, see the comments about the WR_* constants in zil.h. Also, while in memory, ZFS transactions are classified as either 'synchronous' or 'asynchronous'. Sync transactions are always part of a ZIL commit, but async transactions are only included as necessary. See zil_impl.h and also my entry discussing this.)

It's possible for several processes (or threads) to all call sync() or fsync() at once (well, before the first one finishes committing the ZIL). In this case, their requests can all be merged together into one ZIL commit that covers all of them. This means that fsync() and sync() calls don't necessarily match up one to one with ZIL commits. I believe it's also possible for a fsync() or sync() to not result in a ZIL commit if all of the relevant data has already been written out as part of a regular ZFS transaction group (or a previous request).

Because of all of this, there are various different ZIL related metrics that you may be interested in, sometimes with picky but important differences between them. For example, there is a difference between 'the number of bytes written to the ZIL' and 'the number of bytes written as part of ZIL commits', since the latter would include data written directly to its final space in the main pool. You might care about the latter when you're investigating the overall IO impact of ZIL commits but the former if you're looking at sizing a separate log device (a 'slog' in ZFS terminology).
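
On Linux, I believe some of these distinctions show up in the global ZIL kstats. My understanding is that the 'metaslab' fields there track what was actually written out as ZIL log blocks, split by whether the log blocks were allocated from the main pool or from a separate log device (the field names may vary between OpenZFS versions):

  # ZIL commit writes to the main pool versus a separate log device
  grep zil_itx_metaslab /proc/spl/kstat/zfs/zil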
