Fixing Dracut for Encrypted ZFS on Root on Ubuntu 25.10

I just upgraded from Ubuntu 25.04 to 25.10 … well, it was more of a reinstall really. Because I knew the new release switched the initrd-related tooling to Dracut, I tried to understand all the changes in a test installation in a VM. Still, I somehow broke Dracut’s ability to unlock my encrypted ZFS on root setup automatically.

Looking at journalctl, Dracut claimed it couldn’t find the key file:

dracut-pre-mount[940]: Warning: ZFS: Key /run/keystore/rpool/system.key for rpool/enc hasn't appeared. Trying anyway.
[...]
dracut-pre-mount[1001]: Key load error: Failed to open key material file: No such file or directory
[...]
systemd[1]: Mounting sysroot.mount - /sysroot...
mount[1007]: zfs_mount_at() failed: encryption key not loaded
systemd[1]: sysroot.mount: Mount process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: sysroot.mount: Failed with result 'exit-code'.
systemd[1]: Failed to mount sysroot.mount - /sysroot.
systemd[1]: Dependency failed for initrd-root-fs.target - Initrd Root File System.

All I could do was mount the keystore manually in the emergency console:

systemd-cryptsetup attach keystore-rpool /dev/zvol/rpool/keystore
mkdir -p /run/keystore/rpool
mount /dev/mapper/keystore-rpool /run/keystore/rpool

After pressing Ctrl-d, systemd continued booting as if everything was OK. This worked, but was HUGELY annoying, especially since the emergency console was also using an English keyboard mapping. 🤬

After I was done setting up my desktop I took the time to investigate the issue. I compared everything between my real system and the freshly set up VM. After comparing the system startup plots (exported with systemd-analyze plot > plot.svg) I noticed that the systemd-ask-password.service started quite late on my real system (namely after I had manually mounted the keystore). I knew there was a bug report about teaching Dracut Ubuntu’s ZFS on root encryption scheme (i.e. putting the root ZFS dataset’s encryption keys in a LUKS container on a Zvol (rpool/keystore)). So I looked at the actual patch and tried to walk through how it would behave on my system. There I noticed that the script assumes the ZFS encryption root to be the same as the Zpool’s root dataset (e.g. rpool). 😯 I moved away from that kind of setup years ago, as it makes restoring from a backup quite cumbersome. Instead I use a sub-dataset for the encrypted data (e.g. rpool/enc), which messed up the logic that assumed the encryption root to only contain the pool name. 🤦‍♂️
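
If you want to check whether your own layout is affected, something along these lines shows which dataset acts as the encryption root and where its key is expected (the pool name is just an example):

# List the encryption root and key location for every dataset in the pool.
# Substitute your own pool name for "rpool".
zfs get -r -t filesystem -Ho name,property,value encryptionroot,keylocation rpool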

Long story short, the following patch determines the pool name of the encryption root before trying to open and mount the LUKS keystore:

--- zfs-load-key.sh.orig        2025-10-16 20:44:47.955349974 +0200
+++ zfs-load-key.sh     2025-10-16 20:55:00.229000464 +0200
@@ -54,9 +54,11 @@
     [ "$(zfs get -Ho value keystatus "${ENCRYPTIONROOT}")" = "unavailable" ] || return 0

     KEYLOCATION="$(zfs get -Ho value keylocation "${ENCRYPTIONROOT}")"
+    # `ENCRYPTIONROOT` might not be the root dataset (e.g. `rpool/enc`)
+    ENCRYPTIONROOT_POOL="$(echo "${ENCRYPTIONROOT}" | cut -d/ -f1)"
     case "$KEYLOCATION" in
-        "file:///run/keystore/${ENCRYPTIONROOT}/"*)
-            _open_and_mount_luks_keystore "${ENCRYPTIONROOT}" "${KEYLOCATION#file://}"
+        "file:///run/keystore/${ENCRYPTIONROOT_POOL}/"*)
+            _open_and_mount_luks_keystore "${ENCRYPTIONROOT_POOL}" "${KEYLOCATION#file://}"
             ;;
     esac

🎉
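
For reference: on my system the patched script lives in Dracut’s ZFS module directory, and the initrd has to be regenerated afterwards. Roughly like this (both paths are assumptions; adjust them to wherever your distribution installs the module and wherever you saved the patch):

# Apply the patch on top of the installed script and rebuild the initrd.
sudo patch /usr/lib/dracut/modules.d/90zfs/zfs-load-key.sh zfs-load-key.patch
sudo dracut --force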

Running k3s on Incus

I know the pain of managing a bunch of services on my own. Even when relying on Incus, Podman and systemd as much as possible, held together by lots of Ansible duct tape, it’s still arduous. I convinced myself change was in order: … something something Kubernetes.

My main criteria are basically:

  • Must be able to run on a single node (for now), i.e. no clustered services or databases (k3s looks like it fits the bill)
  • Services must be able to be deployed with public service definitions (Helm FTW)
  • These service definitions must lend themselves to be version controlled
  • All relevant data directories must live on separate ZFS datasets (see the sketch below)
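
For the last point the idea is to put k3s’ data directory on a custom Incus volume backed by ZFS. A minimal sketch (the pool, volume and instance names are placeholders):

# Create a custom volume on a ZFS-backed Incus storage pool and attach it
# to the k3s instance as its data directory.
incus storage volume create default k3s-data
incus config device add k3s-vm k3s-data disk pool=default source=k3s-data path=/var/lib/rancher/k3s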

Running k3s in an Incus container

You can run k3s in an Incus container, but it gets difficult quickly. There are reports of people getting it to run, and there are even public LXD/LXC definitions for microk8s and k3s, but those are quite old (as of 2025-08, 3 and 6 years old respectively) and blast HUGE holes in the sandbox. ☹️ K3s “requires” access to /dev/kmsg, several places in /proc and /sys, as well as the ability to modprobe several kernel modules (it checks for access to them and spams the logs with warnings and errors). 😶

It looks doable in a technical sense, but it’s a huge pain having to go through Incus, without any of the (sandboxing/security) benefits. So the general wisdom is to just use a VM. (No, I didn’t try k3s’ experimental rootless mode)
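
For the record, the profiles those old definitions use boil down to something like the following sketch (illustrative values, not a recommendation), which is exactly the kind of sandbox-poking I wanted to avoid:

# Privileged, nested container with pre-loaded kernel modules: most of the
# isolation that makes containers attractive is gone at this point.
incus profile create k3s-unsafe
incus profile set k3s-unsafe security.privileged=true
incus profile set k3s-unsafe security.nesting=true
incus profile set k3s-unsafe linux.kernel_modules=br_netfilter,overlay,ip_tables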

Running k3s in an Incus VM

I started with a fresh VM and could reuse my now much simplified Ansible tasks for setting up k3s. But my happiness got cut short by the k3s service spamming the journal with useless

level=error msg="failed to ping connection: disk I/O error: no such device"

error messages. After removing all the directories and files from /var/lib/rancher/k3s and starting the server by hand I got:

Error: preparing server: failed to bootstrap cluster data: creating storage endpoint: failed to create driver for default endpoint: setup db: disk I/O error: no such device

Some more mucking around with the k3s server config revealed a puzzling, but more useful

failed to mount overlay: invalid argument.

Looking at what dmesg had to say I got:

overlayfs: upper fs does not support tmpfile.
overlayfs: failed to set xattr on upper
overlayfs: …falling back to redirect_dir=nofollow.
overlayfs: …falling back to uuid=null.
overlayfs: …falling back to xino=off.
overlayfs: try mounting with 'userxattr' option
overlayfs: upper fs missing required features.

Long story short: it turns out that in my eagerness I had mounted a custom Incus volume as k3s’ data directory (/var/lib/rancher/k3s). This being a VM (instead of a container), Incus mounted the volume using the virtiofs protocol. And it turns out overlayfs doesn’t like being put on top of virtiofs devices (or NFS, it seems). 😵‍💫 But good news: it was fixable, although hacky. By grepping for “virtiofsd” processes I found out that Incus vendors its own virtiofsd binary in /opt/incus/bin/virtiofsd. And it already runs it with the --posix-acl option, which implies the required --xattr option. But Incus currently doesn’t offer any way to configure virtiofsd. 😓 So the only solution (suggested by the main Incus maintainer, no less) is to replace /opt/incus/bin/virtiofsd with a shim script calling the real virtiofsd binary with the additional --modcaps=+sys_admin option. Basically something silly like:

#!/usr/bin/bash
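# Hand all arguments through to the renamed original binary, additionally
# granting CAP_SYS_ADMIN so virtiofsd can set the trusted.* xattrs overlayfs needs.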
exec /opt/incus/bin/virtiofsd.orig --modcaps=+sys_admin "$@"
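
Putting the shim in place then looks roughly like this (the shim file name and the instance name are placeholders, and keep in mind that an Incus package upgrade will most likely overwrite it again):

# Keep the original binary around under a new name and drop the shim in its place.
sudo mv /opt/incus/bin/virtiofsd /opt/incus/bin/virtiofsd.orig
sudo install -m 0755 virtiofsd-shim.sh /opt/incus/bin/virtiofsd
# virtiofsd is started together with the VM, so restart the instance.
incus restart k3s-vm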

Yeah also, “try mounting with ‘userxattr’ option” was not helpful and sent me down the wrong path. 🤐

All in all … all these stumbling blocks ate my weekend. Which was kind of in line with my prejudices against Kubernetes. 😅

Howto Restore ZFS Encryption Hierarchies

When backing up encrypted ZFS datasets you’ll see that ZFS breaks up the encryption hierarchy. The backed-up datasets will look like they’ve all been encrypted separately. You can still use the (same) original key to unlock all the datasets, but you’ll have to unlock them separately. 😐

This howto should help you bring them back together when you have to restore from a backup.

Let’s assume we’ve created a new, encrypted pool to restore the previous backup to (I’ll call it new_rpool). We send our data from the backup pool to new_rpool:

sudo zfs send -w -v backup/laptop/rpool/ROOT@zrepl_20210131_223653_000 | sudo zfs receive -v new_rpool/ROOT
sudo zfs send -w -v backup/laptop/rpool/ROOT/ubuntu@zrepl_20210402_113057_000 | sudo zfs receive -v new_rpool/ROOT/ubuntu
[...]

Note that we’re using zfs send -w which sends the encrypted blocks “as is” from the backup pool to new_rpool. This means that these datasets can only be decrypted with the key they were originally encrypted with.

Also note that you cannot restore an encrypted root/pool dataset onto another encrypted one: i.e. we can’t restore the contents/snapshots of rpool itself to new_rpool (at least not without decrypting them first on the sender, sending them unencrypted and re-encrypting them upon receive). Luckily for me that dataset is empty. 😎
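
If that dataset hadn’t been empty, a workaround would be a non-raw send that gets re-encrypted with new_rpool’s key on receive, e.g. into a child dataset. A sketch with a made-up snapshot and dataset name (it requires the key to be loaded on the sending side first):

# Non-raw send: the data is decrypted on the sender and re-encrypted on
# receive, because the new child dataset inherits new_rpool's encryption.
sudo zfs send -v backup/laptop/rpool@some_snapshot | sudo zfs receive -v new_rpool/rpool_restore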

Anyway … our new pool should look something like this now:

$ zfs list -o name,encryption,keystatus,keyformat,keylocation,encryptionroot -t filesystem,volume -r new_rpool
NAME                   ENCRYPTION   KEYSTATUS    KEYFORMAT   KEYLOCATION  ENCROOT
new_rpool              aes-256-gcm  available    passphrase  prompt       new_rpool
new_rpool/ROOT         aes-256-gcm  unavailable  raw         prompt       new_rpool/ROOT
new_rpool/ROOT/ubuntu  aes-256-gcm  unavailable  raw         prompt       new_rpool/ROOT/ubuntu
[...]

Note that each dataset is treated as if it were encrypted by itself (visible in the encryptionroot property). To restore our ability to unlock all datasets with a single key we’ll have to do some work.

First we have to unlock each of these datasets. We can do this with the zfs load-key command (my data was encrypted using a raw key in a file, hence the -L file:///...):

sudo zfs load-key -L file:///tmp/backup.key new_rpool/ROOT
sudo zfs load-key -L file:///tmp/backup.key new_rpool/ROOT/ubuntu
[...]

Although zfs load-key is supposed to have a -r option, it only accepts an alternate keylocation of prompt, so for me it fails with the following error message 🤨:

sudo zfs load-key -r -L file:///tmp/backup.key new_rpool/ROOT

alternate keylocation may only be 'prompt' with -r or -a
usage:
        load-key [-rn] [-L <keylocation>] <-a | filesystem|volume>

For the property list, run: zfs set|get

For the delegated permission list, run: zfs allow|unallow
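
So with more than a handful of datasets I’d just loop over them instead. A sketch (adjust the pool name and key path) that only touches datasets whose key isn’t loaded yet:

# Load the same raw key for every dataset under new_rpool that still reports
# an unavailable keystatus.
zfs list -Ho name -r new_rpool | while read -r ds; do
    [ "$(zfs get -Ho value keystatus "$ds")" = "unavailable" ] || continue
    sudo zfs load-key -L file:///tmp/backup.key "$ds"
done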

The keystatus should have changed to available now:

$ zfs list -o name,encryption,keystatus,keyformat,keylocation,encryptionroot -t filesystem,volume -r new_rpool
NAME                   ENCRYPTION   KEYSTATUS    KEYFORMAT   KEYLOCATION  ENCROOT
new_rpool              aes-256-gcm  available    passphrase  prompt       new_rpool
new_rpool/ROOT         aes-256-gcm  available    raw         prompt       new_rpool/ROOT
new_rpool/ROOT/ubuntu  aes-256-gcm  available    raw         prompt       new_rpool/ROOT/ubuntu
[...]

We can now change the encryption keys and hierarchy by inheriting them (similar to regular dataset properties):

sudo zfs change-key -l -i new_rpool/ROOT
sudo zfs change-key -l -i new_rpool/ROOT/ubuntu
[...]
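
Again, with many datasets a loop saves some typing. A sketch that re-parents every dataset below new_rpool that is still its own encryption root (parents come before children in zfs list’s output, so the order works out):

# Make every remaining self-contained encryption root inherit its key from
# its parent (and thus, transitively, from new_rpool).
zfs list -Ho name -r new_rpool | while read -r ds; do
    [ "$ds" = "new_rpool" ] && continue
    [ "$(zfs get -Ho value encryptionroot "$ds")" = "$ds" ] || continue
    sudo zfs change-key -l -i "$ds"
done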

When we list our encryption properties now we can see that all the datasets have the same encryptionroot. This means that unlocking it unlocks all the other datasets as well. 🎉

$ zfs list -o name,encryption,keystatus,keyformat,keylocation,encryptionroot -t filesystem,volume -r new_rpool
NAME                   ENCRYPTION   KEYSTATUS    KEYFORMAT   KEYLOCATION  ENCROOT
new_rpool              aes-256-gcm  available    passphrase  prompt       new_rpool
new_rpool/ROOT         aes-256-gcm  available    passphrase  none         new_rpool
new_rpool/ROOT/ubuntu  aes-256-gcm  available    passphrase  none         new_rpool
[...]

Restoring Dataset Properties

This howto doesn’t touch restoring dataset properties, because I’ve not been able to reliably back up dataset properties using the -p and -b options of zfs send. Therefore I make sure that I have a (manual) backup of the dataset properties with something like `zfs get all -s local > zfs_all_local_properties_$(date -Iminutes).txt`.

Moving LXD Containers From One Pool to Another

When I started playing with LXD I just accepted the default storage configuration which creates an image file and uses that to initialize a ZFS pool. Since I’m using ZFS as my main file system this seemed silly as LXD can use an existing dataset as a source for a storage pool. So I wanted to migrate my existing containers to the new storage pool.

Although others seemed to have the same problem there was no ready answer. Digging through the documentation I finally found out that the lxc move command has a -s option … I had an idea. Here’s what I came up with …

Preparation

First we create a dataset on the existing ZFS pool and add it to LXD as a new storage pool.

sudo zfs create -o mountpoint=none mypool/lxd
lxc storage create pool2 zfs source=mypool/lxd

lxc storage list should show something like this now:

+-------+-------------+--------+--------------------+---------+
| NAME  | DESCRIPTION | DRIVER |       SOURCE       | USED BY |
+-------+-------------+--------+--------------------+---------+
| pool1 |             | zfs    | /path/to/pool1.img | 2       |
+-------+-------------+--------+--------------------+---------+
| pool2 |             | zfs    | mypool/lxd         | 0       |
+-------+-------------+--------+--------------------+---------+

pool1 is the old pool backed by the image file and is used by some containers at the moment, as can be seen in the “Used By” column. pool2 has been added but is not used by any containers yet.

Moving

We now try to move our containers to pool2.

# move container to pool2
lxc move some_container some_container-moved -s=pool2
# rename container back for sanity ;)
lxc move some_container-moved some_container

We can check with lxc storage list whether we succeeded.

+-------+-------------+--------+--------------------+---------+
| NAME  | DESCRIPTION | DRIVER |       SOURCE       | USED BY |
+-------+-------------+--------+--------------------+---------+
| pool1 |             | zfs    | /path/to/pool1.img | 1       |
+-------+-------------+--------+--------------------+---------+
| pool2 |             | zfs    | mypool/lxd         | 1       |
+-------+-------------+--------+--------------------+---------+

Indeed pool2 is being used now. Just to be sure we check that zfs list -r mypool/lxd also reflects this.

NAME                                  USED  AVAIL  REFER  MOUNTPOINT
mypool/lxd/containers                 1,08G  92,9G    24K  none
mypool/lxd/containers/some_container  1,08G  92,9G   704M  /var/snap/lxd/common/lxd/storage-pools/pool2/containers/some_container
mypool/lxd/custom                       24K  92,9G    24K  none
mypool/lxd/deleted                      24K  92,9G    24K  none
mypool/lxd/images                       24K  92,9G    24K  none
mypool/lxd/snapshots                    24K  92,9G    24K  none

Awesome!

⚠ Note that this only moves the container, but not the image it was cloned from.

We can repeat this until all containers we care about are moved over to pool2.
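
For example, with a handful of containers something like the following would do (the container names are placeholders; depending on your LXD version the containers may need to be stopped for the storage move):

# Move each container to pool2 and rename it back afterwards.
for c in container1 container2 container3; do
    lxc move "$c" "${c}-moved" -s pool2
    lxc move "${c}-moved" "$c"
done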

Cleanup

To prevent new containers from using pool1 we have to edit the default profile.

# change devices.root.pool to pool2
lxc profile edit default
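
If you prefer a non-interactive way, setting the pool on the profile’s root disk device should achieve the same (assuming that device is called root, as it is in a stock default profile):

# Point the default profile's root disk at pool2.
lxc profile device set default root pool pool2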

Finally … when we’re happy with the migration and have verified that everything works as expected, we can remove pool1.

lxc storage rm pool1