Fixing Dracut for Encrypted ZFS on Root on Ubuntu 25.10

I just upgraded from Ubuntu 25.04 to 25.10 … well, it was more of a reinstall really. Because I knew the new release switched its initrd tooling to Dracut, I first tried to understand all the changes with a test installation in a VM. Even so, I somehow broke Dracut’s ability to unlock my encrypted ZFS on root setup automatically.

According to journalctl, Dracut couldn’t find the key file:

dracut-pre-mount[940]: Warning: ZFS: Key /run/keystore/rpool/system.key for rpool/enc hasn't appeared. Trying anyway.
[...]
dracut-pre-mount[1001]: Key load error: Failed to open key material file: No such file or directory
[...]
systemd[1]: Mounting sysroot.mount - /sysroot...
mount[1007]: zfs_mount_at() failed: encryption key not loaded
systemd[1]: sysroot.mount: Mount process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: sysroot.mount: Failed with result 'exit-code'.
systemd[1]: Failed to mount sysroot.mount - /sysroot.
systemd[1]: Dependency failed for initrd-root-fs.target - Initrd Root File System.

All I could do was mount the keystore manually in the emergency console:

systemd-cryptsetup attach keystore-rpool /dev/zvol/rpool/keystore
mkdir -p /run/keystore/rpool
mount /dev/mapper/keystore-rpool /run/keystore/rpool

After I pressed Ctrl-d, systemd continued booting as if everything was OK. This worked, but was HUGELY annoying, especially considering the emergency console was also using an English keyboard mapping. 🤬

After I was done setting up my desktop, I took the time to investigate the issue. I compared everything I could think of between my real system and the freshly set up VM. After comparing the system startup plots (exported with systemd-analyze plot > plot.svg), I noticed that systemd-ask-password.service started quite late on my real system (only after I had manually mounted the keystore).

I knew there was a bug report about teaching Dracut Ubuntu’s ZFS on root encryption scheme (i.e. putting the root ZFS dataset’s encryption keys in a LUKS container on a Zvol (rpool/keystore)). So I looked at the actual patch and tried to walk through how it would behave on my system. There I noticed that the script assumes the ZFS encryption root to be the same as the Zpool’s root dataset (e.g. rpool). 😯 I moved away from this kind of setup years ago, as it makes restoring from a backup quite cumbersome. Instead, I was using a sub-dataset as the encryption root (e.g. rpool/enc), which broke the logic because it expects the encryption root’s name to consist of the pool name only. 🤦‍♂️
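To make the mismatch concrete, here is a small sketch that replays the script's string comparison outside of Dracut. The function name keystore_pattern_matches is made up for illustration, and the key location is the one from the log above. Nothing in it calls zfs, it only compares strings the same way the case statement in zfs-load-key.sh does.

```shell
#!/bin/sh
# Replays the pattern match from the unpatched zfs-load-key.sh with plain
# strings. keystore_pattern_matches is a hypothetical helper, not part of Dracut.
keystore_pattern_matches() {
    encryption_root="$1"
    key_location="$2"
    case "$key_location" in
        "file:///run/keystore/${encryption_root}/"*) echo "match" ;;
        *) echo "no match" ;;
    esac
}

# Encryption root equals the pool's root dataset: the keystore gets opened.
keystore_pattern_matches "rpool" "file:///run/keystore/rpool/system.key"
# prints "match"

# Encryption root is a sub-dataset: the pattern expects .../rpool/enc/...
keystore_pattern_matches "rpool/enc" "file:///run/keystore/rpool/system.key"
# prints "no match"
```

In the second case the keystore is never opened, so the key file never appears under /run/keystore/rpool, which matches what the log complained about.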

Long story short, the following patch determines the pool name of the encryption root before trying to open and mount the LUKS keystore:

--- zfs-load-key.sh.orig        2025-10-16 20:44:47.955349974 +0200
+++ zfs-load-key.sh     2025-10-16 20:55:00.229000464 +0200
@@ -54,9 +54,11 @@
     [ "$(zfs get -Ho value keystatus "${ENCRYPTIONROOT}")" = "unavailable" ] || return 0

     KEYLOCATION="$(zfs get -Ho value keylocation "${ENCRYPTIONROOT}")"
+    # `ENCRYPTIONROOT` might not be the root dataset (e.g. `rpool/enc`)
+    ENCRYPTIONROOT_POOL="$(echo "${ENCRYPTIONROOT}" | cut -d/ -f1)"
     case "$KEYLOCATION" in
-        "file:///run/keystore/${ENCRYPTIONROOT}/"*)
-            _open_and_mount_luks_keystore "${ENCRYPTIONROOT}" "${KEYLOCATION#file://}"
+        "file:///run/keystore/${ENCRYPTIONROOT_POOL}/"*)
+            _open_and_mount_luks_keystore "${ENCRYPTIONROOT_POOL}" "${KEYLOCATION#file://}"
             ;;
     esac

🎉
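As a side note, the pool name can also be derived without spawning echo and cut, using POSIX parameter expansion. This is only an alternative sketch, not what my patch does, and pool_of is a name I made up for this example.

```shell
#!/bin/sh
# Derive the pool name from a dataset name with POSIX parameter expansion.
# pool_of is a hypothetical helper for illustration only.
pool_of() {
    # "%%/*" strips the longest suffix that starts with a "/",
    # i.e. everything from the first slash onwards.
    printf '%s\n' "${1%%/*}"
}

pool_of "rpool/enc"        # prints "rpool"
pool_of "rpool/enc/home"   # prints "rpool"
pool_of "rpool"            # prints "rpool" (no slash, nothing to strip)
```

Both variants give the same result for these inputs. I kept cut in the patch because it reads more obviously to me.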

Solved: Regular WiFi Disconnections

After upgrading to Ubuntu 21.10 I noticed that my WiFi was disconnecting semi-regularly and then trying to reconnect multiple times without success, until after roughly 1 minute it was connected again. I had never had this kind of issue before, even after switching to iwd several years back.

Trying to find out what was happening, I looked into my WiFi daemon’s logs with journalctl -eu iwd. Around the time a disconnect happened there were roughly 30 lines, each 1-2 seconds apart, all saying:

Received Deauthentication event, reason: 4, from_ap: false

There was no other information, warning or error. The intervals between these blocks are not always consistent. Mostly it’s 2 hours or 2 hours 15 minutes.
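To get a rough picture of when these bursts happen without scrolling through the journal, something like the following sketch can help. It assumes the log is read in journalctl's short-iso output format, where the first field is a timestamp like 2021-10-30T14:03:12+0200. The function name deauth_per_hour is my own invention.

```shell
#!/bin/sh
# Count iwd deauthentication events per hour.
# Reads journal lines in "short-iso" format from stdin; the first field is
# expected to be a timestamp such as 2021-10-30T14:03:12+0200.
# deauth_per_hour is a hypothetical helper for illustration.
deauth_per_hour() {
    grep -F 'Received Deauthentication event' |
        awk '{ count[substr($1, 1, 13)]++ }   # key = date plus hour
             END { for (hour in count) print hour, count[hour] }' |
        sort
}

# Usage on a live system (not run here):
#   journalctl -u iwd -o short-iso --since "2 days ago" | deauth_per_hour
```

Each output line is an hour followed by the number of deauthentication events in it, so the gaps between the blocks are easy to spot.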

NetworkManager’s logs only had this to say (also repeated multiple times):

device (wlan0): new IWD device state is disconnected
device (wlan0): state change: activated -> failed (reason 'supplicant-disconnect', sys-iface-state: 'managed')
manager: NetworkManager state is now DISCONNECTED
device (wlan0): Activation: failed for connection '<SSID>'
device (wlan0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
dhcp4 (wlan0): canceled DHCP transaction
dhcp4 (wlan0): state changed bound -> terminated
dhcp6 (wlan0): canceled DHCP transaction
dhcp6 (wlan0): state changed bound -> terminated
device (wlan0): new IWD device state is connecting
device (wlan0): Activation: starting connection '<SSID>' (<UUID>)
device (wlan0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
manager: NetworkManager state is now CONNECTING
device (wlan0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
device (wlan0): new IWD device state is connected
device (wlan0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
dhcp4 (wlan0): activation: beginning transaction (timeout in 45 seconds)
dhcp4 (wlan0): state changed unknown -> bound, address=<IP ADDRESS>
device (wlan0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
device (wlan0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
device (wlan0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
manager: NetworkManager state is now CONNECTED_LOCAL
manager: NetworkManager state is now CONNECTED_SITE
policy: set '<SSID>' (wlan0) as default for IPv4 routing and DNS
device (wlan0): Activation: successful, device activated.
manager: NetworkManager state is now CONNECTED_GLOBAL

I tried to search the internet for the error message from iwd, but nothing useful came up. Just forum posts with random suggestions like “turn off IPv6” and old kernel bug reports about the Intel wireless drivers exhibiting similar disconnections. 😞

I tried to look up what a modern setup + configuration should look like and tried to combine the guides from Ubuntu and Arch … no change.

The Gentoo Wiki and the iwd Wiki also suggest setting wifi.iwd.autoconnect=yes for NetworkManager. After applying this setting the issue seems to have gone away. 😀

Update 2021-10-31

The last “fix” only lasted for about 9 hours. 😕

I’m now suspecting something with power management 🤔 … I’m still investigating.

Update 2021-11-26

I have a configuration that has been working for a few weeks now. I arrived at my final system state by adding

[device]
wifi.backend=iwd
wifi.iwd.autoconnect=yes

to NetworkManager’s configuration, masking wpa_supplicant.service, commenting out the contents of /etc/NetworkManager/conf.d/default-wifi-powersave-on.conf, installing all recommended packages for TLP (i.e. also tlp-rdw) and finally masking GNOME 40’s new power-profiles-daemon.service, which seems to interfere with TLP.
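Written down as commands, those steps look roughly like the sketch below. I did these by hand, so treat it as an untested reconstruction: the run wrapper, the DRY_RUN switch and the function name apply_wifi_setup are made up for this example, and by default the script only prints what it would do. The [device] snippet above still has to be added to NetworkManager’s configuration separately.

```shell
#!/bin/sh
# Sketch of the steps above as commands. By default nothing is changed;
# set DRY_RUN=0 and run as root to actually apply them.
# run, DRY_RUN and apply_wifi_setup are hypothetical names for this example.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

apply_wifi_setup() {
    run systemctl mask wpa_supplicant.service
    run systemctl mask power-profiles-daemon.service
    run apt-get install --yes tlp tlp-rdw
    # Prefix every non-empty line that is not already a comment with "#".
    run sed -i 's/^\([^#]\)/#\1/' \
        /etc/NetworkManager/conf.d/default-wifi-powersave-on.conf
}

apply_wifi_setup
```

The dry-run default is deliberate, since masking the wrong service on a remote machine is a good way to lock yourself out.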

I’m not sure if this is the minimal set of changes necessary, but it works for me. 😀 At first the dis-/reconnection notifications still showed up, but e.g. SSH connections didn’t seem to drop, and after a while even the notifications stopped. The logs are also clean. I’m happy. 😀