Kubernetes Resource Requests Are a Massive Footgun

If you have Kubernetes workloads that configure Resource Requests on Pods or Containers, there’s a footgun “hidden” in a sentence in the documentation (kudos if you spot it immediately):

[…] The kubelet also reserves at least the request amount of that system resource specifically for that container to use. […]

This means Resource Requests actually reserve the requested amount of resources exclusively for that container. To emphasize: this is not a fairness measure that only kicks in when resources get tight! So, once Resource Requests are set you can’t “overprovision” your node/cluster … hell, a new pod won’t even be scheduled once the requests add up to the node’s capacity, even though the node is sitting idle. 😵😓
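A quick way to see how “full” a node already is in the scheduler’s eyes (regardless of actual usage) is to look at the summed-up requests; a minimal check, assuming you have kubectl access:

# node name is a placeholder; shows summed container requests/limits vs. the node's allocatable resources
kubectl describe node my-node | grep -A 8 "Allocated resources"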

By the time you find out why and have patched the offending resources you’ll be swearing up and down. 🤬

Oh … and wait till you see what the Internet has to say about Resource Limits. 😰

Configuring Custom Ingress Ports With Cilium

This is just a note for anyone looking for a solution to this problem.

While it’s extremely easy with Kubernetes’ newer Gateway API via listeners on Gateway resources, it seems Ingress resources were always meant to be used with (global?) default ports … mainly 80 and 443 for HTTP and HTTPS respectively. So every Ingress Controller seems to have its own “side-channel solution” that leverages some resource metadata to convey this information. For Cilium this happens to be the sparsely documented ingress.cilium.io/host-listener-port annotation.

So your Ingress definition should look something like this:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ...
  namespace: ...
  annotations:
    ingress.cilium.io/host-listener-port: "1234"
spec:
  ingressClassName: cilium
  rules:
  - http: ...

Running k3s on Incus

I know the pain of managing a bunch of services on my own. Even when relying on Incus, Podman and systemd as much as possible, held together by lots of Ansible duct tape: it’s still arduous. I convinced myself change was in order … something something Kubernetes.

My main criteria are basically:

  • Must be able to run on a single node (for now), i.e. no clustered services or databases. (k3s looks like it fits the bill)
  • Services must be able to be deployed with public service definitions (Helm FTW)
  • These service definitions must lend themselves to being version controlled
  • All relevant data directories must live on separate ZFS datasets

Running k3s in an Incus container

You can run k3s in an Incus container, but it’s an uphill battle. There are reports of people getting it to run, but the public LXD/LXC definitions for microk8s or k3s are quite old (as of 2025-08: 3 and 6 years old respectively) and blast HUGE holes in the sandbox. ☹️ K3s “requires” access to /dev/kmsg, several places in /proc and /sys, as well as modprobing several kernel modules (it checks for access to them and spams the logs with warnings and errors). 😶
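For reference, the kind of holes those old profiles punch looks roughly like this; a sketch only, with a hypothetical container name k3s (and exactly the sandbox-piercing I wanted to avoid):

# hypothetical container "k3s": give it near-host privileges so k3s can reach /proc, /sys and kernel modules
incus config set k3s security.privileged=true
incus config set k3s security.nesting=true
incus config set k3s linux.kernel_modules=br_netfilter,ip_tables,ip6_tables,nf_nat,overlay
# plus raw.lxc tweaks (e.g. lxc.mount.auto=proc:rw sys:rw, lxc.cap.drop=) via: incus config edit k3s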

It looks doable in a technical sense, but it’s a huge pain to jump through all those hoops in Incus without getting any of the (sandboxing/security) benefits in return. So the general wisdom is to just use a VM. (No, I didn’t try k3s’ experimental rootless mode.)
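For what it’s worth, creating the VM itself is the easy part; something like this (image alias and sizing are just what I’d pick, adjust to taste):

# image alias and resource limits are placeholders; --vm creates a VM instead of a container
incus launch images:debian/12 k3s --vm -c limits.cpu=4 -c limits.memory=8GiB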

Running k3s in an Incus VM

I started with a fresh VM and could reuse my now much simplified Ansible tasks for setting up k3s. But my happiness got cut short by the k3s service spamming the journal with useless

level=error msg="failed to ping connection: disk I/O error: no such device"

error messages. After removing all the directories and files from /var/lib/rancher/k3s and starting the server by hand I got:

Error: preparing server: failed to bootstrap cluster data: creating storage endpoint: failed to create driver for default endpoint: setup db: disk I/O error: no such device

Some more mucking around with the k3s server config revealed a puzzling, but more useful

failed to mount overlay: invalid argument.

Looking at what dmesg had to say I got:

overlayfs: upper fs does not support tmpfile.
overlayfs: failed to set xattr on upper
overlayfs: …falling back to redirect_dir=nofollow.
overlayfs: …falling back to uuid=null.
overlayfs: …falling back to xino=off.
overlayfs: try mounting with 'userxattr' option
overlayfs: upper fs missing required features.

Long story short: it turns out that in my eagerness I had mounted a custom Incus volume as k3s’ data directory (/var/lib/rancher/k3s). This being a VM (instead of a container), Incus mounts the volume using the virtiofs protocol. And it turns out overlayfs doesn’t like being put on top of virtiofs (or NFS, it seems). 😵‍💫

But good news: it was fixable, although hacky. By grepping for “virtiofsd” processes I found out that Incus vendors its own virtiofsd binary in /opt/incus/bin/virtiofsd. It already runs it with the --posix-acl option, which implies the required --xattr option. But Incus currently doesn’t offer any way to configure virtiofsd. 😓 So the only solution (suggested by the main Incus maintainer, no less) is to replace /opt/incus/bin/virtiofsd with a shim script that calls the real virtiofsd binary with the additional --modcaps=+sys_admin option. Basically something silly like:

#!/usr/bin/bash
# call the renamed original binary with the extra capability needed for trusted.* xattrs
exec /opt/incus/bin/virtiofsd.orig --modcaps=+sys_admin "$@"
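After swapping in the shim and restarting the VM (so virtiofsd gets relaunched), a quick sanity check inside the guest is to try setting a trusted.* xattr on the mounted volume yourself; a sketch, assuming the attr tools are installed and using my mount path:

# path is from my setup; overlayfs needs trusted.* xattrs on its upper dir, and this fails without the shim
touch /var/lib/rancher/k3s/xattr-test
setfattr -n trusted.test -v 1 /var/lib/rancher/k3s/xattr-test && echo "trusted xattrs work"
rm /var/lib/rancher/k3s/xattr-test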

Yeah also, “try mounting with ‘userxattr’ option” was not helpful and sent me down the wrong path. 🤐

All in all … all these stumbling blocks ate my weekend. Which was kind of in line with my prejudices against Kubernetes. 😅

Running Circles Around Detecting Containers

Recently my monitoring service warned me that my Raspberry Pi was not syncing its time any more. I logged into the device and tried restarting systemd-timesyncd.service, which failed.

The error it presented was:

ConditionVirtualization=!container was not met

I was confused. Sure, I was running containers on this device, but this service was running on the host! 😯

I checked the service definition and it indeed had this condition. Then I tried to look up the docs for the ConditionVirtualization setting and found out systemd has a helper command, systemd-detect-virt, that can be used to find out whether it is being run inside a container/VM/etc.

To my surprise, running systemd-detect-virt determined it was being run inside a Podman container, although it was running on the host. I was totally confused. Does it detect any container on the system, or only being run inside one? 😵‍💫

I tried to dig deeper, but the docs only tell you which known container/VM solutions can be detected, not how the detection works. So I searched the code of systemd-detect-virt for indications of how it detects Podman containers … and I found it: it looks for the existence of a file at /run/.containerenv. 😯
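If you want to retrace that on your own machine, the checks boil down to something like:

# what systemd thinks it is running in ("none" on a healthy bare-metal host)
systemd-detect-virt --container
# the marker file systemd-detect-virt uses to decide on "podman"
ls -l /run/.containerenv
# the unit condition that kept timesyncd from starting
systemctl cat systemd-timesyncd.service | grep -i condition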

Checking whether this file existed on the host I found out: it did!!! 😵 How could this be? I checked another device running Podman and the file wasn’t there!?! 😵‍💫 … Then it dawned on me. I was running cAdvisor on the Raspberry Pi, which wants /var/run mounted inside its container. /var/run is just a symlink to /run, and regardless of me mounting it read-only, Podman creates the /run/.containerenv file there, i.e. on the host!!! 🤯

I looked into /run/.containerenv and found out it was empty, so I removed it and could finally restart systemd-timesyncd.service. The /run/.containerenv file is recreated on every restart of the container, but at least I know what to look for. 😩

Update 2025-03-23: I’ve created a bug report: https://github.com/containers/podman/issues/25655

Dropbear vs SSH woes between Ubuntu LTSes

Imagine you’re using dropbear-initrd to log in to a server during boot to unlock the hard disk encryption, and you’re greeted with the following error after a reboot:

root@server: Permission denied (publickey).

🤨😓😖 You start to sweat … this looks like extra work you didn’t need right now. You try to remember: were there any updates lately that could have messed up the initrd? … deep breath, let’s take it slowly.

First try to get SSH to spit out more details:

$ ssh -vvv server-boot
[...]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/user/.ssh/... RSA SHA256:... explicit
debug1: send_pubkey_test: no mutual signature algorithm
[...]

That doesn’t seem right … this worked before. The server is running Ubuntu 20.04 LTS and I’ve just upgraded my work machine to Ubuntu 22.04 LTS. I know that Dropbear doesn’t support ed25519 keys (at least not the version on the server), which is why I still use RSA keys for it. 🤔

Time to ask the Internet. All the posts with a “no mutual signature algorithm” error message are years old … but most of them circle around the SSH client having deprecated old key types (namely DSA keys). 😯

Can it be that RSA keys have also been deprecated? 😱 … I’ve recently upgraded my client machine 😶 … no way! … well, yes! That was exactly the problem: newer OpenSSH clients (like the one in Ubuntu 22.04) disable the old ssh-rsa (SHA-1) signature algorithm by default, and the old Dropbear on the server apparently doesn’t speak the newer rsa-sha2 variants.

Allowing the old ssh-rsa algorithm in the connection settings for that server let me log in again 😎:

PubkeyAcceptedKeyTypes +ssh-rsa
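In ~/.ssh/config that ends up looking something like this (HostName and Port are placeholders for my setup):

# HostName and Port below are placeholders for the dropbear-initrd endpoint
Host server-boot
    HostName server.example.org
    Port 2222
    User root
    PubkeyAcceptedKeyTypes +ssh-rsa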

But this whole detour unnecessarily wasted an hour of my life. 😓

My First Container-based Postgres Upgrade

Yesterday I did my first container-based PostgreSQL version upgrade. In my case the upgrade was from version 13 to 14. In hindsight I was quite naïve. 😅

I was always wondering why distros kept separate data directories for different versions … now I know: you can’t do in-place upgrades with PostgreSQL. You need separate data directories as well as both versions’ binaries. 😵 Distros have their mechanisms for this, but in the container world you’re kind of on your own.

Well, not really … it’s just different. I found there’s a project that specializes in exactly the tooling part of the upgrade. After a little trial and error (see below) it went quite smoothly.

Procedure

In the end it came down to the following steps:

  1. Stop the old postgres container.
  2. Backup the old data directory (yay ZFS snapshots). (see command below)
  3. Create the new postgres container (with a new data directory; in my case via Ansible)
  4. Stop the new postgres container.
  5. Run the upgrade. (see command below)
  6. Start the new postgres container.
  7. Run vacuumdb as suggested at the end of the upgrade. (see command below)
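For step 2, the backup is essentially a one-liner; the dataset name is an assumption based on my layout:

# dataset name assumed from the /tank/containers/postgres-13 mount path used below
zfs snapshot tank/containers/postgres-13@pre-upgrade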

The Upgrade Command

I used the tianon/postgres-upgrade container for the upgrade. Since my directory layout didn’t follow the “default” structure I had to mount each version’s data directory separately.

docker run --rm \
-e POSTGRES_INITDB_ARGS="--no-locale --encoding=UTF8" \
-v /tmp/pg_upgrade:/var/lib/postgresql \
-v /tank/containers/postgres-13:/var/lib/postgresql/13/data \
-v /tank/containers/postgres-14:/var/lib/postgresql/14/data \
tianon/postgres-upgrade:13-to-14

I set the POSTGRES_INITDB_ARGS to what I used when creating the new Postgres container’s data directory. This shouldn’t be necessary because we let the new Postgres container initialize the data directory. (see below) I left it in just to be safe. 🤷

I explicitly mounted something to the container’s /var/lib/postgresql directory in order to have access to the upgrade logs which are mentioned in error messages. (see below)

The Vacuumdb Command

Upgrading finishes with a suggestion like:

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
/usr/lib/postgresql/14/bin/vacuumdb --all --analyze-in-stages

We can run the command in the new Postgres container:

docker exec postgres vacuumdb -U postgres --all --analyze-in-stages

We use the postgres user, because we didn’t specify a POSTGRES_USER when creating the database container.

Pitfalls

When you’re not using the default directory structure there are some pitfalls. Mounting the two versions’ data directories separately is easy enough … it says so in the README. It’s what it doesn’t say that makes it more difficult than necessary. 😞

Errors When Initializing the New Data Directory

The first error I encountered was that the upgrade tool initialized the new data directory with the default initdb options, whereas I had used a cargo-culted incantation that was incompatible with them (in my case --no-locale --encoding=UTF8). The upgrade failed with the following error:

lc_collate values for database "postgres" do not match: old "C", new "en_US.utf8"

Making sure to create the new database container (with the correct initdb args) before running the migration fixed this.

Extra Mounts for the Upgrade

What really tripped me up was that when something failed, it said to look into a specific log file which I couldn’t find. 🤨 I also had to mount something to the container’s /var/lib/postgresql directory, which then held all the upgrade log files. 😔

This also solved another of my problems where the upgrade tool wanted to start an instance of the Postgres database, but failed because it couldn’t find a specific socket … which also happens to be located in the directory mentioned above.

Authentication Errors After Upgrade

After the upgrade I had a lot of authentication errors, although none of the passwords had changed.

FATAL: password authentication failed for user "nextcloud"

After digging through the Internet and comparing the old and new data directories it looked like the default password hashing method had changed from md5 to scram-sha-256 (in pg_hba.conf the line saying host all all all scram-sha-256). 😑 Just re-setting the passwords (i.e. setting the same passwords again) via ALTER ROLE foo WITH PASSWORD '...'; for all affected users fixed the issue. 🤐
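In the container setup from above that boils down to something like this per role (role name and password are placeholders; postgres is the superuser because no POSTGRES_USER was set):

# role name and password are placeholders; re-hashes the unchanged password with scram-sha-256
docker exec postgres psql -U postgres -c "ALTER ROLE nextcloud WITH PASSWORD 'the-existing-password';"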

Moving LXD Containers From One Pool to Another

When I started playing with LXD I just accepted the default storage configuration which creates an image file and uses that to initialize a ZFS pool. Since I’m using ZFS as my main file system this seemed silly as LXD can use an existing dataset as a source for a storage pool. So I wanted to migrate my existing containers to the new storage pool.

Although others seemed to have the same problem, there was no ready answer. Digging through the documentation I finally found out that the lxc move command has a -s option … I had an idea. Here’s what I came up with …

Preparation

First we create the dataset on the existing ZFS pool and add it to LXD as a new storage pool.

sudo zfs create -o mountpoint=none mypool/lxd
lxc storage create pool2 zfs source=mypool/lxd

lxc storage list should show something like this now:

+-------+-------------+--------+--------------------+---------+
| NAME  | DESCRIPTION | DRIVER |       SOURCE       | USED BY |
+-------+-------------+--------+--------------------+---------+
| pool1 |             | zfs    | /path/to/pool1.img | 2       |
+-------+-------------+--------+--------------------+---------+
| pool2 |             | zfs    | mypool/lxd         | 0       |
+-------+-------------+--------+--------------------+---------+

pool1 is the old pool backed by the image file and is currently used by some containers, as can be seen in the “Used By” column. pool2 has been added but is not used by any containers yet.

Moving

We now try to move our containers to pool2.

# move container to pool2
lxc move some_container some_container-moved -s=pool2
# rename container back for sanity ;)
lxc move some_container-moved some_container

We can check with lxc storage list whether we succeeded.

+-------+-------------+--------+--------------------+---------+
| NAME  | DESCRIPTION | DRIVER |       SOURCE       | USED BY |
+-------+-------------+--------+--------------------+---------+
| pool1 |             | zfs    | /path/to/pool1.img | 1       |
+-------+-------------+--------+--------------------+---------+
| pool2 |             | zfs    | mypool/lxd         | 1       |
+-------+-------------+--------+--------------------+---------+

Indeed pool2 is being used now. Just to be sure we check that zfs list -r mypool/lxd also reflects this.

NAME                                  USED  AVAIL  REFER  MOUNTPOINT
mypool/lxd/containers                 1,08G  92,9G    24K  none
mypool/lxd/containers/some_container  1,08G  92,9G   704M  /var/snap/lxd/common/lxd/storage-pools/pool2/containers/some_container
mypool/lxd/custom                       24K  92,9G    24K  none
mypool/lxd/deleted                      24K  92,9G    24K  none
mypool/lxd/images                       24K  92,9G    24K  none
mypool/lxd/snapshots                    24K  92,9G    24K  none

Awesome!

⚠ Note that this only moves the container, but not the image it was cloned from.

We can repeat this until all containers we care about are moved over to pool2.

Cleanup

To prevent new containers from using pool1 we have to edit the default profile.

# change devices.root.pool to pool2
lxc profile edit default
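You can double-check the result afterwards; the root device’s pool key should now read pool2:

# show the default profile, including its root disk device
lxc profile show default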

Finally … when we’re happy with the migration and have verified that everything works as expected, we can remove pool1.

lxc storage rm pool1


CFSSL FTW

After reading how CloudFlare handles their PKI and that Let’s Encrypt will use it, I wanted to give CFSSL a shot.

Reading the project’s documentation doesn’t really help in building your own CA, but searching the Internet I found Fernando Barillas’ blog explaining how to create your own root certificate and how to create intermediate certificates from this.

I took it a step further and wrote a script that generates new certificates for several services, with different intermediates and possibly different configurations (e.g. depending on your distro and services, certain ciphers (e.g. ones using ECC) may not be supported).
I also streamlined generating service-specific key, cert and chain files. 😀
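For context, the general shape of it with CFSSL looks roughly like this; the file names, profile names and chain layout are from my setup, not something the tool dictates:

# file and profile names are placeholders; create the root CA from a CSR definition (JSON)
cfssl gencert -initca root-ca-csr.json | cfssljson -bare root-ca
# sign an intermediate CA with the root
cfssl gencert -ca root-ca.pem -ca-key root-ca-key.pem -config ca-config.json \
  -profile intermediate intermediate-ca-csr.json | cfssljson -bare intermediate-ca
# issue a service certificate from the intermediate
cfssl gencert -ca intermediate-ca.pem -ca-key intermediate-ca-key.pem -config ca-config.json \
  -profile server service-csr.json | cfssljson -bare service
# assemble the chain file the service will present
cat service.pem intermediate-ca.pem > service-chain.pem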

Have a look at the full Gist; the most interesting part is the renew-certs.sh script.

You’ll still have to deploy them yourself.

Update 2016-10-04:
Fixed some issues with this Gist.

  • Fixed a bug where intermediate CA certificates weren’t marked as CAs any more
  • Updated the example CSRs and the script so it can now be run without errors

Update 2017-10-08:

  • Cleaned up `renew-certs.sh` by extracting functions for generating root CA, intermediate CA and service keys.

too long for Unix domain socket

If you’re an Ansible user and encounter the following error:

unix_listener: "..." too long for Unix domain socket

you need to set the control_path option in your ansible.cfg file to tell SSH to use shorter path names for the control socket. You should have a look at the ssh_config(5) man page (under ControlPath) for a list of possible substitutions.

I chose:

control_path = %(directory)s/ssh-%%C
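For completeness, that option lives in the [ssh_connection] section of ansible.cfg; the doubled % gets unescaped to %C, SSH’s hash of local host, remote host, port and user:

# ansible.cfg
[ssh_connection]
control_path = %(directory)s/ssh-%%C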