Imagine you’re using dropbear-initrd to log in to a server during boot for unlocking the hard disk encryption and you’re greeted with the following error after a reboot:
root@server: Permission denied (publickey).
🤨😓😖 You start to sweat … this looks like extra work you didn’t need right now. You try to remember: were there any updates lately that could have messed up the initrd? … deep breath, let’s take it slowly.
First try to get SSH to spit out more details:
$ ssh -vvv server-boot
[...]
debug1: Next authentication method: publickey
debug1: Offering public key: /home/user/.ssh/... RSA SHA256:... explicit
debug1: send_pubkey_test: no mutual signature algorithm
[...]
That doesn’t seem right … this worked before. The server is running Ubuntu 20.04 LTS and I’ve just upgraded my work machine to Ubuntu 22.04 LTS. I know that Dropbear doesn’t support ed25519 keys (at least not in the version on the server), which is why I still use RSA keys for that. 🤔
Time to ask the Internet, but all the posts with a “no mutual signature algorithm” error message are years old … most of them circle around the SSH client having deprecated old key types (namely DSA keys). 😯
Can it be that RSA keys have also been deprecated? 😱 … I’ve recently upgraded my client machine 😶 … no way! … well, yes! That was exactly the problem.
Allowing RSA keys in the connection settings for that server allowed me to log in again 😎:
PubkeyAcceptedKeyTypes +ssh-rsa
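For completeness, this is roughly what the host entry in my ~/.ssh/config looks like now. The host name is the one from above; the address and key path are made-up placeholders, and on newer OpenSSH versions the option is also spelled PubkeyAcceptedAlgorithms:

Host server-boot
    HostName server.example.org
    User root
    IdentityFile ~/.ssh/id_rsa_server
    # re-allow RSA signatures for this host only
    PubkeyAcceptedKeyTypes +ssh-rsa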
But this whole detour unnecessarily wasted an hour of my life. 😓
Yesterday I did my first container-based PostgreSQL version upgrade. In my case the upgrade was from version 13 to 14. In hindsight I was quite naïve. 😅
I was always wondering why distros kept separate data directories for different versions … now I know: you can’t do in-place upgrades with PostgreSQL. You need separate data directories as well as both versions’ binaries. 😵 Distros have their mechanisms for it, but in the container world you’re kind of on your own.
Well, not really … it’s just different. I found there’s a project that specializes in exactly the tooling part of the upgrade. After a little trial and error (see below) it went quite smoothly.
Procedure
In the end it came down to the following steps:
Stop the old postgres container.
Backup the old data directory (yay ZFS snapshots; see the sketch after this list).
Create the new postgres container (with a new data directory; in my case via Ansible).
Stop the new postgres container.
Run the upgrade. (see command below)
Start the new postgres container.
Run vacuumdb as suggested at the end of the upgrade. (see command below)
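The backup step, for example, is a one-liner if the old data directory sits on its own dataset (the dataset and snapshot names here are made up):

sudo zfs snapshot tank/postgres-13@pre-14-upgrade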
The Upgrade Command
I used the tianon/postgres-upgrade container for the upgrade. Since my directory layout didn’t follow the “default” structure I had to mount each version’s data directory separately.
I set the POSTGRES_INITDB_ARGS to what I used when creating the new Postgres container’s data directory. This shouldn’t be necessary because we let the new Postgres container initialize the data directory. (see below) I left it in just to be safe. 🤷
I explicitly mounted something to the container’s /var/lib/postgresql directory in order to have access to the upgrade logs which are mentioned in error messages. (see below)
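Putting that together, the invocation looked roughly like this. The host paths are made-up examples; the in-container paths and the OLD-to-NEW image tag follow the tianon/postgres-upgrade README as far as I can tell:

docker run --rm \
  -e POSTGRES_INITDB_ARGS="--no-locale --encoding=UTF8" \
  -v /srv/postgres-13/data:/var/lib/postgresql/13/data \
  -v /srv/postgres-14/data:/var/lib/postgresql/14/data \
  -v /srv/postgres-upgrade-logs:/var/lib/postgresql \
  tianon/postgres-upgrade:13-to-14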
The Vacuumdb Command
Upgrading finishes with a suggestion like:
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
    /usr/lib/postgresql/14/bin/vacuumdb --all --analyze-in-stages
We can run the command in the new Postgres container:
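Assuming the new container is named postgres-14 and the superuser is postgres (both just my names for this sketch), something like:

docker exec -it postgres-14 vacuumdb -U postgres --all --analyze-in-stages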
When you’re not using the default directory structure there’re some pitfalls. Mounting the two versions’ data directories separately is easy enough … it says so in the README. It’s what it doesn’t say that makes it more difficult than necessary. 😞
Errors When Initializing the New Data Directory
The first error I encountered was that the new data directory got initialized with the default initdb options, whereas I had used a cargo-culted incantation that was incompatible (in my case --no-locale --encoding=UTF8). The upgrade failed with the following error:
lc_collate values for database “postgres” do not match: old “C”, new “en_US.utf8”
Making sure I created the new database container (with the correct initdb args) before running the migration fixed this.
Extra Mounts for the Upgrade
What really tripped me up was that when something failed it said to look into a specific log file which I couldn’t find. 🤨 I had to also mount something to the /var/lib/postgresql directory, which then had all the upgrade log files. 😔
This also solved another of my problems where the upgrade tool wanted to start an instance of the Postgres database, but failed because it couldn’t find a specific socket … which also happens to be located in the directory mentioned above.
Authentication Errors After Upgrade
After the upgrade I had a lot of authentication errors, although none of the passwords should have changed.
FATAL: password authentication failed for user “nextcloud”
After digging through the internet and comparing both the old and new data directories it looked like the password hashing method had changed from md5 to scram-sha-256 (in pg_hba.conf the line saying host all all all scram-sha-256). 😑 Just re-setting the passwords (i.e. setting the same passwords again) via ALTER ROLE foo WITH PASSWORD '...'; on all users fixed the issue. 🤐
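For the record, a sketch of how that looks from the host. The container and role names are just the ones from my sketches above and the error message, and the actual password stays elided:

# check which hashing method the new server uses
docker exec -it postgres-14 psql -U postgres -c "SHOW password_encryption;"
# setting the same password again re-hashes it with scram-sha-256
docker exec -it postgres-14 psql -U postgres -c "ALTER ROLE nextcloud WITH PASSWORD '...';"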
When I started playing with LXD I just accepted the default storage configuration which creates an image file and uses that to initialize a ZFS pool. Since I’m using ZFS as my main file system this seemed silly as LXD can use an existing dataset as a source for a storage pool. So I wanted to migrate my existing containers to the new storage pool.
Although others seemed to have the same problem, there was no ready answer. Digging through the documentation I finally found out that the lxc move command has a -s option … I had an idea. Here’s what I came up with …
Preparation
First we create the dataset on the existing ZFS pool and add it to LXC.
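Something along these lines, with tank/lxd standing in for whatever dataset name you prefer:

# create a dedicated dataset on the existing pool
sudo zfs create tank/lxd
# register it with LXD as a new storage pool backed by that dataset
lxc storage create pool2 zfs source=tank/lxd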
pool1 is the old pool backed by the image file and is currently used by some containers, as can be seen in the “Used By” column. pool2 is added but not used by any containers yet.
Moving
We now try to move our containers to pool2.
# move container to pool2
lxc move some_container some_container-moved -s=pool2
# rename container back for sanity ;)
lxc move some_container-moved some_container
We can check with lxc storage list whether we succeeded.
If you’re an Ansible user and encounter the following error:
unix_listener: "..." too long for Unix domain socket
you need to set the control_path option in your ansible.cfg file to tell SSH to use shorter path names for the control socket. You should have a look at the ssh_config(5) man page (under ControlPath) for the available substitution tokens.
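A minimal sketch of the relevant ansible.cfg section. The %%C hash form is just one way to get a short path and needs a reasonably recent OpenSSH; the % signs are doubled because of ini-style interpolation:

[ssh_connection]
# %(directory)s is Ansible's control path directory,
# %%C expands to a hash of host, port and user
control_path = %(directory)s/%%C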
In the company I work for we’re using RabbitMQ to offload non-time-critical processing of tasks. To be able to recover in case RabbitMQ goes down, our queues are durable and all our messages are marked as persistent. We generally have a very low number of messages in flight at any moment in time. There’s just one queue with a decent amount of them: the “failed messages” dump.
The Problem
It so happens that after a botched update to the most recent version of RabbitMQ (3.5.3 at the time) our admins had to nuke the server and install it from scratch. They had made a backup of RabbitMQ’s Mnesia database and I was tasked to recover the messages from it.
This is the story of how I did it.
Since our RabbitMQ was configured to persist all the messages this should be generally possible. Surely I wouldn’t be the first one to attempt this.
Looking through the Internet it seems there’s no way of exporting or importing a node’s configuration if it’s not running. I couldn’t find any documentation on how to import a Mnesia backup into a new node or extract data from it into a usable form.
The Idea
My idea was to set up a virtual machine (running Debian Wheezy) with RabbitMQ and then to somehow make it read/recover and run the broken server’s database.
In the following you’ll see these placeholders: $BROKEN_NODENAME (the broken server’s RabbitMQ node name), $BROKEN_HOST (the host name it contains) and $RABBITMQ_MNESIA_DIR (the node’s Mnesia directory).
My first try, simply copying the broken node’s Mnesia files to the VM’s $RABBITMQ_MNESIA_DIR, failed. The files contained node names that RabbitMQ tried to reach but that were unreachable from the VM.
Error description:
{could_not_start,rabbit,
{{failed_to_cluster_with,
['$BROKEN_NODENAME'],
"Mnesia could not connect to any nodes."},
{rabbit,start,[normal,[]]}}}
So I tried to be a little bit more picky on what I copied.
First I had to reset $RABBITMQ_MNESIA_DIR by deleting it and having RabbitMQ recreate it. (I needed to do this way too many times.)
sudo service rabbitmq-server stop
sudo rm -r "$RABBITMQ_MNESIA_DIR"
sudo service rabbitmq-server start
With RabbitMQ stopped, I tried to feed it the broken server’s data in piecemeal fashion. This time I only copied the rabbit_*.[DCD,DCL] files and restarted RabbitMQ.
RabbitMQ’s management interface lists all the queues, but it thinks the node they’re on is “down”
Looking at the web management interface, all the queues we were missing were there, but they were “down” and clicking on them told you:
The object you clicked on was not found; it may have been deleted on the server.
Copying any more data didn’t solve the issue. So this was a dead end.
2nd Try
So I thought: why shouldn’t the RabbitMQ in the VM pretend to be the exact same node as the one on the broken server?
So I created /etc/rabbitmq/rabbitmq-env.conf containing:
NODENAME=$BROKEN_NODENAME
I copied the backup to $RABBITMQ_MNESIA_DIR (now with the new node name) and fixed the permissions.
Now starting RabbitMQ failed with
ERROR: epmd error for host $BROKEN_HOST: nxdomain (non-existing domain)
I edited /etc/hosts to add $BROKEN_HOST to the list of names that resolve to 127.0.0.1.
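The entry looked something like this (with the placeholder standing in for the real host name):

127.0.0.1    localhost $BROKEN_HOST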
Now restarting RabbitMQ failed with yet another error.
Now what? Why don’t I try to give it the Mnesia files piece by piece again?
Reset $RABBITMQ_MNESIA_DIR
Stop RabbitMQ
Copy the rabbit_* files in again and fix their permissions
Start RabbitMQ
All our queues were back and all their configuration seemed OK as well. But we still didn’t have our messages back yet.
The queues have been restored, but they have no messages in them
Solution
So I tried to copy more and more files over from the backup, repeating the above steps. I finally reached my goal after copying rabbit_*, msg_store_*, queues and recovery.dets. After fixing their permissions and starting RabbitMQ, all the queues were restored with all the messages in them.
Queues and messages restored
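In shell terms the working combination boiled down to something like this ($BACKUP_DIR is just my placeholder for wherever the backed-up Mnesia directory lives):

sudo service rabbitmq-server stop
# copy queue metadata, the message stores and the recovery index from the backup
sudo cp -a "$BACKUP_DIR"/rabbit_* "$BACKUP_DIR"/msg_store_* \
           "$BACKUP_DIR"/queues "$BACKUP_DIR"/recovery.dets \
           "$RABBITMQ_MNESIA_DIR"/
sudo chown -R rabbitmq:rabbitmq "$RABBITMQ_MNESIA_DIR"
sudo service rabbitmq-server start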
Now I could use ordinary methods to extract all the messages. Dumping them all and examining them, they looked OK. Publishing the recovered messages to the new server, I was pretty euphoric.